What is nonblocking progress? Consider the simple example of incrementing a counter C shared among multiple threads. One way to do so is to protect the steps of incrementing C with a mutual exclusion lock L (i.e., acquire(L); old := C; C := old+1; release(L);). If a thread P is holding L, then a different thread Q must wait for P to release L before Q can proceed to operate on C. That is, Q is blocked by P.
Now consider an implementation of the increment operation using the atomic CAS (compare-and-swap) instruction:
do { old := C; } until CAS(&C, old, old+1);
This implementation is nonblocking because no thread can be blocked by the inaction of other threads (caused by, for example, preemption, a page fault, or even termination), regardless of where the other threads stop.
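For concreteness, here is a minimal sketch of both versions in C11; the names and the use of a POSIX mutex are illustrative, not from the original.

    #include <pthread.h>
    #include <stdatomic.h>

    static unsigned long counter_blocking;
    static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;

    /* Blocking version: a thread holding L blocks all others. */
    void increment_blocking(void) {
        pthread_mutex_lock(&L);
        counter_blocking = counter_blocking + 1;
        pthread_mutex_unlock(&L);
    }

    static atomic_ulong counter_nonblocking;

    /* Nonblocking (lock-free) version: if this thread's CAS fails, some
       other thread's CAS must have succeeded, so the system as a whole
       always makes progress. */
    void increment_nonblocking(void) {
        unsigned long old = atomic_load(&counter_nonblocking);
        while (!atomic_compare_exchange_weak(&counter_nonblocking, &old, old + 1))
            ;  /* a failed CAS reloads old with the current value; retry */
    }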
There are three established levels of nonblocking progress: The strongest is wait-freedom, which guarantees that every thread completes its operations in a bounded number of its own steps, regardless of the actions or inaction of other threads. The next level is lock-freedom, which guarantees that whenever threads are executing operations, at least one thread completes its operation in a bounded number of steps; individual threads may starve, but systemwide progress is guaranteed. Finally, obstruction-freedom, the weakest of the three levels, guarantees that a thread completes its operation in a bounded number of its own steps if it executes in isolation, that is, without encountering contention from concurrent operations.9
It is worth noting that these levels of progress guarantees require that primitive memory operations (e.g., read, write, CAS) be themselves wait-free, which is the case on mainstream architectures.
Nonblocking operations are often used for systems or interthread interactions where having threads wait for actions by other threads makes the system vulnerable to deadlock, livelock, and/or prolonged delays. Examples of such uses are:
•Async signal safety. This allows signal handlers of asynchronous signals to share data or services with the interrupted thread, with a guarantee that the signal handler never needs to wait for actions by the interrupted thread.
•Tolerance to process failure. With nonblocking operations on memory shared among processes, if a process is killed at any point while operating on the shared system, the remaining processes never have to wait for actions by the terminated process.
•Soft realtime applications on commodity systems. Using nonblocking operations on such systems (e.g., media players) helps avoid priority inversion and provides preemption tolerance. That is, it eliminates the possibility that an active high-priority thread is left waiting for actions by a preempted low-priority thread.
Unlike the example of the shared counter, almost any nonblocking operation that involves more than one location presents multiple considerations that are often at odds. This article examines these considerations, the interactions among them, and the compromises they entail.
The goals of this article are to walk the reader through the various issues and features of nonblocking operations and the various choices to be made; to clarify the interactions among these options; and to develop a sense of the best path for disentangling the interdependencies among these choices and quickly pruning options that are incompatible with the main objectives of using nonblocking operations in the first place.
For some uses of nonblocking operations, such as async signal safety, blocking alternatives are simply incorrect, not merely slower.
Figure 1 presents a variant of the classic nonblocking LIFO list, with Push and Pop operations implemented using CAS.
Depending on the context, reads and CAS operations on Header are either single- or double-width (i.e., the pointer and the packed integer either fit together in one machine word or require double-width atomic operations).
Ignoring for now why an integer is packed with the pointer in Header (explained later), the reader can notice that the inaction of any number of threads stopped anywhere in the Push and Pop operations cannot block other threads from operating on the list. In fact, since Push and Pop are not only nonblocking but also lock-free, whenever threads are executing these operations, some thread is guaranteed to complete its operation in a bounded number of steps.
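Since Figure 1 itself is not reproduced here, the following is a hedged C11 sketch of the structure it describes. The names are illustrative, and lock-free double-width atomic operations on Header are assumed to be available (on some platforms the generic C11 atomics fall back to locks for 16-byte types); the comments mark the lines of Pop referenced later in this article.

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct Node {
        struct Node *next;
        /* ... item payload ... */
    } Node;

    /* The pointer is packed with an integer; the reason is explained in
       the discussion of the ABA problem below. */
    typedef struct {
        Node     *ptr;
        uintptr_t count;
    } Header;

    static _Atomic Header Top;

    void Push(Node *n) {
        Header old, new;
        do {
            old = atomic_load(&Top);
            n->next = old.ptr;
            new.ptr = n;
            new.count = old.count + 1;
        } while (!atomic_compare_exchange_weak(&Top, &old, new));
    }

    Node *Pop(void) {
        Header old, new;
        do {
            old = atomic_load(&Top);                /* line 1: read Header */
            if (old.ptr == NULL)
                return NULL;                        /* empty list */
            new.ptr = old.ptr->next;                /* line 2: read next */
            new.count = old.count + 1;
        } while (!atomic_compare_exchange_weak(&Top, &old, new)); /* line 3 */
        return old.ptr;
    }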
The following are the key issues to consider in selecting the features of nonblocking operations.
The appropriate level of nonblocking progress (e.g., wait-free, lock-free, or obstruction-free) depends on the purpose of using nonblocking operations in the first place.
To achieve async signal safety using nonblocking operations, only the signal handler's operations need to be nonblocking, while operations by the main threads can be blocking. For example, in a case where signal handlers may search hash tables that may be updated by threads that may be interrupted, a fully nonblocking algorithm is not necessary, whereas an algorithm with nonblocking search operations and blocking add and remove operations (as described by Steve Heller et al.6) can be sufficient.
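As a concrete illustration of this asymmetric pattern, the following is a minimal sketch (in the spirit of, but not reproducing, the Heller et al. algorithm) of a sorted list-based set in which searches are nonblocking while updates serialize on a lock. All names are illustrative, and safe reclamation of removed nodes is deliberately left out (see the memory reclamation discussion below).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct SetNode {
        unsigned long key;
        struct SetNode *_Atomic next;
    } SetNode;

    static SetNode *_Atomic set_head;   /* sorted singly linked list */
    static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Nonblocking search: performs only atomic loads and never takes the
       lock, so it is safe in a signal handler even if the interrupted
       thread holds set_lock. */
    bool set_contains(unsigned long key) {
        SetNode *cur = atomic_load(&set_head);
        while (cur != NULL && cur->key < key)
            cur = atomic_load(&cur->next);
        return cur != NULL && cur->key == key;
    }

    /* Blocking update: serialized by set_lock; the single atomic store
       that links the new node keeps the list consistent for searches. */
    bool set_add(unsigned long key) {
        SetNode *n = malloc(sizeof *n);
        if (n == NULL) return false;
        n->key = key;
        pthread_mutex_lock(&set_lock);
        SetNode *_Atomic *prev = &set_head;
        SetNode *cur = atomic_load(prev);
        while (cur != NULL && cur->key < key) {
            prev = &cur->next;
            cur = atomic_load(prev);
        }
        if (cur != NULL && cur->key == key) {   /* already present */
            pthread_mutex_unlock(&set_lock);
            free(n);
            return false;
        }
        atomic_init(&n->next, cur);
        atomic_store(prev, n);                  /* publish the new node */
        pthread_mutex_unlock(&set_lock);
        return true;
    }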
If the only concern is the interaction between the signal handler and the interrupted thread, then using obstruction-free operations in the handler can suffice: the interrupted thread takes no steps while the handler runs in its context, so the handler effectively executes in isolation with respect to it.
For avoiding priority inversion in soft realtime applications, it may suffice to make the high-priority operations nonblocking while allowing low-priority operations to block.
Accordingly, in choosing the appropriate level of nonblocking progress to use, you must take into account what is absolutely needed to achieve the primary requirement (e.g., async signal safety or preemption tolerance), without demanding stronger guarantees that add complexity and overhead.
If the number of threads that can concurrently execute certain operations is limited, then strong progress guarantees can be achieved by using simpler algorithms than when higher levels of concurrency are allowed (e.g., single-producer and/or single-consumer structures admit much simpler wait-free implementations than their fully concurrent counterparts; an example appears in the discussion of hardware support below).
An issue that is sometimes missed is the effect of read operations on update operations. Consider two buffer implementations. Both support check (if empty) and add operations. In both implementations, operations are lock-free. In the first, check operations never interfere with the progress of add operations. In the second, a stream of check operations can cause concurrent add operations to keep retrying indefinitely.
You can see how problematic the latter implementation is when the act of checking if the buffer is not empty can prevent items from ever being added to the buffer. Therefore, in selecting the appropriate level of progress for a nonblocking operation, it is not enough just to require the implementation, for example, to be lock-free; it also matters which operations are allowed to interfere with the progress of which other operations.
The choice of data structures is one of the most important decisions in designing a nonblocking environment. This step is often overlooked because designers tend to assume that a nonblocking system should use the same data structures as an equivalent blocking or sequential design.
For example, if strict LIFO or FIFO ordering is not an absolute requirement, then a structure with relaxed ordering may admit a much simpler and more scalable nonblocking implementation.
Another example is that a search structure that may be a good choice in sequential code (e.g., balanced trees) may not be the best choice in nonblocking systems compared with hash structures.
Also, static structures such as hash tables with open addressing can be simpler to manage than dynamic structures such as hash tables with chaining.
Nonblocking operations, by definition, do not wait for actions by other nonblocking operations and cannot expect to prevent other nonblocking operations from taking actions. This poses a problem for nonblocking operations that dereference pointers to dynamic structures that could be removed and freed by other operations. Without proper precautions a nonblocking operation may be about to access a dynamic block when the block gets removed from the structure and freed by another operation. This could lead to problematic outcomes such as an access violation if the block were freed back to the operating system, corrupting a different structure that happens to allocate and use the memory of the freed block, or returning an incorrect result.
Using the LIFO list in Figure 1 as an example, consider a thread P that reads the value A from Header in line 1 of Pop and is then delayed before executing line 2. Meanwhile, other threads pop the block at address A and free its memory. When P resumes and executes line 2, it dereferences a pointer into freed memory.
Paul McKenney13 offers a detailed discussion of safe memory reclamation issues and solutions. The following is a brief list of categories of safe memory reclamation methods:
•Automatic GC (garbage collection). On systems with GC, such as Java applications, memory safety is implicitly guaranteed. In the preceding example, as long as thread P holds a reference to block A, block A is guaranteed not to be freed. Of course, this raises the question of whether GC itself is nonblocking.
•RCU (read-copy-update) and related epoch-based methods. Reader threads mark the start and end of regions in which they may hold references to shared blocks. A removed block is freed only after the completion of all reader regions that might have started before its removal.14
•Reference counting. Blocks are associated with counters that are incremented and decremented as threads acquire and release references to such blocks. Typically, a block is freed only when its reference count goes to zero and it is guaranteed that no new references to it can be created.20
•Hazard pointers. Briefly, in this method each thread that may traverse blocks that may be removed and freed owns a number of hazard pointers. Before dereferencing a pointer, a thread sets one of its hazard pointers to the pointer value. Other threads that may remove the block guarantee that they will not free the block until no hazard pointers point to it.8,16
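Building on the LIFO sketch above, here is a hedged illustration of the hazard pointer idea applied to the dereference in line 2 of Pop. The slot-per-thread arrangement, MAX_THREADS, and the names are illustrative, and the retirement and scanning logic that actually frees nodes is omitted (see the cited papers).

    #define MAX_THREADS 64

    /* One hazard pointer slot per thread (the full method allows several). */
    static Node *_Atomic HP[MAX_THREADS];

    Node *Pop_hp(int tid) {
        Header old, new;
        for (;;) {
            old = atomic_load(&Top);                /* line 1 */
            if (old.ptr == NULL)
                return NULL;
            atomic_store(&HP[tid], old.ptr);        /* announce the hazard */
            /* Re-validate: if Top is unchanged, old.ptr had not been
               retired before the hazard pointer became visible, so
               dereferencing it is safe. */
            Header cur = atomic_load(&Top);
            if (cur.ptr != old.ptr || cur.count != old.count)
                continue;
            new.ptr = old.ptr->next;                /* line 2: now safe */
            new.count = old.count + 1;
            if (atomic_compare_exchange_weak(&Top, &old, new)) /* line 3 */
                break;
        }
        atomic_store(&HP[tid], NULL);
        /* The caller retires old.ptr; a reclaimer frees a retired node
           only after verifying that no hazard pointer slot points to it. */
        return old.ptr;
    }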
Hardware transactional memory may simplify nonblocking safe memory reclamation.4 As discussed later, however, most upcoming mainstream HTM implementations are best-effort, offering no guarantee that any transaction will ever succeed, and therefore require nonspeculative fallback paths.
The options for safe memory reclamation, in order of increasing flexibility (and increasing difficulty of support), are:
• Freed blocks will be reused but never freed for different uses.
• Freed blocks can be coalesced and reused for different types and sizes but not returned to the operating system.
• Freed blocks may be coalesced, reused arbitrarily, or returned to the operating system.
Note that for some algorithms (e.g., the classic LIFO list), memory safety might not be a problem if the operating system supports nonfaulting loads. In the scenario just mentioned of thread P reading address A in line 2 of Pop after the block at A was freed to the operating system, a system with nonfaulting loads would allow such a read; the value read is harmless, because Header must have changed and the subsequent CAS in line 3 is therefore bound to fail.
The ABA problem is common in optimistic concurrency algorithms, including nonblocking ones. The term ABA refers to the change of the value of a shared variable from A to B and back to A again. Using the LIFO list in Figure 1 as an example (and ignoring the packed integer for the moment), consider a list containing blocks A and B. Thread P reads A from Header in line 1 of Pop and reads A's next pointer in line 2, then is delayed. Meanwhile, other threads pop A, pop B, and push A back onto the list. When P resumes, the CAS in line 3 succeeds, because Header again holds A, and it incorrectly installs a pointer to B, which is no longer in the list.
The classic LIFO algorithm packs an integer with the pointer in the Header variable and is designed such that the counter will change if the list changes between lines 1 and 3 of Pop (assuming that the counter never overflows).
Packing an integer with the main content of variables vulnerable to the ABA problem is the classic ABA prevention solution.2 It remained the only solution for decades, but it has its limitations. It requires wide atomic operations to allow space for a large enough integer that cannot overflow (or at least cannot overflow in the lifetime of one operation, such as Pop in the example). Another disadvantage of this solution is that it requires the integer subfield to retain its semantics indefinitely. This can make it very difficult to free the memory of dynamic blocks that include variables packed with such integers, because the integer must remain meaningful even after the block's memory is reused.
Although the ABA problem can occur in algorithms that do not use dynamic memory, its solutions are intertwined with safe memory reclamation solutions. First, as already mentioned, the classic ABA solution can hinder memory reclamation. Second, any safe memory reclamation method that prevents a removed block from being reused while some operation may still hold a reference to it (e.g., hazard pointers) can also serve as the basis of an ABA prevention solution for pointer variables.16
An advantage of the RCU type of solution is that traversal operations have almost no overhead, while reference counting and hazard pointers have nontrivial overhead for traversal operations. On the other hand, unlike RCU, reference counting and hazard pointers bound the amount of removed-but-unreclaimed memory, even if some threads are delayed or terminated.
A disadvantage of reference counting that can be significant in some cases is that it can cause a supposedly read-only operation (e.g., a search) to perform writes to shared memory in order to update reference counts, with the associated coherence traffic and contention.
The operations LL (load-linked), SC (store-conditional), and VL (validate) offer, in their ideal form, an alternative to CAS that is inherently immune to the ABA problem: SC and VL on a variable succeed only if no other thread has written the variable since the current thread's latest LL to it.
Actual architectures that support LL/SC (e.g., IBM PowerPC) do not support the full semantics of ideal LL/SC/VL. None supports the VL operation, and all impose restrictions on what can be executed between LL and SC and prohibit the nesting or interleaving of LL/SC pairs on different locations. So, while actual LL/SC support can help the designer avoid ABA hazards in some cases, it cannot be assumed to provide the full power of ideal LL/SC/VL.
While implementations of ideal LL/SC/VL present an absolute ABA prevention solution,18 their strong semantics disallow many correct scenarios. For example, all ABA scenarios in the Push operation of the LIFO list are harmless, yet an ideal SC would fail in every such case and force unnecessary retries.
It is advisable to consider ABA prevention and safe memory reclamation solutions together to avoid unnecessary duplication of overhead or solutions that are contradictory or contrary to the overall system requirements.
The range of hardware support for atomic operations needed for nonblocking algorithms and methods varies significantly. If portability is an important issue, designers need to take that into account in selecting data structures, algorithms, and supporting methods for safe memory reclamation and management, and ABA prevention.
Hardware support requirements include:
•No support. For example, the reading and writing of hazard pointers can be nonatomic.16
•Nonuniversal atomic operations (such as fetch-and-add and swap). Herlihy7 showed that such operations are insufficient for building lock-free or wait-free implementations of arbitrary object types shared by arbitrary numbers of threads.
•CAS. Herlihy7 showed that CAS is a universal operation (i.e., it can be used to construct lock-free or wait-free implementations of arbitrary object types). It is supported on most mainstream architectures.
•The pair LL and SC (e.g., larx and stcx on the IBM Power architecture). These were also shown by Herlihy to be universal operations. As already discussed, they are immune to the ABA problem in some cases in which CAS is susceptible. Therefore, algorithms designed assuming LL/SC may need additional ABA prevention when ported to CAS-only architectures.
•Hardware transactional memory. Recently IBM (Blue Gene/Q,21 System Z,12 and Power3) and Intel11 architectures have been offering hardware support for arbitrary memory transactions. However, most of these HTM architectures (except IBM System Z) are best-effort: they offer no guarantee that any transaction, however small, will ever commit, and thus they require nonspeculative fallback paths.
Note that if the number of threads that can concurrently execute certain operations is limited, nonuniversal atomic operations, or even plain reads and writes, may suffice to design lock-free and even wait-free algorithms.
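For instance, the following is a hedged sketch of a classic single-producer, single-consumer ring buffer (in the style of Lamport's) that is wait-free using only atomic loads and stores; the capacity, element type, and names are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QSIZE 1024                /* usable capacity is QSIZE - 1 */

    static int buf[QSIZE];
    static atomic_uint head;          /* advanced only by the consumer */
    static atomic_uint tail;          /* advanced only by the producer */

    /* Producer only: completes in a bounded number of its own steps. */
    bool enqueue(int item) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        unsigned next = (t + 1) % QSIZE;
        if (next == atomic_load_explicit(&head, memory_order_acquire))
            return false;             /* full */
        buf[t] = item;
        atomic_store_explicit(&tail, next, memory_order_release);
        return true;
    }

    /* Consumer only: also wait-free. */
    bool dequeue(int *item) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        if (h == atomic_load_explicit(&tail, memory_order_acquire))
            return false;             /* empty */
        *item = buf[h];
        atomic_store_explicit(&head, (h + 1) % QSIZE, memory_order_release);
        return true;
    }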
In addition to the variety of requirements of atomic operations, nonblocking algorithms vary in their requirements of memory ordering, i.e., the constraints on the order in which one thread's memory accesses become visible to others, which typically require fence (barrier) instructions to enforce.
Prior to Java 5 (2004) and C11/C++11, these languages could not be reliably used (as specified by their standards) to fully express the required memory ordering. Programmers and custom library writers relied on assembly language and machine-specific code to express such ordering.
Now with C11/C++11 and Java (5 and later), programmers can designate some variables as subject to special ordering (volatiles in Java and atomics in C11/C++11). In some cases, these default designations can be stronger, and hence costlier, than what the algorithm actually requires; C11/C++11 allows the ordering of individual atomic accesses to be relaxed explicitly.
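As an example, here is a minimal C11 sketch of the counter increment from the beginning of this article with the ordering constraints stated explicitly. The relaxed orders shown are illustrative assumptions, appropriate only if the counter's value is not used to publish other data.

    #include <stdatomic.h>

    static atomic_ulong C;

    void increment(void) {
        unsigned long old = atomic_load_explicit(&C, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   &C, &old, old + 1,
                   memory_order_relaxed,     /* ordering on success */
                   memory_order_relaxed))    /* ordering on failure */
            ;  /* a failed CAS reloads old with C's current value; retry */
    }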
The implementations of such languages have varying overheads on different hardware platforms. Designers of nonblocking implementations should take into account the choice of memory ordering requirements and their costs on the intended target platforms.
The combined choice of data structures and algorithms is one issue where big compromises can be made to design a system that meets its overall requirements.
Nonblocking algorithms vary in their portability (e.g., requirement of special hardware support), reliability (e.g., whether or not they are widely used), complexity (e.g., reasonable ease of implementation, maintenance, and modification), progress guarantees (e.g., wait-free vs. lock-free vs. obstruction-free), and performance and scalability.
The purpose of the following example is to demonstrate the interactions among nonblocking features and issues discussed in this article.
Consider a simplified example of a design problem: a multithreaded system maintains search structures that hold records, where the records can be large and variably sized, can be moved back and forth among the structures, and whose memory must eventually be freed for arbitrary reuse. The operations on these structures need to be nonblocking.
Considering that the records can be moved back and forth among data structures, any nonblocking algorithm will have to deal with the ABA problem. Since the records can be large and variably sized, limiting the reuse of their memory is not an acceptable option. The classic ABA solution (used in the LIFO list of Figure 1) is therefore problematic: the integers packed with pointers into the records would have to retain their semantics even after the records' memory is freed for arbitrary reuse.
This may be a good time to consider a compromise. How about adding a separate descriptor to each record? The descriptor holds the key value and any fields needed for traversing the data structures; the fields then may be changed in place without removing and adding the records to the structure. With this compromise the memory of the full records can still be freed, but it might be acceptable to keep descriptors from being freed for arbitrary reuse.
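A hedged sketch of what such a descriptor might look like (the names and fields are illustrative): the small descriptor carries the key and the linkage used by the search structures, while the large record it points to can be replaced in place and freed normally.

    #include <stdatomic.h>

    typedef struct Record Record;         /* large, variably sized payload;
                                             freed for arbitrary reuse once
                                             it is replaced or retired */

    typedef struct Descriptor {
        struct Descriptor *_Atomic next;  /* linkage used by the search
                                             structures */
        unsigned long key;
        Record *_Atomic rec;              /* swapped in place; the
                                             descriptor itself is never
                                             freed for arbitrary reuse, so
                                             pointers to it stay stable */
    } Descriptor;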
Now let's reconsider the ABA problem. Maybe the classic solution of packing integers with pointers is feasible after all: since the descriptors are never freed for arbitrary reuse, the packed integers retain their semantics indefinitely. Alternatively, reclamation-based ABA solutions (e.g., hazard pointers) remain applicable without requiring wide atomic operations.
Now the designer is left with the task of balancing the pros and cons of sequential performance (i.e., when using packed integers with wide atomic operations) against the per-access overheads of reclamation-based solutions, in light of the overall system requirements.
As the reader can see, this case study goes through a few iterations over the various choices, starting with what is absolutely necessary, making a compromise (using descriptors and adding a level of indirection), and finally reaching a reasonable set of options for further consideration of their pros and cons.
This article has presented the key issues that can help designers of nonblocking systems make well-informed decisions and worthwhile compromises among data structures, algorithms, progress guarantees, memory reclamation, ABA prevention, and hardware and language support.
The author thanks Samy Al Bahra and Paul McKenney for their insightful comments on earlier versions of this article.
1. Attiya, H., Hendler, D. 2005. Time and space lower bounds for implementations using k-CAS. In Proceedings of the 19th International Symposium on Distributed Computing (DISC).
2. Brown, P. J., Smith, R. M. 1973. U.S. Patent 3,886,525. Shared data controlled by a plurality of users (filed June 1973).
3. Cain, H. W., Frey, B., Williams, D., Michael, M. M., May, C., Le, H. 2013. Robust architectural support for transactional memory in the Power architecture. ACM/IEEE 40th International Symposium on Computer Architecture (ISCA).
4. Dragojevic, A., Herlihy, M., Lev, Y., Moir, M. 2011. On the power of hardware transactional memory to simplify memory management. In Proceedings of the 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC).
5. Fatourou, P., Kallimanis, N. D. 2011. A highly-efficient wait-free universal construction. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
6. Heller, S., Herlihy, M., Luchangco, V., Moir, M., Scherer III, W. N., Shavit, N. 2005. A lazy concurrent list-based set algorithm. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS).
7. Herlihy, M. 1991. Wait-free synchronization. ACM Transactions on Programming Languages and Systems 13(1): 124-149.
8. Herlihy, M., Luchangco, V., Martin, P. A., Moir, M. 2005. Nonblocking memory management support for dynamic-sized data structures. ACM Transactions on Computer Systems 23(2): 146-196.
9. Herlihy, M., Luchangco, V., Moir, M. 2003. Obstruction-free synchronization: double-ended queues as an example. In Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS).
10. IBM System/370 Principles of Operation. 1975.
11. Intel Architecture Instruction Set Extensions Programming Reference. 2012.
12. Jacobi, C., Slegel, T. J., Greiner, D. F. 2012. Transactional memory architecture and implementation for IBM System Z. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
13. McKenney, P. 2013. Structured deferral: synchronization via procrastination. ACM Queue. http://queue.acm.org/detail.cfm?id=2488549
14. McKenney, P. E., Slingwine, J. D. 1998. Read-copy update: using execution history to solve concurrency problems. In Proceedings of the International Conference on Parallel and Distributed Computing and Systems (PDCS).
15. Michael, M. M., Scott, M. L. 1996. Simple, fast, and practical nonblocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC): 267-275.
16. Michael, M. M. 2004. Hazard pointers: safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15(6): 491-504.
17. Michael, M. M. 2004. Scalable lock-free dynamic memory allocation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
18. Michael, M. M. 2004. Practical lock-free and wait-free LL/SC/VL implementations using 64-bit CAS. In Proceedings of the 18th International Symposium on Distributed Computing (DISC).
19. Timnat, S., Braginsky, A., Kogan, A., Petrank, E. 2012. Wait-free linked-lists. In Proceedings of the 16th International Conference on Principles of Distributed Systems (OPODIS).
20. Valois, J. D. 1995. Lock-free linked lists using compare-and-swap. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing (PODC).
21. Wang, A., Gaudet, M., Wu, P., Amaral, J. N., Ohmacht, M., Barton, C., Silvera, R., Michael, M. M. 2012. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
Maged M. Michael is a research staff member at the IBM Thomas J. Watson Research Center. He received a Ph.D. degree in computer science from the University of Rochester. His research areas include concurrent algorithms, nonblocking synchronization, transactional memory, multicore systems, and concurrent memory management. He is an ACM Distinguished Scientist and a member of the Connecticut Academy of Science and Engineering.
© 2013 ACM
Originally published in Queue vol. 11, no. 7.