Skip to main content
Department of Information Technology

NUMA-Aware Reader-Writer Locks (PPoPP'13)


The paper describes a new type of Reader-Writer Lock that is especially designed to perform well on Non-Uniform Memory Access (NUMA) systems. The authors compare variants of the new lock and other relevant locks.

Link to paper:


Question 1:

Fred described the queue based locks (MCS and CLH) during the introduction lecture. These locks can be considered to be fair because a thread that put itself in the queue before another thread will execute the critical section before that other thread. What performance advantage do you think locks that are not fair have on Non-Uniform Memory Access (NUMA) systems?

Question 2:

Let us say that we have a program running on a NUMA system with a total of 64 cores. The program has 64 threads. All the threads are accessing a hash table that is protected by a single Reader-Writer (RW) lock. The hash table is accessed for reading 80% of the times and for writing 20% of the times it is accessed. Why could the program benefit from using a RW lock with writer-preference compared to a RW lock with reader-preference or neutral-preference?


Q1: If the running thread is about to release the lock, it can be tempting to give it to another thread that is loaded on another core of the same chip, eventhough it is not the designated successor (not the first one of the queue) because the cache-to-cache communication will be cheaper. If the designated successor was loaded on a core of another chip, being not fair would save us the extra costly communication.

Q2: The answer is in section 3.4 and especially in the section 4.3 (about the reader-preference performance pathology).


Q1: Intra-node communication is more expensive. Unfortunate interleavings of lock-accesses from different NUMA-nodes could be avoided by letting all lock accesses originating from the same NUMA-node finish first.

Q2: Assuming any thread may become a reader with 80% probability: Once a writer leaves the writer queue, it may turn into a reader with a high probability, blocking the increasingly long writer queue.


In the worst case, a fair algorithm distributes the accesses alterning on two cores. This would require to constantly exchange the look between the involved nodes, which is slow on NUMA architectures. Non-fair algorithms can reorder the operations to reduce the amount of times the lock is exchanged.

In a system where we have 80\% reads and 20\% writes, prioritizing writes leads to a buildup of read requests. Write request are executed sequential anyways, meaning that the speed-up of our application is primarily dependent on the parallelization of the read requests. A read preference will execute read requests even when there is not much parallelism to exploit. This preference would also postpone write requests which could enable additional read requests. (hence we have 80\% reads vs. 20\% writes) The same is true for a neutral preference.


Q1: If the lock does not need to be fair, the lock can be handed over to a thread on the same NUMA node after one thread releases the lock. This can reduce the coherence traffic between NUMA nodes.

Q2: When a writer finishes the operation, it is likely to turn into a reader again (since the probability of writes is 20% and the probability of reads is 80%) . In a RW-lock with read-preference, the writers need to wait for the readers to finish in order to proceed. This reader which is previously a writer will block all the other writers from proceeding.


If locks are handed over to threads in the same NUMA node, the accessed data might need to be send back and forth between NUMA nodes less frequently. And since communication is cheaper within a NUMA node than between NUMA nodes, this is good.

By letting writers ahead in the line, readers will queue up. When the readers are allowed to execute there will be more of them compared to the other schemes. As readers can execute concurrently, this gives more parallelism, which makes the total runtime shorter.


Considering the cache locality it is more efficient to let the threads that have their data in cache to be scheduled for run first rather than running them in a fair FIFO of their order of coming to run.

Thanks to Fred's answer, the write-preference RW lock shows better performance due to the fact that when write accesses are processed first the read ones will be collected and can be scheduled at once. On the other hand, when read accesses are given priority then the few write ones have to wait until all (frequent) reads finish which degrades performance and even when a write operation finishes it will change to a read one with a high probability (due to the rate of reads and writes; which is 4 to 1)


On NUMA systems, requests that arrive at the same lock might be spread over the physical cores of the system. In these cases, it might be beneficial to group together requests coming from cores that are grouped on the physical layout, in order to improve cache performance, as these cores are more probable to share cache data that are structural components of the lock or are protected by it. Moreover, being unfair allows the use of more advanced techniques, like lock cohorts, that reduce traffic on the interconnect between the physical modules, by letting the lock be transferred "locally", without generating cache-coherence-related interconnect traffic.

By letting the (comparatively fewer) writes be scheduled first, we avoid the problematic scenario where writes are constantly overtaken by frequently arriving reads, becoming impatient and forcefully blocking the reads from executing. Writer preference in such a case could also potentially cause better read performance, as reads are still grouped together after the writes.


Lock being handed over between threads in the same NUMA node but not according to the requesting order is unfair. But this saves the traffic among different NUMA nodes.

With much more readers than writers, readers will always be excecuted first which make writers starve in the reader-preference scheme. On the other hand, writers are able to hold the lock first given the preference while readers are more likely to be executed after the writer operation finishes.


Due to cache coherence the cost of transfering locks over different NUMA nodes might be costly.
NUMA allows to have inter-node-communication so threads sitting in the same node should benefit from this.

This can be seen in figure 6d the of this paper where the same ratio of readers/writer has been taken into account.
The reader priority shows a much lower throughput when the amount of threads increases, while the throghput of the writer priority case is about 25x faster on 80 threads. The reason is that the reader threads, queue up while waiting for the writer lock to release, which gets the preference faster. As they are parallel, the processing of these queued reader threads will be more efficient then the processing of queued writer-threads which will be processed sequentially nevertheless.


If lock accesses are prioritized such that the lock will stay with a thread in the same NUMA-node, some performance will be gained due to less intra-node communication than if the lock would move between NUMA-nodes.

With writer-preference, writes will get priority and stall the reads, causing them to queue up. When the reads get executed there will be more of them to do in parallel. The percentage of writes is rather low so the impact on the throughput of reads is quite small.


It may be too expensive to pass the lock between different NUMA nodes. It may be cheaper to let a waiting thread on the same NUMA node acquire a lock (unfairly) rather than let waiting threads on a different NUMA node acquire the lock in turn.

Any writer will likely become a reader after completing its operation. Giving preference to writers will result in writes executing first, one at a time, while readers wait. When the writer releases the lock, all writers can execute concurrently. If readers are given preference, then both readers and writers will have to wait, and we'll miss the performance advantage of having many readers executing in parallel.

Updated  2013-06-03 16:43:54 by Sofia Cassel.