Fast concurrent queues for x86 processors (PPoPP '13)
Description
Conventional wisdom in designing concurrent data structures is to use the most powerful synchronization primitive, namely compare-and-swap (CAS), and to avoid contended hot spots. In building concurrent FIFO queues, this reasoning has led researchers to propose combining-based concurrent queues. This paper takes a different approach, showing how to rely on fetch-and-add (F&A), a less powerful primitive that is available on x86 processors, to construct a nonblocking (lock-free) linearizable concurrent FIFO queue which, despite the F&A being a contended hot spot, outperforms combining-based implementations by 1.5x to 2.5x at all concurrency levels on an x86 server with four multicore processors, in both single-processor and multi-processor executions.
Link to the paper:
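To make the approach concrete, here is a minimal C11 sketch of the "infinite array" queue idea the paper builds on: enqueuers and dequeuers claim cells with fetch-and-add and hand items over with an atomic SWAP (exchange). This is my own simplified rendering, not the paper's LCRQ algorithm: the names faa_queue and TAKEN and the fixed SLOTS bound are invented for illustration, cells are never recycled (so the sketch is only meaningful for fewer than SLOTS operations), and none of the paper's CRQ ring-buffer machinery is included.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define SLOTS (1u << 16)              /* stand-in for the "infinite" array (assumption) */
#define TAKEN ((void *)1)             /* poison value a dequeuer leaves in a cell it gave up on */

typedef struct {                      /* assumes the queue object is zero-initialised */
    _Atomic uint64_t head;            /* index of the next cell to dequeue from */
    _Atomic uint64_t tail;            /* index of the next cell to enqueue into */
    _Atomic(void *)  cell[SLOTS];     /* NULL means "not yet written by an enqueuer" */
} faa_queue;

/* Enqueue: claim a cell with fetch-and-add, publish the item with an atomic SWAP.
 * If a dequeuer already poisoned that cell, move on to the next one.             */
static void enqueue(faa_queue *q, void *item)
{
    for (;;) {
        uint64_t t = atomic_fetch_add(&q->tail, 1);     /* the single contended F&A */
        if (atomic_exchange(&q->cell[t], item) == NULL)
            return;                                     /* item is now visible to dequeuers */
        /* cell was poisoned by a racing dequeuer: retry with a fresh index */
    }
}

/* Dequeue: claim a cell with fetch-and-add, SWAP in the poison value, and
 * return whatever the matching enqueuer left there, or report "empty".    */
static void *dequeue(faa_queue *q)
{
    for (;;) {
        uint64_t h = atomic_fetch_add(&q->head, 1);
        void *item = atomic_exchange(&q->cell[h], TAKEN);
        if (item != NULL)
            return item;                                /* got a real element */
        if (atomic_load(&q->tail) <= h + 1)
            return NULL;                                /* queue appeared empty */
    }
}
```

The point the paper makes is that the only contended operation here is the F&A on head and tail, which on x86 (lock xadd) always makes progress, in contrast to a CAS retry loop that can fail repeatedly under contention.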
Questions
Question 1:
In a program where one thread performs an enqueue operation and another performs a dequeue operation using the infinite-array queue primitives given in Figure 2, give two different possible interleavings of the two threads' instructions in which the dequeue returns the value inserted by the enqueue. In each interleaving, mark two instructions as suitable linearization points.
Question 2:
At the beginning of Section 4.1 the authors claim that head and tail can realistically be assumed not to exceed 2^63. Based on your own experience with queues in concurrent applications, give an example where a lower limit would also be reasonable. Would it make sense, from a practical point of view, to use such a lower limit in the implementation?
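For a sense of the magnitudes involved in this question, a quick back-of-the-envelope calculation; the assumed rate of 10^9 fetch-and-adds per second on the shared counter is my own (optimistic) figure, not something taken from the paper:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed throughput: 1e9 fetch-and-adds per second on the shared counter
     * (a made-up figure, used only to get the order of magnitude). */
    const double ops_per_sec   = 1e9;
    const double secs_per_year = 365.25 * 24.0 * 3600.0;

    const double two_to_63 = 9223372036854775808.0;   /* 2^63 */
    const double two_to_32 = 4294967296.0;            /* 2^32 */

    printf("2^63 increments: ~%.0f years\n", two_to_63 / ops_per_sec / secs_per_year);  /* ~292 years */
    printf("2^32 increments: ~%.1f seconds\n", two_to_32 / ops_per_sec);                /* ~4.3 seconds */
    return 0;
}
```

Under that assumption a 32-bit index wraps within seconds, while exhausting a 63-bit budget takes roughly three centuries, which is the sense in which the authors call the 2^63 bound realistic.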
Jonatan
Q1: Threads a and b. a2 a3 a4(L) b7 b8 b9(L) b10 b8 b9 b10 a3 a4 b11 b7 b8 a3 a4(L) b9(L) b10
Q2: No.
Kjell
Q1: a2 a3 b7 b8 a4(L) b9(L) b10
b7 b8 a2 a3 a4(L) b9(L) b10
Q2: It would not be practical to use 63 bits for the counter when the computer that will use the queue only supports atomic fetch-and-add on 32-bit integers.
Andreas
Q1: e:2 e:3 [e:4a] e:4b d:7 d:8 [d:9] d:10a d:10b
Q2: The only alternative is to use 32-bit instead of 64-bit indices to fit in
Pan
Q1: Given thread a and thread b:
Q2: 32 bits might be an alternative.
Fred
Q1: Situation 2: Let thread 1 execute the enqueue until the SWAP completes (even before the equality test, the result of the swap is stored in a temporary variable). Then let thread 2 execute the dequeue until the SWAP completes. Let them finish in any order they want. Let the linearization points be the swaps. Those interleavings will correspond to executions where the second thread dequeues the object that the first thread enqueued.
Q2: Eh? ...I'm guessing no...
Martin
Q1: Interleaving 1: Interleaving 2: The SWAP instructions are the linearization points.
Q2:
Pavol
Q1: That's two cases where we don't have to run through the loop again.
Q2:
Afshin
Q1: Q2:
Yunyun
Q1: Q2:
Magnus
Q1: t1 enqueues, t2 dequeues. Interleaving 1: Interleaving 2:
Q2: