Department of Information Technology

Fast concurrent queues for x86 processors (PPoPP '13)

Description

Conventional wisdom in designing concurrent data structures is to use the most powerful synchronization primitive, namely compare-and-swap (CAS), and to avoid contended hot spots. In building concurrent FIFO queues, this reasoning has led researchers to propose combining-based concurrent queues.

This paper takes a different approach, showing how to rely on fetch-and-add (F&A), a less powerful primitive that is available on x86 processors, to construct a nonblocking (lock-free) linearizable concurrent FIFO queue. Despite the F&A being a contended hot spot, the queue outperforms combining-based implementations by 1.5x to 2.5x at all concurrency levels on an x86 server with four multicore processors, in both single-processor and multi-processor executions.

Link to the paper:
[1]

Slides

Questions

Question 1:

In a program where one thread does an enqueue operation and another does a dequeue operation using the infinite array queue primitives given in Figure 2, give two different possible interleavings of instructions by the two threads in which the dequeue operation returns the value inserted by the enqueue operation. In each interleaving mark two instructions as suitable linearization points.

Question 2:

At the beginning of Section 4.1, the authors claim that head and tail can realistically be assumed not to exceed 2^63. Based on your own experience with queues in concurrent applications, give an example where a lower limit would also be reasonable. Would it make sense to use such a lower limit in the implementation from a practical point of view?

Jonatan

Q1: Threads a and b.

a2 a3 a4(L) b7 b8 b9(L) b10

b8 b9 b10 a3 a4 b11 b7 b8 a3 a4(L) b9(L) b10


Q2: No.

Kjell

Q1:

a2 a3 b7 b8 a4(L) b9(L) b10

b7 b8 a2 a3 a4(L) b9(L) b10


Q2:
An example of a situation where less than 63 bits could be used for head and tail is when the number of elements that will be inserted into the queue is known in advance.

It would not be practical to use 63 bits for the counter when the computer that will use the queue only supports atomic fetch-and-add with 32-bit integers.

Andreas

Q1:
e:4a SWAP...
e:4b return ok;
d:10a x!=..
d:10b return ..
d:11a (tail...)
d:11b return ..

e:2 e:3 [e:4a] e4b d:7 d:8 [d:9] d:10a d:10b
d:7 d:8 e:2 e:3 [e:4a] e:4b [d:9] d:10a d:10b


Q2:
Simulating customers of a supermarket with queueing networks does not require
that many elements per queue, unless you run a very successful business.

The only alternative is to use 32-bit instead of 64-bit indices so that a node fits in
one data word. But this would limit the number of data elements to at most
2^31, which is quite low.

Pan

Q1: Given thread a and thread b:
a2 a3 b7 a4(L) b8 b9(L) b10
b7 b8 a2 a3 a4(L) b9(L) b10


Q2: 32 bits might be an alternative.

Fred

Q1:
Situation 1: first enqueue, then dequeue (no interleaving; each method runs as if atomically). That's a valid interleaving that dequeues the object enqueued by the first thread. Any instruction can serve as the linearization point in each method; the methods do appear to take effect atomically at those points.

Situation 2: Let thread 1 execute the enqueue until the SWAP completes (even before the equality test, the result of the swap is stored in a temporary variable). Let then thread 2 execute the dequeue until the SWAP completes. Let them finish in any order they want. Let the linearization points be the swaps. Those interleavings will correspond to executions where the second thread dequeues the object that the first thread enqueued.


Q2: Eh? ...I'm guessing no...

Martin

Q1:
Let thread A enqueue, and thread B dequeue.

Interleaving 1:
A: t := F&A(&tail, 1)
B: h := F&A(&head, 1)
A: SWAP(&Q[t], x) -> bottom
A: return true
B: SWAP(&Q[h], T) -> x
B: return x

Interleaving 2:
B: h := F&A(&head, 1)
A: t := F&A(&tail, 1)
A: SWAP(&Q[t], x) -> bottom
A: return true
B: SWAP(&Q[h], T) -> x
B: return x

The SWAP instructions are the linearization points.


Q2:
It could make some sense to use 2^31 as the upper limit. Then the Node struct is only 64 bits, and only 64-bit atomic instructions are needed.

Pavol

Q1
Let's have two threads doing T1: enqueue(x), T2: dequeue()
First scenario:
T1: Line 3, T2: Line 8, T1: Line 4, T2: Line 9, T2: Lines 10+11
Second scenario:
T2: Line 8, T1: Line 3, T1: Line 4, T2: Line 9, T2: Lines 10, 11

That's two cases where we don't have to run through the loop again.
In both cases, the SWAP operations at lines 4 and 9 would be suitable linearization points.


Q2:
If the queue is used in a simulator or another application where one knows the problem size in advance, one could analytically calculate the maximal size of the problem and e.g. use a 32-bit word as index if that is sufficient.

Afshin

Q1:


Q2:
It is OK to use fewer than 63 bits for the indices, but according to the results shown in the paper, choosing too small a ring size (less than 2^10) destroys the throughput as well.

Yunyun

Q1:
a2 a3 a4(L) b7 b8 b9(L) b10
b7 b8 a2 a3 a4(L) b9(L) b10


Q2:
A lower limit would be 2^31. Is it too low...?

Magnus

Q1:

t1 enqueues, t2 dequeues

Interleaving 1:
t1: l2
t1: l3
t1: l4, return true (L)
t2: l7
t2: l8
t2: l9 (L)
t2: l10, return x

Interleaving 2:
t2: l7
t1: l2
t2: l8
t1: l3
t1: l4, return true (L)
t2: l9 (L)
t2: l10, return x


Q2:
If you know a priori that the number of indices is small enough, it could perhaps make sense to have a lower limit of 2^31.

Updated  2013-04-29 16:45:45 by Stavros Aronis.