
Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

Description (Abstract)

On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.

Link to the paper

Questions

Question 1: The authors want to increment floating point entries atomically. How could one do this (efficiently) on common architectures?

Question 2: The authors use a constant c to quantify the extra cost of atomic operations in the analysis of their algorithm. Does it, in your view, make sense to use such a constant when setting up a performance model?

Kjell

Q1:
One can increment a floating-point number efficiently on modern processors with a so-called compare-and-swap (CAS) loop. CAS is an instruction that takes an address, an expected value, and a new value as input. It atomically overwrites the memory at the given address with the new value if, and only if, the current contents equal the expected value, and it returns true if the update succeeded and false otherwise. A CAS instruction, or a similar primitive such as load-linked/store-conditional, is implemented in most modern multicore processors.

#include <stdatomic.h>

void incAtomic(_Atomic double *address) {
  double old, desired;
  do {
    old = atomic_load(address);    /* read the current value */
    desired = old + 1.0;           /* compute the incremented value */
  } while (!atomic_compare_exchange_weak(address, &old, desired));
}


Q2:
Yes, it definitely makes sense to use such a constant, because the cost of an atomic operation differs between processors. An atomic instruction can take many times longer than its non-atomic counterpart. Using one constant for all atomic operations can be troublesome in some cases, though, because the time to complete an atomic operation can depend on the activities of other threads (e.g. in a CAS loop).

Martin

Q1: It's tricky because compare-and-swap instructions only operate on integer values. The direct way would be:

void atomic_add(double *elem, double delta) {
  do {
    load elem -> old_int      // read the raw bits once, for the CAS comparison
    compiler_fence            // prevent the compiler from merging the two loads
    load elem -> double_val   // read the same word again, this time as a double
    double_val += delta
    new_int = reinterpret_float_as_int(double_val)
  } while (!cas(elem, old_int, new_int))
}

This assumes that the increments really are increments, so that the value increases monotonically. If delta is allowed to be negative, this doesn't work because of an ABA problem, in which case the load of double_val needs to be replaced by reinterpreting old_int, which means an extra store.

Reinterpreting between floats and ints requires a store and a load, so this code performs the following operations: [ load load fadd store load cas ], where the brackets indicate possible looping.

It is probably better to do this as a lock:

void atomic_add(double *elem, double delta) {
  // tatas spin-lock: a designated NaN bit pattern marks the element as locked
  do {
    do {
      load elem -> oldint
    } while (oldint == NaN)   // spin read-only while locked (integer compare
                              // against the NaN pattern, not a float compare)
  } while (!cas(elem, oldint, NaN))
  // element is locked, and can be updated
  double floatval = reinterpret_int_as_float(oldint)
  floatval += delta
  store floatval -> elem      // storing the new value also releases the lock
}

which performs the following operations: [ [ load ] cas ] store load fadd store.
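For concreteness, here is a minimal C11 sketch of the spin-lock variant (an illustration under assumptions, not the paper's code): the element is accessed as an atomic 64-bit integer, and LOCKED_BITS is a made-up quiet-NaN bit pattern reserved as the lock marker, which must never occur as a real matrix value.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* hypothetical quiet-NaN bit pattern reserved as the "locked" marker */
#define LOCKED_BITS 0x7ff8dead00000000ULL

void atomic_add_locked(_Atomic uint64_t *elem, double delta) {
  uint64_t old;
  do {
    do {
      old = atomic_load(elem);        /* tatas: spin read-only while locked */
    } while (old == LOCKED_BITS);
  } while (!atomic_compare_exchange_weak(elem, &old, LOCKED_BITS));
  /* element is locked; update it */
  double val;
  memcpy(&val, &old, sizeof val);     /* reinterpret the bits as a double */
  val += delta;
  memcpy(&old, &val, sizeof old);
  atomic_store(elem, old);            /* storing the result releases the lock */
}

The memcpy calls are the portable way to express the reinterpretations; whether they cost an actual store/load pair (as counted above) or just a register move depends on the compiler and target.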

(Sorry for the much too detailed answer...)


Q2: The cost of their atomic instruction is actually not constant but depends on the level of contention. However, one can interpret the constant as assuming a certain maximum level of contention, which I think is fine.

Edit: This is what I meant by the ABA problem:

If *elem is initially A:

load elem -> old_int // reads A into old_int
// another thread changes elem to B
load elem -> double_val // reads B into double_val
// another thread changes elem to A
double_val += 1.0 // double_val is now B+1.0
cas(elem, A, B+1.0) // the compare-and-swap will be successful.

Instead of increasing elem to A+1.0 it's increased to B+1.0, which is incorrect.

To fix this we could reinterpret old_int instead of loading the value a second time, but this requires one more store:

void atomic_add(double *elem, double delta) {
  do {
    load elem -> old_int
    double_val = reinterpret_int_as_float(old_int)
    double_val += delta
    new_int = reinterpret_float_as_int(double_val)
  } while (!cas(elem, old_int, new_int))
}

[ load store load fadd store load cas ]
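Translated into portable C11, the corrected loop could look roughly like this (a sketch under the same assumption that the element is accessible as an atomic 64-bit integer holding the double's bits):

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

void atomic_add(_Atomic uint64_t *elem, double delta) {
  uint64_t old_int, new_int;
  double double_val;
  do {
    old_int = atomic_load(elem);                       /* load the raw bits once */
    memcpy(&double_val, &old_int, sizeof double_val);  /* reinterpret as double */
    double_val += delta;
    memcpy(&new_int, &double_val, sizeof new_int);     /* and back to bits */
  } while (!atomic_compare_exchange_weak(elem, &old_int, new_int));
}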

Stavros

Q1:
In the context of matrix-vector multiplication, I don't think there is really an ABA problem to be dealt with. If two concurrent updates cancel each other out, then so do their contributions. ABA problems arise when the 'restored' A value has a different meaning than the original, which is not the case when dealing with plain numbers. A CAS operating on sufficiently many bits to represent the floating-point number is therefore enough for atomic operations on floats. It can be placed in a retry loop, as usual.
Thinking more generally, CAS loops may put too much pressure on the cache-coherence mechanism. Since padding is not a solution either (the authors strive for a compact representation), a lock with the cohort mechanism, such as the one we saw in a previous presentation, may perform better.


Q2:
My opinion is that the cost of atomic operations should be related to contention. The way I see it, if all n cores try to operate on the same value at the same time, the worst case is that only one succeeds in each round, leading to 1+2+...+n operations, which is O(n^2). However, the blocked approach and the sparsity of the matrix should keep the actual contention lower.
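For reference, the arithmetic behind that worst-case bound (n cores contending, exactly one succeeding per round):

\[ n + (n-1) + \dots + 1 = \sum_{k=1}^{n} k = \frac{n(n+1)}{2} = O(n^2) \]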

Andreas

Q1:
One could update the float counter indirectly, through a pointer into an array of per-thread slots:

#include <stdatomic.h>

float counter_buckets[NUMBER_OF_THREADS];  /* one staging slot per thread */
float *_Atomic counter;                    /* points at the current value */
_Thread_local int thread_id;

void
inc_counter(float *_Atomic *counter_ptr)   /* call as inc_counter(&counter) */
{
  float *old;
  do {
    old = atomic_load(counter_ptr);
    /* stage the incremented value in this thread's own slot */
    counter_buckets[thread_id] = *old + 1.0f;
  } while (!atomic_compare_exchange_weak(counter_ptr, &old,
                                         &counter_buckets[thread_id]));
  /* caveat: if this thread's slot is still the currently installed value,
     overwriting it above races with concurrent readers */
}


Q2:
I think that the cost of atomic operations depends much more on the current contention. On the other hand, the contention can be influenced by the (hardware) implementation of the atomic operations, so such a constant could make sense indirectly. I think it does not matter too much, though.

Magnus

Q1:
By reinterpreting floats as integers (e.g. with reinterpret_cast in C++), one can do a regular CAS on the old and the modified value. The ABA problem of the naive version, as described by Martin, should be taken into account (e.g. by reinterpreting the loaded value), but if we can guarantee that the deltas are strictly positive, we can save 20% of memory bandwidth by using the naive version (4 loads/stores instead of 5). Interestingly, the spin-lock version seems to be the most efficient implementation, judging by the number of loads and stores.


Q2:
Contention should be taken into account. However, for a given matrix and its sparsity pattern, and for a given architecture, the contention should be rather constant so in my opinion, the constant c makes sense in the performance model.

