# Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

## Abstract

On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (the byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (and, in the unsymmetric case, its transpose) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing ample parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefit of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected on future multicore systems as current trends (decreasing byte:flop ratios and larger sparse matrices) continue.

## Questions

**Question 1:** The authors want to increment floating-point entries atomically. How could one do this (efficiently) on common architectures?

**Question 2:** The authors use a constant c to quantify the extra cost of atomic operations in the analysis of their algorithm. Does it, in your view, make sense to use such a constant when setting up a performance model?

Kjell

Q1:

```
incAtomic(address) {
    do {
        old = *address;
        new = old + 1.0;
    } while (!CAS(address, old, new));
}
```

Q2:
Martin

Q1: It's tricky because the compare-and-swap instructions only work on integer values. The direct way would be:

```
void atomic_add(double *elem, double delta) {
    do {
        load elem -> old_int
        compiler_fence
        load elem -> double_val
        double_val += delta
        new_int = reinterpret_float_as_int(double_val)
    } while (!cas(elem, old_int, new_int))
}
```

Reinterpreting between floats and ints requires a store and a load, so this code performs the following operations: [ load load fadd store load cas ], where brackets mean possible looping. It assumes that the increments are actually increments, so that the value is increasing monotonically. If delta is allowed to be negative, this doesn't work because of an ABA problem. Suppose *elem is initially A:

```
load elem -> old_int      // reads A into old_int
// another thread changes elem to B
load elem -> double_val   // reads B into double_val
// another thread changes elem back to A
double_val += 1.0         // double_val is now B + 1.0
cas(elem, A, B + 1.0)     // the compare-and-swap succeeds
```

Instead of increasing elem to A + 1.0 it is increased to B + 1.0, which is incorrect. To fix this, the load of double_val can be replaced by reinterpreting old_int, which costs one more store:

```
void atomic_add(double *elem, double delta) {
    do {
        load elem -> old_int
        double_val = reinterpret_int_as_float(old_int)
        double_val += delta
        new_int = reinterpret_float_as_int(double_val)
    } while (!cas(elem, old_int, new_int))
}
```

This performs: [ load store load fadd store load cas ]. It is probably better to do this as a lock:

```
void atomic_add(double *elem, double delta) {
    // tatas spin-lock: a NaN bit pattern marks the element as locked
    do {
        do { load elem -> old_int } while (old_int == NaN)
    } while (!cas(elem, old_int, NaN))
    // element is locked, and can be updated
    double float_val = reinterpret_int_as_float(old_int)
    float_val += delta
    store float_val -> elem
}
```

which performs the following operations: [ [ load ] cas ] store load fadd store. (Sorry for the much too detailed answer...)

Q2: The cost of their atomic instruction is actually not constant but depends on the level of contention. However, one can interpret this as assuming a certain maximum level of contention, which I think is fine.

Stavros

Q1: Q2:
Andreas

Q1: Keep a pointer to the current value and compare-and-swap the pointer itself so that it points at a per-thread bucket holding the updated value:

```
float *counter;                           // points at the current value
float counter_buckets[NUMBER_OF_THREADS]; // one bucket per thread
int thread_id;

void inc_counter(void) {
    float *old;
    do {
        old = counter;
        counter_buckets[thread_id] = *old + 1.0;
    } while (!CAS(&counter, old, &counter_buckets[thread_id]));
}
```

Q2:

Magnus
Q1: Q2:
Name
Q1: Q2: