Wednesday, October 26, 2011

False Sharing

If multiple threads concurrently change the values in a matrix, which is faster: working through the matrix column-by-column or row-by-row? And why is there a difference?

Let's say we have a square, 2-dimensional matrix of ints that we model like so:
int SIZE    = 1000;
int ROWS    = SIZE;
int COLUMNS = SIZE;

int[][] matrix = new int[ROWS][COLUMNS];

Technically, Java doesn't have multi-dimensional arrays. It has arrays of arrays. But for our purposes, let's define the first index as the row and the second as the column.
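To see what "arrays of arrays" means in practice, here is a small sketch (class name is mine): each `matrix[i]` is its own independent `int[]` object, which the JVM may place anywhere on the heap. Only the ints *within* one row are guaranteed to sit next to each other.

```java
// Each matrix[i] is an independent int[] object on the heap; only the
// elements within a single row are guaranteed to be contiguous.
public class ArraysOfArrays {
    public static void main(String[] args) {
        int[][] matrix = new int[3][3];

        // Rows can be swapped wholesale -- something a true 2-D block
        // of memory would not allow.
        int[] tmp = matrix[0];
        matrix[0] = matrix[2];
        matrix[2] = tmp;

        // A row can even be replaced with one of a different length
        // (a "jagged" array).
        matrix[1] = new int[10];
        System.out.println(matrix[1].length); // prints 10
    }
}
```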

Now, let's say that at the same time our many threads change all the values in a given column with the following code:
  // change all the values in a fixed column of our matrix
  for (int i = 0; i < ROWS; i++) {
      int value = matrix[i][fixed_index];
      matrix[i][fixed_index] = value + 1;
  }
and after this code has finished running, we change it slightly so that now the many threads change all the values in a given row.
  // change all the values in a fixed row of our matrix
  for (int i = 0; i < COLUMNS; i++) {
      int value = matrix[fixed_index][i];
      matrix[fixed_index][i] = value + 1;
  }
This time, we've fixed the "row" index and let the "column" index vary so all values in a given row change.
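A minimal harness along these lines might look like the following. The class and method names are mine, the timing is deliberately crude, and the post's original code may well differ; it simply spins up the threads, points each one at the same fixed column (or row), and measures wall-clock time.

```java
import java.util.concurrent.CountDownLatch;

// Rough sketch of the experiment: many threads hammering a fixed
// column (or a fixed row) of a shared int[][] matrix.
public class Benchmark {
    static final int SIZE = 1000;
    static final int[][] matrix = new int[SIZE][SIZE];

    // Starts the given number of threads, each performing the given
    // number of passes over a fixed column (or row), and returns the
    // elapsed wall-clock time in milliseconds.
    static long runThreads(boolean fixedColumn, int threads, int iterations)
            throws InterruptedException {
        final int fixed = SIZE / 2;
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                for (int n = 0; n < iterations; n++) {
                    for (int i = 0; i < SIZE; i++) {
                        if (fixedColumn) {
                            matrix[i][fixed] = matrix[i][fixed] + 1;
                        } else {
                            matrix[fixed][i] = matrix[fixed][i] + 1;
                        }
                    }
                }
                done.countDown();
            }).start();
        }
        done.await();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // 100 threads, 1000 iterations each, as in the post
        System.out.printf("fixed column: %d ms%n", runThreads(true, 100, 1000));
        System.out.printf("fixed row:    %d ms%n", runThreads(false, 100, 1000));
    }
}
```

Note that the increments are not atomic, so the final matrix values are not meaningful; only the relative timings matter here.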

The results on my MacBook Pro (Intel Core 2 Duo) running 100 threads each performing 1000 iterations look like this:

Run | fixed column | fixed row

[mean time in ms. Apologies for the formatting; Blogger keeps trying to eat whitespace.]

As you can see, changing all the values in one of our rows is much faster than changing all the values in one of our columns.

Remembering that there are really no rows and columns in Java (we're actually modeling our matrix using arrays of arrays), an easy way to remember this is the 3 Fs: when it comes to changing the values along a given index of a 2-D array, Fixing the First is Fastest.

This phenomenon is due to a mechanism called "false sharing" (there is a good podcast by Scott Meyers on the matter here).

False sharing slows concurrent changes to RAM. It occurs when one thread is writing to memory that is close to where another thread is writing. Robert Love defines it as when there are "two or more objects mapping to the same cache line despite existing at different addresses in memory" (Linux Kernel Development, 3rd Edition).

The region of memory within which these two or more addresses must fall for this to happen is called a cache line. Its exact size depends on the architecture, but it is typically 64 bytes.
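With 64-byte lines and 4-byte Java ints, 16 ints fit on one line. A simplified sketch of the arithmetic (assuming the array data starts exactly on a line boundary, which real object headers and alignment make only approximate):

```java
// Simplified mapping from an array index to a cache line number,
// assuming a 64-byte line, a 4-byte int, and line-aligned array data.
public class CacheLines {
    static final int LINE_BYTES = 64;
    static final int INT_BYTES  = 4;

    static int lineOf(int index) {
        return (index * INT_BYTES) / LINE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(lineOf(0));  // 0
        System.out.println(lineOf(15)); // 0 -- same line: false sharing possible
        System.out.println(lineOf(16)); // 1 -- next line
    }
}
```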

What's happening in the above example is that the ints in a column of our matrix are evidently stored close together - so close that some cache line collisions are occurring - and the threads are contending to write to those pieces of RAM.

I'll write another post as to how we can prove this.


  1. The difference you've measured is probably more related to cache and lack of spatial locality. In the column case, you're only reading one value from each cache block, then forcing the system to fetch a new one. When updating the row, you have spatial locality, and get the benefit of the processor already having many of your values in cache.

    1. It's a very good point but I would imagine a number of threads acting independently should effectively randomize this. For example, thread number 7 (say) can have its next cache line value swapped out by thread number 95 (say) even if it's reading rows rather than columns.

      To demonstrate my theory, I modified my original code to use longs rather than ints and to skip the next 7 values after each 'hit'. This ensures that all values (row or column) are on different cache lines since my cache line is only 64 bytes on my MacBook (sysctl -a | grep cachelinesize).

      Now, the results are pretty much the same for both 'rows' and 'columns' - a mean time of about 50ms. This suggests to me that it is indeed a phenomenon due to cache lines.

      Let me know what you think :)
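      The stride trick described in this comment might be sketched like so (names and constants are mine, not the commenter's actual code): using 8-byte longs and stepping 8 elements at a time puts 64 bytes between consecutively touched values, so with a 64-byte line each touched value lands on its own cache line.

```java
// Sketch of the comment's modification: longs instead of ints, and
// skipping the next 7 values after each hit, so consecutive touched
// elements are 8 longs * 8 bytes = 64 bytes apart (one cache line).
public class StridedUpdate {
    static final int SIZE = 1000;
    static final long[][] matrix = new long[SIZE][SIZE];

    static void bumpRow(int fixedIndex) {
        for (int i = 0; i < SIZE; i += 8) {
            matrix[fixedIndex][i] = matrix[fixedIndex][i] + 1;
        }
    }

    public static void main(String[] args) {
        bumpRow(0);
        System.out.println(matrix[0][0]); // 1
        System.out.println(matrix[0][8]); // 1
        System.out.println(matrix[0][1]); // 0 -- skipped
    }
}
```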