Wednesday, October 26, 2011

False Sharing

If multiple threads concurrently change the values in a matrix, which is faster: working through the matrix column-by-column or row-by-row? And why is there a difference?

Let's say we have a square, 2-dimensional matrix of ints that we model like so:
```java
int SIZE    = 1000;
int ROWS    = SIZE;
int COLUMNS = SIZE;

int[][] matrix = new int[ROWS][COLUMNS];
```

Technically, Java doesn't have multi-dimensional arrays. It has arrays of arrays. But for our purposes, let's define the first index as the row and the second as the column.
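A quick illustration of the arrays-of-arrays point: each "row" of the matrix is a plain `int[]` object in its own right, and writing through a reference to that row is the same as writing through the matrix itself (the variable names here are illustrative):

```java
public class ArraysOfArrays {
    public static void main(String[] args) {
        int[][] m = new int[3][4];
        int[] row = m[1];            // a "row" is itself an ordinary int[] object
        row[2] = 42;                 // writing through the row alias...
        System.out.println(m[1][2]); // ...is visible via the matrix: prints 42
    }
}
```

This also means the ints within one row are contiguous in memory, while the row objects themselves may live anywhere on the heap.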

Now, let's say that many threads concurrently change all the values in a given column with the following code:
```java
// change all the values in a fixed column of our matrix
for (int i = 0; i < ROWS; i++) {
    int value = matrix[i][fixed_index];
    matrix[i][fixed_index] = value + 1;
}
```
After this code has finished running, we change it slightly so that the threads now change all the values in a given row:
```java
// change all the values in a fixed row of our matrix
for (int i = 0; i < COLUMNS; i++) {
    int value = matrix[fixed_index][i];
    matrix[fixed_index][i] = value + 1;
}
```
This time, we've fixed the "row" index and let the "column" index vary so all values in a given row change.
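The setup described above can be sketched as a single runnable benchmark. This is a minimal reconstruction, not the original benchmark code: the class name, the choice of `FIXED_INDEX`, and the thread-pool plumbing are all my own assumptions, and the racy (non-atomic) increments mirror the snippets above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FalseSharingDemo {
    static final int SIZE = 1000;
    static final int[][] matrix = new int[SIZE][SIZE];
    static final int FIXED_INDEX = 0; // hypothetical choice of fixed_index

    // Times `threads` threads each making `iterations` passes over the
    // fixed column (or fixed row) of the matrix. The unsynchronized
    // increments are deliberately racy, as in the original snippets.
    static long time(boolean fixColumn, int threads, int iterations)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int iter = 0; iter < iterations; iter++) {
                    for (int i = 0; i < SIZE; i++) {
                        if (fixColumn) {
                            matrix[i][FIXED_INDEX]++; // vary the row
                        } else {
                            matrix[FIXED_INDEX][i]++; // vary the column
                        }
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fixed column: " + time(true, 100, 1000) + " ms");
        System.out.println("fixed row:    " + time(false, 100, 1000) + " ms");
    }
}
```

Absolute numbers will of course differ by machine, but the fixed-column pass should consistently come out slower than the fixed-row pass.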

The results on my MacBook Pro (Intel Core 2 Duo) running 100 threads each performing 1000 iterations look like this:

| Run | Fixed column | Fixed row |
|-----|--------------|-----------|
| 1   | 209          | 91        |
| 2   | 194          | 78        |
| 3   | 201          | 92        |
| 4   | 199          | 60        |
| 5   | 228          | 63        |

[mean time in ms]

As you can see, changing all the values in one of our rows is much faster than changing all the values in one of our columns.

Remembering that there are really no rows and columns in Java (we're actually modeling our matrix using arrays of arrays), the easy way to remember this is the 3 Fs: when it comes to changing the values along a given index of a 2-D array, Fixing the First is Fastest.

This phenomenon is due to a mechanism called "false sharing" (there is a good podcast by Scott Meyers on the matter here).

False sharing slows concurrent changes to RAM. It occurs when one thread is writing to memory that is close to where another thread is writing. Robert Love defines it as when there are "two or more objects mapping to the same cache line despite existing at different addresses in memory" (Linux Kernel Development, 3rd Edition).

A cache line is the fixed-size block in which the CPU cache reads and writes memory; two or more addresses must fall within the same cache line for false sharing to occur. The exact size depends on the architecture, but it is typically 64 bytes.
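A little arithmetic makes this concrete. Assuming the typical 64-byte line (an assumption: your hardware may differ), every run of 16 adjacent ints can collide on a single cache line:

```java
public class CacheLineMath {
    // Assuming a 64-byte cache line (typical but architecture-dependent),
    // how many adjacent ints fit on a single line?
    static int intsPerLine(int cacheLineBytes) {
        return cacheLineBytes / Integer.BYTES; // an int is 4 bytes
    }

    public static void main(String[] args) {
        System.out.println(intsPerLine(64)); // prints 16
    }
}
```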

What's happening in the above example is that the ints in the column of our matrix are evidently stored close together in memory (so close that cache line collisions occur), and the threads are contending to write to that piece of RAM.
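The usual mitigation is padding: spacing each thread's hot data a full cache line apart so that no two threads ever write to the same line. This is a hypothetical sketch (the class and its names are mine, not from this post), again assuming 64-byte lines and 4-byte ints:

```java
// Sketch: give each thread its own counter slot, padded so that
// slots never share an (assumed) 64-byte cache line.
public class PaddedCounters {
    static final int PADDING = 16; // 64 bytes per line / 4 bytes per int

    private final int[] slots;

    PaddedCounters(int threads) {
        slots = new int[threads * PADDING];
    }

    void increment(int threadId) {
        slots[threadId * PADDING]++; // each slot sits on its own line
    }

    int get(int threadId) {
        return slots[threadId * PADDING];
    }
}
```

The trade-off is memory: we burn 60 of every 64 bytes to keep the writers out of each other's way.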

I'll write another post as to how we can prove this.