Showing posts with label R. Show all posts

Monday, October 20, 2014

New to Java 8: Adders


I was reading this blog and playing with its suite that tests how fast reads and writes to a long are, when I came across a class new to JDK 1.8: LongAdder.

(The test suite has a problem: the class that uses optimistic locking with a StampedLock ignores whether it actually attained the lock, but I'll deal with that elsewhere.)

It's quite a cunning idea. If there is contention on the field containing the long value, the thread that fails to attain the lock creates a new memory location (a Striped64.Cell object) and adds its value there. When a reader wants to see the results of many threads mutating the value, it gets back not just the base value but also everything in the additional memory. Thus contention is minimized.

There is a caveat when it comes to adding up all the additional values. From the JavaDoc:

The returned value is NOT an atomic snapshot; invocation in the absence of concurrent updates returns an accurate result, but concurrent updates that occur while the sum is being calculated might not be incorporated.
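Despite that caveat, the class is simple to use. Here is a minimal sketch (LongAdder, increment() and sum() are the real java.util.concurrent.atomic API; the AdderDemo class and its sumWith helper are just names of my own for illustration):

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {

    // Spawn nThreads threads, each incrementing the adder perThread times.
    static long sumWith(int nThreads, int perThread) {
        LongAdder adder = new LongAdder();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    adder.increment(); // under contention, updates are striped across Cells
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        // sum() adds the base value plus every Cell; per the JavaDoc, it is
        // not an atomic snapshot if updates are still in flight.
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(sumWith(4, 250_000)); // prints 1000000 once all threads have joined
    }
}
```

Because all the writers have joined before sum() is called, the caveat above doesn't bite here; it only matters when you read the sum while updates are still happening.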

Anyway, I updated the suite mentioned above, adding a modified optimistic-locking class that retries in the event of not attaining the lock (OPTIMISTIC_CORRECT in the diagram below). The log output looks like this:

name,threads,duration
ADDER,16,782
ADDER,16,810
ADDER,16,769
ADDER,16,792
ADDER,16,771
ADDER,16,793
ADDER,16,735
ADDER,16,777
ADDER,16,768
ADDER,16,849
ATOMIC,16,514
.
.

and wrote some R code to process it and give me a graphic of all the different techniques:

.libPaths("/home/henryp/R/x86_64-unknown-linux-gnu-library/3.1/")
require("epade")
data            <- read.csv("/home/henryp/Code/R/JavaLocks/firstPastPost1W15R.csv")

groupBy         <- list(data$name, data$threads)
aggregatedMeans <- aggregate(data,
                            by=groupBy,
                            FUN=mean,
                            na.rm=TRUE)
aggregatedSds   <- aggregate(data,
                             by=groupBy,
                             FUN=sd,
                             na.rm=TRUE)

meansAndSds     <- cbind(aggregatedMeans[,c(1,5)],
                         aggregatedSds[,5])

ordered         <- meansAndSds[order(meansAndSds[,2]),]

bar3d.ade(data.matrix(ordered[,2],
                      rownames.force = NA),
          main="10000000 reads of a monotonically incrementing long",
          y=ordered[,1],
          zticks=ordered[,1],
          xticks="1 writer, 15 readers",
          ylab="Duration (ms)")

The means and standard deviations look like this:

> ordered
             Group.1 duration aggregatedSds[, 5]
3              DIRTY    211.9           11.95780
2             ATOMIC    594.2           71.89777
9           VOLATILE    641.0           72.22188
1              ADDER    708.7           73.71122
5 OPTIMISTIC_CORRECT   1379.3          141.92334
4         OPTIMISTIC   7518.0         1392.87967
8       SYNCHRONIZED   8682.4         1507.51283
6             RWLOCK  10515.4          665.83251
7            STAMPED  21859.3         5076.10047

or, graphically:


Note: this has one writing thread and 15 reading threads on a 16-core (using hyperthreading) CPU. DIRTY means there was no locking: the fastest but least accurate technique. Duration (the vertical axis) is in milliseconds. I don't know why this axis is not labelled when I save the image, as it is labelled when displayed in RStudio.

So, ADDER is indeed fast but not as fast as ATOMIC (which is also more accurate) on this computer and with this read/write profile. I'll try other numbers of readers and writers later.


Monday, October 13, 2014

R Tip of the Day


If I'd known this a month ago, I'd have saved myself a lot of time. I was pulling GC data from a JVM's logs but my pattern matching wasn't perfect so I was getting a lot of dirty data (NA in R).

To clean these from a vector, you can do something similar to this:

data      <- c( 1, 2, 3, NA, 5, 6 ) # data with some dirty elements
badData   <- is.na(data)            # elements are TRUE or FALSE
cleanData <- data[!badData]         # keep only the non-NA elements

Similarly, if you have 2-dimensional (or higher) data, you can drop every row in which even one element of the tuple is dirty. For example:

x <- c( 1,  2,  3,  NA, 5,  6 )
y <- c( 10, 20, NA, 40, 50, 60 )
cleaned <- complete.cases(x, y)

will drop all rows where either x or y (or both) is dirty:

> x[cleaned]
[1] 1 2 5 6
> y[cleaned]
[1] 10 20 50 60


Monday, August 18, 2014

R


R is a great tool for plotting performance graphs. Every developer who worries about the performance profile of his application should learn it. Here are some miscellaneous scripts I've been using for the last few months.

But first...

A Crash Course in R

R is Modular

You might find that the standard installation does not have all the modules you need. Any R GUI will help you download the ones you want. But to use them, you need something like this:

.libPaths("YOUR_MODULE_DIRECTORY")
require("MODULE_NAME")

Think of this as setting the classpath and then using import in your Java files.

Loading the data

Take this example:

timings_data <- read.table("C:/Users/henphi/db_saves.txt")
timings      <- c(timings_data$V15)
times        <- timings_data$V1

The first line reads the raw data that I saved in a file.

The second line takes the 15th column of the data (think of awk's $15 in Unix) and puts it into a vector. Like awk, R counts the first column as 1, not 0; the row numbers you see are row names rather than a 0th column. We put the data into a vector using the c() function.

The third line is just like the second but takes the 1st column ($V1). Here I chose not to use the c() function, and that's fine: in both cases it doesn't matter, as the columns are vectors already. In the paragraph above, I just wanted to introduce the standard way of making vectors in R.

Coercing the Data

We've already mentioned one data structure: the vector, created using the c() function. Another is the list, created using (unsurprisingly) the list() function.

In this example, I am taking the variables defined in the section above and manipulating the types to make something I can plot.

data <- structure (
    list(
        time = structure(
            as.POSIXct(strptime(times, '%H:%M:%S')),
            class = c("POSIXct", "POSIXt"),
            tzone = ""
        ),
        variable = timings
    ),
    .Names    = c("time", "variable"),
    row.names = c(NA, -10L),
    class     = "data.frame"
)

Here, I have made a data structure of a type called a data frame. "Although a data set of 50 employees with 4 variables per worker has the look and feel of a 50x4 matrix, it does not qualify as such in R because it mixes types. Instead of a matrix we use a data frame. A data frame in R is a list with each component of the list being a vector corresponding to a column in our 'matrix' of data... Typically, data frames are created by reading in a data set from a file". [1]

As part of this structure, we have a list with two entries that go by the arbitrary names of time and variable. Variable is simple: it's just the timings vector that we loaded from a file earlier.

The other element of the list, time, holds the string representations of our times coerced to a type that represents a date. This is done by the strptime function, which converts the strings in our data file to a standard type using a format string (not a regexp). Once in the standard format, they are converted to a Unix time-from-epoch using as.POSIXct.

(Since the data file only has time of day entries in it, R will assume they belong to the day on which the line runs and give it "today's" date. That's OK as I know my data is only within an hour but I don't care which hour it was.)

The .Names element defines the names of the data objects within this structure. They would be used as defaults in the graphs we generate, but we'll override them later.

A real world example

Let's say we want to plot the Garbage Collection times of our app. A couple of command line switches will help.

  • -XX:+PrintGCApplicationStoppedTime, which GC guru Gil Tene considers "probably the only number in GC logs that you should sort-of-semi-believe."
  • -XX:+PrintGCDateStamps which makes the logs much easier to deal with than time-since-epoch.
  • -Xloggc:/YOUR/FILE/HERE to make the JVM output log stats in the first place.

I clean this file up just to grab the application pause times:

$ grep "Total time for which application threads were stopped" jstringserver.gc.log > jstringserver_cleaned.gc.log 

such that the slightly cleaned GC logs look like this:

$ more /home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log
2014-08-18T15:17:29.824+0100: 1.096: Total time for which application threads were stopped: 0.0004190 seconds
2014-08-18T15:17:32.895+0100: 4.166: Total time for which application threads were stopped: 0.0001930 seconds
2014-08-18T15:18:40.902+0100: 72.173: Total time for which application threads were stopped: 0.0001650 seconds
2014-08-18T15:18:43.903+0100: 75.174: Total time for which application threads were stopped: 0.0000450 seconds
.
.

Now, my R script looks like this:

gc_data <- read.table("/home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log")

data <- structure (
  list(
    time = structure(
      as.POSIXct(strptime(gc_data$V1, '%Y-%m-%dT%H:%M:%S')),
      class = c("POSIXct", "POSIXt"),
      tzone = ""
    ),
    variable = gc_data$V11
  ),
  .Names    = c("time", "variable"),
  class     = "data.frame"
)

plot(
  data,
  xaxt="n",
  type="h",
  main="GC times for JStringServer",
  ylab="GC Pause Time (ms)",
  xlim=c(as.POSIXct(strptime("15:20", "%H:%M")),
         as.POSIXct(strptime("15:30", "%H:%M")))
)

axis.POSIXct(side=1, at=cut(data$time, "min"), format="%H:%M")

Here I have zoomed in on a period of time I am interested in, namely from 15:20 to 15:30, and axis.POSIXct just puts the time ticks on the x-axis after we disabled x-axis marking in the plot() command with xaxt="n".

The generated graph looks like:


Note that if you prefer, you can set log="y" as an argument in the plot command and have the y-axis scaled logarithmically.


[1] The Art of R Programming.

Sunday, June 15, 2014

Lies, Damned Lies and Performance Statistics


A new way of measuring performance

I've introduced HdrHistogram to my pet project, JStringServer. The advantage of HdrHistogram is that it keeps the results in constant memory with constant access time. The only cost is some approximation of the results, and since I'm only interested in means, standard deviations etc., I can live with that.

HdrHistogram works by keeping an array of "buckets", each representing a range of values. Any given value that falls into a particular bucket merely increments that bucket's counter. This is much more memory efficient than, say, Apache's JMeter, which keeps every reading in memory.

The ranges of these buckets grow exponentially (well, there are sub-buckets, but let's keep this simple), so all the outliers are collected in just one or two buckets. This is very memory efficient, as each bucket is just a single Java long.
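The bucketing idea can be sketched in a few lines. This is a toy, not HdrHistogram's actual layout (the real thing uses sub-buckets for precision); here bucket i simply covers the range [2^i, 2^(i+1)), and ToyHistogram is a name of my own:

```java
// Toy exponential-bucket histogram: bucket i counts values in [2^i, 2^(i+1)).
// Outliers, however large, all land in a handful of high buckets.
public class ToyHistogram {
    private final long[] counts = new long[64]; // one Java long per bucket

    void record(long value) {
        // floor(log2(value)) via the leading-zero count; values < 1 go to bucket 0
        int bucket = 63 - Long.numberOfLeadingZeros(Math.max(value, 1));
        counts[bucket]++;
    }

    long countAt(int bucket) {
        return counts[bucket];
    }

    public static void main(String[] args) {
        ToyHistogram h = new ToyHistogram();
        h.record(1);
        h.record(5);
        h.record(6);         // 5 and 6 share bucket 2, i.e. the range [4, 8)
        h.record(1_000_000); // the outlier lands alone in bucket 19
        System.out.println(h.countAt(2)); // prints 2
    }
}
```

Recording is just an array index plus an increment, and memory stays fixed at 64 longs no matter how many values are recorded, which is the property the post is relying on.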

Means, standard deviations and all that

Most performance tools like Apache's JMeter will give you a mean and a standard deviation. You might have even heard that 68% of results are within a standard deviation of the mean, 95% within 2 standard deviations, 99.7% within 3 etc (see here for the rule). But this is only true for a normal distribution (the typical, bell-shaped curve).

The time it takes for a given measurement, independent of all others, will typically conform to a Poisson distribution. (See Mathematical Methods in the Physical Sciences by Mary Boas for a clear derivation of this formula from first principles.)

Results

Typically, I see:

Initiated 412759 calls. Calls per second = 10275. number of errors at client side = 0. Average call time = 10ms
ThreadLocalStopWatch [name=read, totalCallsServiced=412659, max time = 630ms, Mean = 6.125236575477573, Min = 0, standard deviation 8.263969133274433, Histogram with stepsize = 1. Approximate values: 791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2]
ThreadLocalStopWatch [name=write, totalCallsServiced=412719, max time = 26ms, Mean = 0.08618696982692825, Min = 0, standard deviation 0.327317028847167, Histogram with stepsize = 1. Approximate values: 411831, 33604, 771, 238, 80, 32, 17, 8, 1, 1, 2, 4, 5, 3, 0, 0, 1, 1, 1, 1, 0, 0, 1, 2, 5, 8, 4, 0, 0, 0]
ThreadLocalStopWatch [name=connect, totalCallsServiced=412720, max time = 3005ms, Mean = 4.17902209730568, Min = 0, standard deviation 65.54450133531354, Histogram with stepsize = 1. Approximate values: 408721, 117052, 2056, 852, 282, 114, 57, 30, 15, 12, 10, 10, 15, 10, 3, 5, 4, 3, 2, 1, 3, 4, 2, 0, 1, 1, 0, 1, 1, 0]
ThreadLocalStopWatch [name=total, totalCallsServiced=410682, max time = 66808064ns, Mean = 6295825.973268021, Min = 854104, standard deviation 1089344.7561186429, Histogram with stepsize = 1333333. Approximate values: 145, 837, 1596, 25512, 273990, 93021, 12099, 2283, 520, 273, 126, 77, 80, 55, 11, 9, 5, 9, 10, 11, 15, 5, 3, 2, 3, 3, 0, 0, 4, 1, 2]

For something that approximates a Poisson distribution, we'd expect the mean and the variance to be about the same. Since this is not true for the total time, perhaps it is not a Poisson distribution. The results for reading data come closest, so let's look at them.

Let's see if the distribution of the values conform to the Poisson distribution or even the Normal (aka Gaussian). In the R language:

require(graphics)

# Histogram of times. Each interval is 1ms starting from 1ms
xread <- c(791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2)
readMean = 6.125236575477573
readSd   = 8.263969133274433
readNum  = 412659

par(mfrow = c(3, 1)) # we're looking at 3 graphs

n      = readNum
x      = xread
mean   = readMean
sd     = readSd

ac <- barplot(x, main="Actual Distribution", axes=TRUE) 
axis(1, at=ac, labels=TRUE)

i <- 1:length(x)

expectedPoisson = n * dpois(i, lambda=mean)
expectedNorm    = n * dnorm(i, mean=mean, sd=sd)

nm <- barplot(expectedNorm, main="Expected Normal Distribution")
axis(1, at=nm, labels=TRUE)

pn <- barplot(expectedPoisson, main="Expected Poisson Distribution")
axis(1, at=pn, labels=TRUE)

chisq.test(x, p=expectedPoisson, rescale.p=TRUE)
chisq.test(x, p=expectedNorm,    rescale.p=TRUE)

The last two lines test whether the actual distribution conforms to a Poisson (given a lambda value) or a Normal (given the mean and the standard deviation). They do this using the chi-squared test (basically, a comparison of the squared differences between what we expect and what we actually got).
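The statistic itself is easy to compute by hand. Here's a minimal sketch in Java (ChiSq is a name of my own; R's chisq.test additionally rescales the expected counts to probabilities and turns the statistic into a p-value):

```java
// Pearson's chi-squared statistic: sum over bins of (observed - expected)^2 / expected.
public class ChiSq {
    static double statistic(double[] observed, double[] expected) {
        double chi2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = observed[i] - expected[i];
            chi2 += d * d / expected[i]; // a large deviation in any bin inflates the statistic
        }
        return chi2;
    }

    public static void main(String[] args) {
        double[] observed = {48, 52}; // e.g. 100 coin flips
        double[] expected = {50, 50}; // what a fair coin predicts
        System.out.println(statistic(observed, expected)); // prints 0.16
    }
}
```

The bigger the statistic relative to the number of bins, the less plausible it is that the observed counts came from the expected distribution.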

For these particular results, the p-values from the chi-squared tests tell me they conform to neither a Poisson nor a Normal distribution (in fact, they could barely be further from them). Looking at the graphs, you can see this by eye:



So, what went wrong? Well, the Poisson distribution relies on events being independent of each other, which is clearly not the case when there is contention for resources.

The Central Limit Theorem

(Aside:)

The means of samples drawn from a distribution will, given enough runs, form a normal distribution, even if the original distribution is not itself normal.

Let's look at plotting the Poisson distribution in R:

mu=1             # mean (a Poisson needs no separate standard deviation)
n=5000           # Number of iterations
xbar=rep(0,n)    # Holds the results of the iterations

for (i in 1:n) { 

  xbar[i]=mean(
    rpois(5000, mu) # 5000 random draws with mean mu
    )
}

par(mfrow = c(2, 1))


# Plot a typical Poisson distribution

hist(rpois(n, mu),prob=TRUE,breaks=15)

# Plot the results of n Poisson distributions

hist(xbar,prob=TRUE)

Gives this result:

A typical Poisson distribution and the mean of a Poisson over many iterations
This is quite nice: it means that if you run performance tests repeatedly, your means will approach a normal distribution, giving you some predictive power over them.