Feb 15, 2010

Statistics For Programmers III: Differences of Means

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

One thing that I've noticed programmers (including myself) do when comparing two versions of code is they run it once for both versions and compare the time it takes. If one is faster, then they conclude that that one is faster and take it.

It is possible that this conclusion is correct, that the faster one is indeed actually faster, however the evidence here is not very good. It is also highly possible that the faster one is only faster because some randomness in the execution time made it seem faster. Take a look at this picture:

Suppose this is the graph of the runtimes of two versions of the same function. The variance is unchanged at 16 (so the standard deviation is 4, which means that the average distance from the mean for a test is 4). The green version of the function has an average runtime that is twice as much as the red version, meaning that it is twice as slow. However it is highly possible that your test for the green function ends up in the left tail and the test for the red function ends up in the right tail, meaning that you'll wrongly conclude that the green version is faster. (Note that another good idea here would be to try and reduce the variance of your function).

In order to get a good idea of which one is actually faster, you need to do some analysis, and for this analysis you need to know the variance of the running time. Unfortunately you cannot get an estimate of the variance with one test (mathematically it results in a divide by zero) so you need to have several tests for both versions in order to create a decent sample. How many tests? That depends on a great many factors, but if we assume that the randomness is a random variable that normally distributed (I'll give a tiny bit of info at the end about what happens if it isn't) then the number of tests depends on what level of confidence you want in your results, and how precise you want your estimates to be. If you want to be 95% sure you are correct instead of 90% sure, you'll either need to accept a lower level of precision or run more tests. On the other hand if you want to be more precise, you'll need to either accept a lower level of confidence in your results, or run more tests.
Finally, the number of tests you'll need depends on the variance of the randomness in your tests. A higher variance will need more tests in order to be more precise.

So let's say we've got some data for our two versions of the code and we want to know which one is faster. Unfortunately it is not as simple as taking the average from one and subtracting the other, since this doesn't take into account the variance of the two distributions.

One way to do this is to compute a new dataset containing the differences between the two distributions. You would do this with the formula (assuming xt is the tth observation from one version of the code and yt is the tth observation from the second version):
dt = xt - yt
You can then take the average and variance of the ds and make a conclusion. We want to test if the difference is equal to zero, because if it is then we cannot say that there is a difference between these two samples. We do this by taking a t-statistic, which uses the following formula (assuming d is the difference, and n is the size of d):
t = average(d) / sqrt(variance(d) / n)
You then compare this value to the t-distribution with n - 1 degrees of freedom (I'll explain what the degrees of freedom are in a future post) to see what is the probability that we would get this t-statistic assuming that the actual mean of the difference is zero. If this probability is very low, then we can say with a certain level of confidence that we are wrong with our assumption that the mean of the difference is zero. This probability is called the p-value (some people call it the power of the test), and using R you can compute this p-value:
2 * pt( -abs(t), df = n - 1)
This will give you a value between 0 and 1, which is the probability that if the mean of the difference was actually zero, there is a p-value percent probability of us getting a sample that gives us this t-statistic. If this probability is really low, then we can say that it is more likely that our assumption of zero mean is wrong. If the difference is not zero, that means that one of the two versions of our code is faster, which one being the one with a lower average runtime.

You can also do this using confidence intervals. This is where you compute a range, and if zero is in this range then you cannot say that the difference is not zero. With a confidence interval you need some t-critical value for your level of confidence. If you want to be C% confident, you can use the formula in R to get the t-critical value (you divide by 2 because the qt() function in R is for a one-tailed test):
abs(qt((1 - C) / 2, df = n - 1))
You can then create your confidence interval:
lower_bound = average(d) - tcrit * sqrt(variance(d) / n)
upper_bound = average(d) + tcrit * sqrt(variance(d) / n)
If zero is in this interval, then you cannot say that there is a difference between the runtimes of the two functions.

A quick note on normality: we assume here that the randomness in the runtime is normally distributed. If we don't know that it is normal, that means that the "t-statistic" we calculate does not actually follow a t-distribution and the results we get from computing t-statistics and p-values is pretty meaningless. However this problem can be remedied by using a large sample, because as the sample size goes to infinity the distribution of our "t-statistic" converges to the normal distribution, and we can still use the p-value. This also requires a few assumptions, which I will talk about in later posts.

« Statistics For Programmers II: Estimators and Random Variables | Statistics For Programmers IV: Ordinary Least Squares »


Duane Johnson said...

The link back to the first article didn't work for me, so here is the link for anyone else in the same spot: http://lovehateubuntu.blogspot.com/2010/01/statistics-for-programmers-i-problem.html

Rob Britton said...

Whoops, thanks Duane! I fixed the link, looks like I must have spaced when writing it out.

Blues said...

Hi Rob,

These are good articles, but could you explain how to do the math without R? ie, if I wanted to do the calculations on paper, or write this in javascript, what are the formulae I'd use, and what's the logic behind them?

I currently use the following to get a 95% confidence interval:

1.96 σ/√n

Where n is the sample size and sigma is the standard deviation. I don't really understand the significance of the 1.96. It has its own wikipedia page, but a lot of it goes over my head

Thanks again for publishing all this. I actually spoke at a conference in Montreal this March about the statistics of performance measurement. My slides are here:

Feel free to critique the content.

Rob Britton said...

Hi Blues: Unfortunately the math is very hard to do without computer help. It involves taking the integral of a function that doesn't have an anti-derivative, which means you're going to need numerical approximation. Doing that by hand is extremely tedious, which is why we use computers for it!

Where does the 1.96 come from? Well, the area under the standard normal distribution (the bell curve with a mean of 0, and a standard deviation of 1) from -1.96 to 1.96 is equal to 0.95, which is the confidence level we want. In general if you want C% confidence, then you need to choose your critical point c (in the case of 95% confidence and the normal distribution c = 1.96) such that the area under the curve between -c and c is equal to C - I think this would be easier to show using a picture, but I'm not sure how to add a picture in a comment.

Intuitively, this means that if the population mean is zero and the standard deviation is σ/√n, then 95% of the time when we take a random sample, the sample average will fall between -1.96 * σ/√n and 1.96 * σ/√n. Therefore it is a low probability that when you take a sample it will be outside that range.

To do this via code, you'd need to do an approximation. You can see the Wikipedia article about this here which talks about how you approximate the area under a function using code. What you'd do is use the function for the normal distribution (the formula is 1/sqrt(2*pi) * exp(-x^2/2) ) and find a number c such that the area between -c and c under this function is equal to 0.95 (or since this is symmetric you can go from 0 to c and use 0.475 as your area). This would be an iterative procedure, you'd take an initial guess of c and then each iteration improve it so that the area under the curve between 0 and c would get closer and closer to 0.475 and quit when you get close enough.

Finally a small note, using 1.96 only works if you know what the population standard deviation σ is. If this is not the case you use the sample standard deviation s and you'd follow the t-distribution. Fortunately though for sample sizes greater than 30 or so the t-distribution is close enough to the normal distribution for you to use 1.96 regardless. It's just if your sample size is small, then 1.96 is actually too small for a 95% confidence interval.

Anyway let me know if that helps, I might post another article on how to calculate these critical values.

I looked at your slides, they look good! There was nothing really there that jumped out at me that was wrong. Next time you're in Montreal giving a talk, let me know!

Blues said...

Thanks, this explains things well. For most of my data I have sample sizes larger than 100, but there are a few with really small sample sizes, so I'm guessing that I should just not bother with those.