One thing that I've noticed programmers (including myself) do when comparing two versions of code is they run it once for both versions and compare the time it takes. If one is faster, then they conclude that that one is faster and take it.
It is possible that this conclusion is correct, that the faster one is indeed actually faster, however the evidence here is not very good. It is also highly possible that the faster one is only faster because some randomness in the execution time made it seem faster. Take a look at this picture:
Suppose this is the graph of the runtimes of two versions of the same function. The variance is unchanged at 16 (so the standard deviation is 4, which means that the average distance from the mean for a test is 4). The green version of the function has an average runtime that is twice as much as the red version, meaning that it is twice as slow. However it is highly possible that your test for the green function ends up in the left tail and the test for the red function ends up in the right tail, meaning that you'll wrongly conclude that the green version is faster. (Note that another good idea here would be to try and reduce the variance of your function).
In order to get a good idea of which one is actually faster, you need to do some analysis, and for this analysis you need to know the variance of the running time. Unfortunately you cannot get an estimate of the variance with one test (mathematically it results in a divide by zero) so you need to have several tests for both versions in order to create a decent sample. How many tests? That depends on a great many factors, but if we assume that the randomness is a random variable that normally distributed (I'll give a tiny bit of info at the end about what happens if it isn't) then the number of tests depends on what level of confidence you want in your results, and how precise you want your estimates to be. If you want to be 95% sure you are correct instead of 90% sure, you'll either need to accept a lower level of precision or run more tests. On the other hand if you want to be more precise, you'll need to either accept a lower level of confidence in your results, or run more tests.
Finally, the number of tests you'll need depends on the variance of the randomness in your tests. A higher variance will need more tests in order to be more precise.
So let's say we've got some data for our two versions of the code and we want to know which one is faster. Unfortunately it is not as simple as taking the average from one and subtracting the other, since this doesn't take into account the variance of the two distributions.
One way to do this is to compute a new dataset containing the differences between the two distributions. You would do this with the formula (assuming xt is the tth observation from one version of the code and yt is the tth observation from the second version):
dt = xt - ytYou can then take the average and variance of the ds and make a conclusion. We want to test if the difference is equal to zero, because if it is then we cannot say that there is a difference between these two samples. We do this by taking a t-statistic, which uses the following formula (assuming d is the difference, and n is the size of d):
t = average(d) / sqrt(variance(d) / n)You then compare this value to the t-distribution with n - 1 degrees of freedom (I'll explain what the degrees of freedom are in a future post) to see what is the probability that we would get this t-statistic assuming that the actual mean of the difference is zero. If this probability is very low, then we can say with a certain level of confidence that we are wrong with our assumption that the mean of the difference is zero. This probability is called the p-value (some people call it the power of the test), and using R you can compute this p-value:
2 * pt( -abs(t), df = n - 1)This will give you a value between 0 and 1, which is the probability that if the mean of the difference was actually zero, there is a p-value percent probability of us getting a sample that gives us this t-statistic. If this probability is really low, then we can say that it is more likely that our assumption of zero mean is wrong. If the difference is not zero, that means that one of the two versions of our code is faster, which one being the one with a lower average runtime.
You can also do this using confidence intervals. This is where you compute a range, and if zero is in this range then you cannot say that the difference is not zero. With a confidence interval you need some t-critical value for your level of confidence. If you want to be C% confident, you can use the formula in R to get the t-critical value (you divide by 2 because the qt() function in R is for a one-tailed test):
abs(qt((1 - C) / 2, df = n - 1))You can then create your confidence interval:
lower_bound = average(d) - tcrit * sqrt(variance(d) / n)If zero is in this interval, then you cannot say that there is a difference between the runtimes of the two functions.
upper_bound = average(d) + tcrit * sqrt(variance(d) / n)
A quick note on normality: we assume here that the randomness in the runtime is normally distributed. If we don't know that it is normal, that means that the "t-statistic" we calculate does not actually follow a t-distribution and the results we get from computing t-statistics and p-values is pretty meaningless. However this problem can be remedied by using a large sample, because as the sample size goes to infinity the distribution of our "t-statistic" converges to the normal distribution, and we can still use the p-value. This also requires a few assumptions, which I will talk about in later posts.
« Statistics For Programmers II: Estimators and Random Variables | Statistics For Programmers IV: Ordinary Least Squares »