Mar 24, 2010

Statistics For Programmers V: Performance Analysis

This entry is part of a larger series about statistics and how to do basic statistical analysis. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In the last post, I covered a little bit of theory on how ordinary least squares works. This is a method of fitting a curve to a set of points so that you can get estimates of how independent variables affect a dependent variable.

In this post, we'll use that technique in a programming-related example. Let's say we have a task of optimizing code. There's a certain section that is too slow, and we need to figure out why that section is slow and fix it up. Normally, 99% of the time, the slow part is in your own code, or at least in code that you can access and change (gotta love the prevalence of open source these days!). However, in our hypothetical scenario you're working with a library that is not open source, meaning there's not really any way for you to look at the code to analyze it properly. So what do you do? Well, you could roll up your sleeves, fire up ndisasm (or whatever disassembler you have handy) and take a peek at the assembly code. While this is possible, I wouldn't recommend it: you'll spend so much time digging through everything that you'll be at it all day just trying to figure out what is going on. Then on top of that you'll still have to do your performance analysis, which a lot of the time requires some probability anyway.

So, why not attempt to use statistics for it? If you are using a library, then chances are the functions/methods you are using take some sort of inputs (the independent variables), and there is some kind of (usually) measurable dependent variable that depends on those inputs - like the amount of time the function takes to run, how much memory it uses, etc. What you can do is take various sets of possible input values and record the dependent variable(s). Afterward you form a hypothesis and test it using ordinary least squares.
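
To make that concrete, here's a rough Python sketch of the measurement step. The names time_call and library_function are just hypothetical stand-ins; swap in whatever closed-source routine you're actually profiling.

    import time

    def time_call(func, n, repeats=5):
        """Time func(n) a few times and return the fastest wall-clock duration."""
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(n)
            best = min(best, time.perf_counter() - start)
        return best

    # Hypothetical stand-in for the closed-source routine being characterized.
    def library_function(n):
        return sum(i * i for i in range(n))

    # Record the dependent variable (runtime) over a range of input sizes n.
    sizes = [100, 500, 1000, 5000, 10000, 50000]
    times = [time_call(library_function, n) for n in sizes]
    print(list(zip(sizes, times)))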

What do I mean by form a hypothesis? Well, let's say that you are fairly certain that the algorithm is O(n²). What you could then do is set up your regression equation like this:
T = β₀ + β₁n + β₂n² + ε
After that you could just run your code over and over with different values of n and collect the corresponding execution times. You can then run the regression on the set of data to get the estimates for the coefficients.
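
For example, here's a minimal sketch using Python and the statsmodels package (R's lm(), or any other OLS routine, works just as well). The timing numbers below are made up so the snippet runs on its own; in practice you'd feed in the (n, time) pairs you collected.

    import numpy as np
    import statsmodels.api as sm

    # Made-up (input size, seconds) measurements; replace with real timings.
    sizes = np.array([100, 500, 1000, 5000, 10000, 50000], dtype=float)
    times = np.array([0.011, 0.014, 0.022, 0.160, 0.540, 12.8])

    # Regressors for the model T = b0 + b1*n + b2*n^2 + error.
    X = sm.add_constant(np.column_stack([sizes, sizes ** 2]))
    results = sm.OLS(times, X).fit()

    print(results.params)     # estimates of b0, b1, b2
    print(results.bse)        # their standard errors
    print(results.summary())  # full table: t-statistics, p-values, etc.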

Once you run the regression, most statistical packages will spit out a standard error for each coefficient. This standard error is the estimated standard deviation of the distribution of each coefficient estimate. What does that mean in English? Well, each coefficient has an actual value β, which unfortunately you don't know. If you take a sample of a certain size (and all the assumptions are satisfied), your estimate will effectively be a random value from around β, distributed according to a t-distribution. This distribution has a standard deviation, which again we unfortunately don't know. The standard error is effectively our guess of that standard deviation, based on the sample we have.
What does the standard error do? It determines how precise our estimate of β is. If our estimate of β is 100 and our standard error is 0.1, then there's nearly a 100% chance that the actual value of β is between 99 and 101. However, if the standard error is 10, then the range that the actual β could be in is much larger. You can get smaller standard errors by increasing your sample size, or by using a wider range of values for your independent variables.
Note that when some of our assumptions are violated, it can potentially have a huge effect on the standard errors.
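
If you want to turn an estimate and its standard error into a rough confidence interval yourself, the arithmetic is only a couple of lines. The numbers and degrees of freedom here are invented for illustration; statsmodels' conf_int() will compute this for you as well.

    from scipy import stats

    # Invented numbers: a coefficient estimate, its standard error, and the
    # residual degrees of freedom from the regression.
    estimate, std_err, dof = 100.0, 0.1, 30

    t_crit = stats.t.ppf(0.975, dof)  # critical value for a 95% interval
    low, high = estimate - t_crit * std_err, estimate + t_crit * std_err
    print("95% confidence interval: [%.2f, %.2f]" % (low, high))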

We can use the estimates and the standard errors to form hypotheses about the actual values of the coefficients. Suppose our estimate of β is 5, and we want to test whether the actual value of β is 6. We would then form the hypothesis that β = 6; this is called the null hypothesis. You compute a t-statistic for this null hypothesis using the formula:
t = (est(β) - β) / st.err
If this t-statistic is small (the threshold depends on your sample size and how confident you want to be in your results), then you cannot reject the null hypothesis - this does not mean that the null hypothesis is true; it just means that this sample is not enough evidence to say that it is false. It could be true that β = 5.5, and we could still get a sample like this (depending on the standard error, of course). So if the t-statistic is small, we can't say a whole lot about our null hypothesis.
On the other hand if the t-statistic is large, it provides a good amount of evidence that the null hypothesis is wrong. So typically when you test a hypothesis, you let the null hypothesis be something that you want to "prove" wrong (I say "prove" since you can't really have 100% proof in statistics like you can in math).
Most stats packages like R or even Excel will spit out a calculated t-statistic for each coefficient. What this value represents is the t-statistic for the null hypothesis of the coefficient being equal to zero. If that t-statistic is small, we say that the coefficient is not statistically significant - that is, we can't really say that it is any different from zero.
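
If you want to test a null hypothesis other than "the coefficient is zero", it's easy to do by hand. Here's a small sketch for the β = 6 example above, with made-up numbers for the estimate, standard error, and degrees of freedom:

    from scipy import stats

    # Made-up numbers: estimate of beta, its standard error, residual d.o.f.
    estimate, std_err, dof = 5.0, 0.4, 30
    beta_null = 6.0                             # null hypothesis: beta = 6

    t_stat = (estimate - beta_null) / std_err
    p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
    print(t_stat, p_value)  # a large |t| (small p) is evidence against beta = 6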

What does all this mean for testing the code? Well, it lets us get an idea of how fast the code runs. Suppose we get an estimated equation of:
T = 100 + 30n + 0.5n²
Provided that our standard errors for the coefficients are fairly small, this equation shows us several things. First, it shows that there is a fixed cost to the code: no matter the size of the input, we still pay that 100 each time it runs. It also shows a fairly small estimate on the n² coefficient. This is useful for comparing this algorithm to another. Suppose you come up with a fancy new algorithm that you figure runs in O(n) time. This is theoretically better than the O(n²) algorithm, but if its equation looks like this:
T = 2000 + 400n
then the O(n²) algorithm is much better if you're only ever working with small values of n.
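
A quick back-of-the-envelope check makes the point. Plugging a few values of n into both fitted equations (using the made-up coefficients above) shows the quadratic version stays cheaper until the two curves cross at roughly n ≈ 745:

    # Compare the two fitted cost equations for a few input sizes.
    def t_quadratic(n):
        return 100 + 30 * n + 0.5 * n ** 2

    def t_linear(n):
        return 2000 + 400 * n

    for n in [10, 100, 500, 1000, 2000]:
        print(n, t_quadratic(n), t_linear(n))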

So this is a basic introduction to using statistics for performance analysis. Unfortunately, there are still plenty of things that can go wrong. I listed a few assumptions in the last post; if those don't hold, then you can have problems such as bias and inefficiency (I'll talk about these more in later posts). Also, if your model is misspecified (if, say, the algorithm is actually O(n·log n) instead of O(n²)), then you probably won't get correct results. Fortunately there are tests for this kind of thing that you can do, which I might cover in a later post.

