
Apr 8, 2010

Statistics For Programmers VI: Omitted Variable Bias

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In the fourth post in this series, I spoke briefly about the assumptions required for using regression to analyze your data. One of them was about the error term having zero mean conditional on your independent variables - this means that if you hold your independent variables fixed, then the average of your errors should be close to zero.

There are certain situations when this assumption can be violated. One common one that I'll be talking about today (obviously, given the title of the post) is called omitted variable bias. This occurs when you don't include an independent variable that has an effect on the dependent variable, and also is correlated with other independent variables (I believe omitted variables are also called confounding variables). What happens when you use ordinary least squares is that your regression doesn't know that the variation in the omitted variable (let's call it c) is causing changes in the dependent variable (call it y). Assuming that c is correlated with an independent variable x, the regression cannot tell whether the variation in y is being caused by x or by c since you haven't included c in the model. Therefore it will attribute all of that variation to x.

Let's look at an example. Suppose you've been arguing with someone on reddit about agile vs. waterfall and you want to prove which one of these makes programmers more productive (while I don't know if one is actually better or not, let's assume for illustration purposes that one of them is). Now suppose you've read all my previous posts on statistics, and you've taken a course in statistics in university, so you feel fairly confident in your ability to do statistical analysis. You go out and start collecting some data from various companies (suppose they cooperate and give you their data, and that it is reliable). You collect various types of data: how big their team is, the experience of the team members, the language(s) and libraries they're using, and finally whether they are using agile or waterfall. Also suppose you have some reliable way of measuring productivity (this is another one we'll have to assume, since measuring programmer productivity isn't that easy). So you run a regression on all this and it turns out that agile has a significantly larger effect on programmer productivity than waterfall. You post your results on reddit and say, "Ha! The data says so!"

Unfortunately there is a problem here. What you haven't included is the average ability of the programmers at the various companies. It is fairly certain that the average ability of your programmers will impact productivity (of course that requires you to hold constant any synergistic effects between programmers - see, this stuff is hard!). Again for illustration purposes let's pretend that hot-shot programmers prefer to work in agile environments rather than waterfall environments, so the average ability of the programmers will be higher for agile than for waterfall. So we have an omitted variable that affects the dependent variable (productivity) and is correlated with the independent variable we are interested in (agile vs. waterfall). This will bias our estimate of the effect of agile vs. waterfall, and the size of the bias typically looks like this:
bias = (effect of average ability on productivity) *
(slope from regressing average ability on the agile/waterfall indicator)
Since both of these are assumed to be positive, the bias will be positive (and since I'm assuming ability has a large effect on productivity, the bias will also be large) - the effect of agile will be substantially overstated in your analysis.
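To make the mechanism concrete, here is a minimal R sketch under invented assumptions - the numbers, and the way ability feeds into both productivity and the choice of methodology, are all made up for illustration:

# Hypothetical simulation of omitted variable bias - every number here is invented
set.seed(42)
n       <- 1000
ability <- rnorm(n)                       # unobserved programmer ability
agile   <- rbinom(n, 1, plogis(ability))  # hot-shots are more likely to pick agile
productivity <- 10 + 2 * agile + 5 * ability + rnorm(n)   # true agile effect is 2

coef(lm(productivity ~ agile))            # ability omitted: the agile effect is overstated
coef(lm(productivity ~ agile + ability))  # ability included: close to the true value of 2

The first regression attributes the ability-driven variation to agile, which is exactly the bias described above.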

Of course, you could always decide to lie with your statistics and hope that your target audience doesn't know enough about statistics to argue against your results properly (especially if you include math, they might not even read the analysis!). This is not a very respectable thing to do, but that doesn't stop people from doing it anyway - how many statistics do you see in newspapers, magazines, blogs, etc. that do not include a standard error and confidence interval? Or when they say average/mean, do they say which type of average/mean? Or do they tell you their sample selection methods? I could go on like this for a while.

Anyway let's assume that you decide to do everything legit and you want to control for programmer ability. The problem with this is how do you give an objective number to each programmer? Sometimes in labour economics worker ability is treated as an unobserved variable, which is a variable for which you cannot get a good quantitative value. So how do you go about measuring this? There are some techniques that have been developed like instrumental variables which can help, but typically you're going to have to make some trade-off between biased results and imprecise results.
In fact, productivity may also be considered an unobserved variable!

On another note, I'd like to get some feedback on these posts. Do people find them helpful, or am I just boring you with what you learned in your stats classes?


Mar 24, 2010

Statistics For Programmers V: Performance Analysis

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In the last post, I covered a little bit of theory on how ordinary least squares works. This is a method of fitting a curve to a set of points so that you can get estimates of how independent variables affect a dependent variable.

In this post, we'll use that technique in a programming-related example. Let's say we have the task of optimizing code. A certain section is too slow, and we need to figure out why it is slow and fix it up. The vast majority of the time, the slow part is in our own code, or in code that we can access and change (gotta love the prevalence of open source these days!). However, in our hypothetical scenario you're working with a library that is not open source, meaning that there's no real way for you to look at the code and analyze it properly. So what do you do? Well, you could roll up your sleeves, fire up ndisasm (or whatever disassembler you have handy) and take a peek at the assembly code. While this is possible, I wouldn't recommend it: you'll spend all day digging through the disassembly just trying to figure out what is going on, and on top of that you'll still have to do your performance analysis, which often requires some probability anyway.

So, why not attempt to use statistics for it? If you are using a library then chances are the functions/methods you are using have some sort of inputs that they use (the independent variables) and you have some kind of measurable (usually) dependent variable based on those inputs - like the amount of time it takes the function to run, how much memory it uses, etc. What you could do is take various sets of possible input values, and record the dependent variable(s). Then afterward you form a hypothesis and you can test it using ordinary least squares.

What do I mean by form a hypothesis? Well, let's say that you are fairly certain that the algorithm is O(n²). What you could then do is set up your regression equation like this:
T = β0 + β1n + β2n² + ε
After that you could just run your code over and over with different values of n and collect the corresponding execution times. You can then run the regression on the set of data to get the estimates for the coefficients.
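As a rough sketch of what that could look like in R - the function being measured, mystery_fn, and the range of input sizes are placeholders, not anything from a real library:

# Hypothetical timing experiment: run the library call at different input sizes
sizes <- seq(100, 5000, by = 100)
times <- sapply(sizes, function(n) {
  input <- runif(n)                          # made-up input for the call
  system.time(mystery_fn(input))["elapsed"]  # mystery_fn is a placeholder name
})

fit <- lm(times ~ sizes + I(sizes^2))  # T = β0 + β1*n + β2*n² + error
summary(fit)                           # estimates, standard errors and t-statistics

The summary() output contains the standard errors and t-statistics discussed below.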

Once you run the regression, most statistical packages will spit out a standard error for each coefficient. This standard error is the estimated standard deviation of the distribution of each coefficient. What does that mean in English? Well, there is an actual value for each coefficient, called β, which unfortunately you don't know. If you take a sample of a certain size then, assuming all the assumptions are satisfied, your estimate effectively takes a random value scattered around β; once standardized by its standard error, that estimate follows a t-distribution. The spread of the estimate's distribution has a standard deviation, which again we unfortunately don't know. The standard error is effectively our guess of that standard deviation based on the sample we have.
What does the standard error do? It determines how precise our estimate of β is. If our estimate of β is 100 and our standard error is 0.1, then it's nearly a 100% chance that the actual value of β is between 99 and 101. However if the standard error is 10, then the range that the actual β could be in is much larger. You can get smaller standard errors by increasing your sample size, or by getting a wider range of values for your independent variables.
Note that when some of our assumptions are violated, it can potentially have a huge effect on the standard errors.

We can use the estimates and the standard errors to form hypotheses about the actual values of the coefficients. Suppose our estimate of β is 5, and we want to test whether the actual value of β is 6. We would then form a hypothesis that β = 6; this is called the null hypothesis. You compute a t-statistic for this null hypothesis using the formula:
(est(β) - β) / st.err
If this t-statistic is small (the threshold depends on your sample size and how confident you want to be in your results) then you cannot reject the null hypothesis - this does not mean that the null hypothesis is true, it just means that this sample is not enough evidence to say that it is false. It could be true that β = 5.5, and we could still get a sample like this (depending on the standard error of course). So if the t-statistic is small, we can't say a whole lot about our null hypothesis.
On the other hand if the t-statistic is large, it provides a good amount of evidence that the null hypothesis is wrong. So typically when you test a hypothesis, you let the null hypothesis be something that you want to "prove" wrong (I say "prove" since you can't really have 100% proof in statistics like you can in math).
Most stats packages like R or even Excel will spit out a calculated t-statistic for each coefficient. What this value represents is the t-statistic for the null hypothesis of the coefficient being equal to zero. If that t-statistic is small, we say that the coefficient is not statistically significant - that is, we can't really say that it is any different from zero.
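Continuing the hypothetical timing fit from the sketch above, here is how you could compute that t-statistic by hand for a null hypothesis other than zero (the value 1 used for the null here is arbitrary):

# Hypothetical: test H0: the coefficient on n² equals 1, rather than 0
b2      <- coef(fit)["I(sizes^2)"]
se2     <- sqrt(diag(vcov(fit)))["I(sizes^2)"]
t_stat  <- (b2 - 1) / se2                                 # (est(β) - β) / st.err
p_value <- 2 * pt(-abs(t_stat), df = df.residual(fit))    # two-sided p-value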

What does all this mean for testing the code? Well it lets us get an idea of how fast the code runs. Suppose we get an estimated equation of:
T = 100 + 30n + 0.5n²
Provided that our standard errors for the coefficients are fairly small, this equation shows us several things. First, it shows that there is a fixed cost to the code: no matter the size of the input, we pay that 100 each time we run it. It also shows a fairly small estimate on the n² term. This is useful for comparing this algorithm to another. Suppose you come up with a fancy new algorithm that you figure out runs in O(n) time. This is theoretically better than the O(n²) algorithm, but if its equation looks like this:
T = 2000 + 400n
then the O(n²) algorithm is much better if you're only ever working with small values of n.
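With these made-up equations you can even solve for the break-even input size; a quick check in R:

# Where does 2000 + 400n equal 100 + 30n + 0.5n²?  (made-up coefficients)
crossover <- polyroot(c(100 - 2000, 30 - 400, 0.5))  # roots of 0.5n² - 370n - 1900
Re(crossover)  # the positive root (around n = 745) is where the O(n²) version stops winning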

So this is a basic introduction to using statistics for performance analysis. Unfortunately, there are still plenty of things that can go wrong. I listed a few assumptions in the last post; if those don't hold, then you can have problems such as bias and inefficiency (I'll talk about these more in later posts). Also, if your model is misspecified (if perhaps the algorithm is actually O(n log n) instead of O(n²)) then you probably won't get correct results. Fortunately there are tests for this kind of thing that you can do, which I might cover in a later post.


Mar 15, 2010

Statistics For Programmers IV: Ordinary Least Squares

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In my last post about statistics, I spoke about how to compare two different averages to see whether they are statistically different or not. It was a good introduction to using statistics for hypothesis testing since it was fairly simple yet went over some of the fundamental ideas of statistics - that disturbances could lead us to wrong conclusions, and that we must do our best to detect whether differences in results come from those disturbances or from the effect we believe is there.

This time we'll address something a bit more complex. In many cases we want to analyze a causal relationship between two variables - that is, variation in one variable is causing variation in another variable.

Note that this post is a bit more theory-oriented, the next one will apply the theory to something programming-related.

In econometrics (something that I've been studying a fair bit over the last year or so), a tool used for this type of analysis is the ordinary least squares (OLS) method. This method assumes that the relationship between the variables can be explained by an equation that looks like this:
y = β0 + β1x + ε
Here, y is the dependent variable, x is the independent variable, β1 is the effect of a one-unit change of x on y, β0 is the intercept term (basically what y is predicted to be when x is zero), and ε is a random error term. This example only uses a single independent variable, but you can have many more. You can also have functions of independent variables in there, so you could put x² if you like. For OLS, the restriction is that the equation is linear in the β terms (called the parameters) - that is, OLS is a linear estimator.

Once we have formulated our model, we can use data to estimate what the β coefficients are. There are plenty of statistical packages out there that will do OLS on a dataset, R is one that is open-source and available for Ubuntu. It's not as user-friendly as other ones that I've used, but it is easier to get your hands on it.

This paragraph talks about how the estimates are computed, feel free to skip it if you don't care. Otherwise if you want to be hardcore and calculate out the coefficients yourself, the math formula is:
est(β) = (X'X)⁻¹X'y
Here est(β) is the estimator of β, which is a column vector of the parameters in your equation (in the example above we had 2 parameters, so it would be a 2×1 vector). The X is a matrix of your independent variables; for the intercept term you just set one of the columns of the matrix to all 1s. Each independent variable occupies one column of the matrix, so the matrix is n×k for n observations and k independent variables (including an intercept term if you have one). X' is the transpose of X.
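As a quick sanity check, here is a sketch in R on made-up data showing that the matrix formula and lm() give the same numbers:

# Made-up data to check the matrix formula against lm()
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 1.5 * x + rnorm(n)      # true β0 = 3, β1 = 1.5

X        <- cbind(1, x)          # column of 1s for the intercept term
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)⁻¹X'y
beta_hat
coef(lm(y ~ x))                  # same estimates, computed by lm()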

So now you know how to take a bunch of data and infer a causal relationship between the variables, right? Unfortunately it is not entirely that simple. OLS comes with a whole pile of assumptions that need to hold in order for you to properly analyze the results that come out:
  • The expected value of ε conditional on X is zero. That means that if you hold your independent variables fixed and just keep taking observations, the average error will be zero. It is common for this assumption to be violated, and when it is you usually end up with biased results. It is important to understand how it can be violated and what to do when it is. I'll talk about this more in a later post.
  • Random sampling - the different observations of a specific variable are not correlated with one another. This can be violated for dynamic systems where the observations are taken over time.
  • No perfect multicollinearity - this basically means that no independent variable is a perfect linear combination of other independent variables (for example if x5 = 2x2 - 3.14x7 you would have a perfect linear combination). You're pretty hosed if this is the case, and you'll probably need to end up removing a variable.
  • Homoskedasticity - given any value of X, the spread of your error term is constant. When this is violated you can still use OLS, it's just that there will be better estimators.
  • No autocorrelation - none of the error terms are correlated with one another. With autocorrelation you can still use OLS, but as with heteroskedasticity there will be better estimators.
  • Normal errors - your errors must be drawn independently from a normal distribution with a constant variance. Note that if this is the case then you will have homoskedasticity and no autocorrelation. If this assumption is violated, then you cannot use the hypothesis tests in the same way. (A rough way to eyeball these last three assumptions is sketched right after this list.)
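Here is that rough diagnostic sketch in R, reusing the made-up fit from the example above; these plots are informal checks, not formal tests:

# Informal checks of the error assumptions on a fitted lm() object
fit <- lm(y ~ x)                # the made-up data from the sketch above
plot(fitted(fit), resid(fit))   # a funnel shape suggests heteroskedasticity
acf(resid(fit))                 # large spikes at nonzero lags suggest autocorrelation
qqnorm(resid(fit)); qqline(resid(fit))   # points far from the line suggest non-normal errors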

Assuming all the assumptions are satisfied, we can do hypothesis testing on our estimates of the parameters. Fortunately the hypothesis testing here is even easier, since statistical packages usually spit out the p-value for each parameter. That p-value is the probability that, if the actual value of the coefficient were zero, we would get an estimate at least as far from zero as the one we did. So if the p-value is very small, you can guess that the actual value of the parameter is not zero.

Anyway like I said before, this post was a bit dry and theoretical. I'll get to something more practical next time.


Feb 15, 2010

Statistics For Programmers III: Differences of Means

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

One thing that I've noticed programmers (including myself) do when comparing two versions of code is they run it once for both versions and compare the time it takes. If one is faster, then they conclude that that one is faster and take it.

It is possible that this conclusion is correct, that the faster one is indeed actually faster, however the evidence here is not very good. It is also highly possible that the faster one is only faster because some randomness in the execution time made it seem faster. Take a look at this picture:


Suppose this is the graph of the runtimes of two versions of the same function. The variance is the same for both at 16 (so the standard deviation is 4, meaning that, roughly speaking, a typical run lands about 4 away from the mean). The green version of the function has an average runtime that is twice that of the red version, meaning that it is twice as slow. However it is quite possible that your single test of the green function ends up in its left tail and the test of the red function ends up in its right tail, so that you wrongly conclude that the green version is faster. (Note that another good idea here would be to try to reduce the variance of your function.)

In order to get a good idea of which one is actually faster, you need to do some analysis, and for this analysis you need to know the variance of the running time. Unfortunately you cannot get an estimate of the variance from a single test (mathematically it results in a divide by zero), so you need several tests of both versions in order to get a decent sample. How many tests? That depends on a great many factors, but if we assume that the randomness is a random variable that is normally distributed (I'll give a tiny bit of info at the end about what happens if it isn't), then the number of tests depends on what level of confidence you want in your results and how precise you want your estimates to be. If you want to be 95% sure you are correct instead of 90% sure, you'll either need to accept a lower level of precision or run more tests. On the other hand, if you want to be more precise, you'll need to either accept a lower level of confidence in your results or run more tests.
Finally, the number of tests you'll need depends on the variance of the randomness in your tests: a higher variance means more tests for the same level of precision.
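If you're willing to assume normality and to guess at the variance, R's built-in power.t.test() can turn that trade-off into a concrete number of runs; the effect size and standard deviation below are invented for illustration:

# How many paired runs to detect a 2-unit difference in mean runtime,
# assuming the standard deviation of the differences is about 4? (invented numbers)
power.t.test(delta = 2, sd = 4, sig.level = 0.05, power = 0.9, type = "paired")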

So let's say we've got some data for our two versions of the code and we want to know which one is faster. Unfortunately it is not as simple as taking the average from one and subtracting the other, since this doesn't take into account the variance of the two distributions.

One way to do this is to compute a new dataset containing the differences between the two distributions. You would do this with the formula (assuming xt is the tth observation from one version of the code and yt is the tth observation from the second version):
dt = xt - yt
You can then take the average and variance of the ds and make a conclusion. We want to test if the difference is equal to zero, because if it is then we cannot say that there is a difference between these two samples. We do this by taking a t-statistic, which uses the following formula (assuming d is the difference, and n is the size of d):
t = average(d) / sqrt(variance(d) / n)
You then compare this value to the t-distribution with n - 1 degrees of freedom (I'll explain what the degrees of freedom are in a future post) to see what the probability is of getting this t-statistic under the assumption that the actual mean of the difference is zero. If this probability is very low, then we can say with a certain level of confidence that our assumption of a zero mean difference is wrong. This probability is called the p-value (it is not the same thing as the power of the test, which measures something else), and using R you can compute it:
2 * pt(-abs(t), df = n - 1)
This will give you a value between 0 and 1: the probability of getting a t-statistic at least this extreme if the mean of the difference were actually zero. If this probability is really low, then it is more likely that our assumption of a zero mean is wrong. And if the difference is not zero, then one of the two versions of our code is faster - namely the one with the lower average runtime.
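Here is a small R sketch of the whole procedure on simulated timing data (the runtimes are made up, not real measurements):

# Simulated runtimes for two versions of the same function (made-up numbers)
set.seed(7)
n <- 30
x <- rnorm(n, mean = 105, sd = 4)   # version 1 runtimes
y <- rnorm(n, mean = 100, sd = 4)   # version 2 runtimes

d <- x - y
t_stat  <- mean(d) / sqrt(var(d) / n)
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
p_value
t.test(x, y, paired = TRUE)         # does the same calculation for you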

You can also do this using confidence intervals. This is where you compute a range, and if zero is in this range then you cannot say that the difference is not zero. With a confidence interval you need a t-critical value for your level of confidence. If you want to be C% confident, you can use this formula in R to get the t-critical value (you divide by 2 because qt() works with a single tail, and the remaining probability 1 - C gets split across both tails):
abs(qt((1 - C) / 2, df = n - 1))
You can then create your confidence interval:
lower_bound = average(d) - tcrit * sqrt(variance(d) / n)
upper_bound = average(d) + tcrit * sqrt(variance(d) / n)
If zero is in this interval, then you cannot say that there is a difference between the runtimes of the two functions.
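Continuing the simulated data from the sketch above, the 95% confidence level here is an arbitrary choice:

# 95% confidence interval for the mean difference (continuing the sketch above)
C     <- 0.95
tcrit <- abs(qt((1 - C) / 2, df = n - 1))
lower <- mean(d) - tcrit * sqrt(var(d) / n)
upper <- mean(d) + tcrit * sqrt(var(d) / n)
c(lower, upper)   # if zero falls inside this range, we can't claim a difference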

A quick note on normality: we assume here that the randomness in the runtime is normally distributed. If we don't know that it is normal, then the "t-statistic" we calculate does not actually follow a t-distribution, and the results we get from computing t-statistics and p-values are pretty meaningless. However this problem can be remedied by using a large sample, because as the sample size goes to infinity the distribution of our "t-statistic" converges to the normal distribution, and we can still use the p-value. This also requires a few assumptions, which I will talk about in later posts.


Jan 25, 2010

Statistics For Programmers II: Estimators and Random Variables

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

The whole point of statistics is that there are some things that we don't know and can't really measure directly - it might be because it costs too much, or because we have a lot of measurement error, or because it is actually impossible to measure the thing accurately. With statistics we take a sample of something and attempt to derive results from that sample about the population as a whole. For example, you might want to test how long it takes for your code to process something. Unfortunately, if you run it over and over you're not going to get the same result each time. How do you know what the actual value is? And more importantly, how do you compare that actual value with other actual values? Another example: you want to compare the processing time across two different commits - how do you know that one is faster than the other? You could run each one once and see which is faster, but the problem with that is you could just have a statistical blip and draw a wrong conclusion. Statistics will help you decide, with a certain level of confidence, which one is actually faster.

Now we've got these unknown values that we would like to know about, how do we go about getting information about them? We do this using things called estimators. An estimator is basically a function which takes as an input your sample and perhaps a few other values, and returns an estimate of the value you are trying to know about. One of the main things I will be talking about over this series is how to obtain these estimators, and how to analyze them to find out how good of an estimate they produce.
I've also mentioned this idea of a sample; for the most part I will be referring to a random sample, which means that the observations are drawn at random from the population.

One of the major mistakes people make when doing statistics is to assume that estimators can be manipulated just like other functions. This is incorrect, because usually the estimator is an example of a random variable, often because it is a function of a random error term (which, as has been pointed out, is not always truly random). This random error term is the combination of all the unobserved aspects of a variable; for example, when you're measuring the execution speed of a program, the random error might be caused by how many other processes are running, how much CPU time those processes used, how hot your computer is running, that one-in-a-million time that something had to swap, etc. Basically anything that you do not explicitly put into your equation is captured by the random error.
What is this random variable thing? A random variable is a variable that doesn't really take a single value the way regular variables do. Instead, each time you look at it, it will have a different value. You analyze these by associating a probability distribution with the random variable, which lets you calculate certain properties. Examples include the binomial distribution (the number of sixes you get in a series of rolls of a fair die follows this one), or the normal distribution. For a more precise definition of a random variable, look on Wikipedia.
There are certain hiccups when it comes to working with random variables, in that the standard mathematical operations don't quite work the same as with regular variables. For example, if you do a least squares regression (I'll explain this in detail in a later post) to obtain an estimate of ln(y), you can't just raise e to the power of that estimate to get an estimate of y. It doesn't work that way. A related issue comes up when you have heteroskedasticity, which is when the variance of the error term is a function of something you are measuring - for example, the variance could grow linearly with one of your variables.
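A small R sketch of that retransformation problem on simulated data (the numbers are purely illustrative): exponentiating the fitted log values systematically under-predicts y.

# Simulated data where the true relationship is ln(y) = 1 + 0.5*x + error
set.seed(3)
n <- 5000
x <- runif(n, 0, 2)
log_y <- 1 + 0.5 * x + rnorm(n)   # error is standard normal
y <- exp(log_y)

fit <- lm(log_y ~ x)
mean(exp(fitted(fit)))   # naive back-transform of the estimated ln(y)
mean(y)                  # noticeably larger, since E[y] = exp(Xβ) * E[exp(error)]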

This entry was a bit boring and very theory-heavy. The next entry will have some practical examples that you can use with the programs you write.


Jan 24, 2010

Statistics For Programmers I: The Problem

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics.

After seeing Greg Wilson's talk at CUSEC, and reading Zed Shaw's rant about statistics, I think it is time to write up a little bit for all of you on how to do this stuff so that you don't make too many mistakes. If you've gone to university, then you probably did a class on statistics (from my very small sample of computer science programs: Bishop's and Concordia both require a statistics class, and McGill has one in its "choose one of the following 4 classes" list). If you didn't go to university, chances are you haven't seen much statistics other than really basic stuff like means/variances and z-scores.

In either case, I don't think the level of statistics taught really shows you how to use and analyze statistics properly - kinda like how an intro Java class really doesn't show you how to use and analyze programs written in Java; you need to practice it on your own. The difference is that in a computer science program you tend to use your Java/other-random-language a lot more than you use statistics. Furthermore, statistics has a lot more pitfalls than Java programming does, and these pitfalls don't really jump up and smack you in the face like they do in programming. Instead of a program crashing or becoming difficult to maintain once you begin to scale (problems that are very obvious when you hit them), the problems with statistics lead you to wrong results, and it is quite difficult to tell when you are wrong unless you actually know how to analyze your results. Finally, even when you do know your statistics well, your human biases can kick in and tell you that you might be doing it wrong because the results don't match up with what you believe to be true (see cargo cult science).

Oh, and even if you do everything right, you might just be unlucky and get a crappy sample. Gotta love statistics.

So what is the problem? Well, there are actually two problems. The first one is that most programmers out there seem to have an insufficient education in statistics to use it properly. The second problem is that programmers don't seem to use statistics to properly evaluate their claims. Unfortunately I can really only do something about the first claim, and that is the goal of this mini-series. I can attempt to write some basic software that helps analyze statistics, spits out some numbers, etc., but there is only so much that I can do there. How does that old saying go, "you can lead a horse to water but you can't make it drink"? Something like that.

Did anybody spot the hypocritical aspect in the last paragraph? If you did, good job! Basically my conclusions on both problems are based on my own anecdotal evidence, and what I've heard smart people say at conferences or on their blogs. This isn't really a great sample from which I can draw conclusions, so my results may be quite incorrect. However, even if I am wrong and most programmers know and use statistics well, I figure it couldn't hurt to talk about it anyway and perhaps people will tell me where I am wrong.

Finally, I don't claim to be an expert in statistics. In fact, the more I learn about statistics, the more I realize how little I have learned! Basically my knowledge of statistics is based on the classes I have taken in my economics program – four undergrad classes to date, and I'm currently in a grad class. However, those classes are on econometrics, which I don't think is quite the same as statistics, but it still deals with analyzing messy data to extract information.
