Mar 15, 2010

Statistics For Programmers IV: Ordinary Least Squares

This entry is part of a larger series of entries about statistics and an explanation of how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In my last post about statistics, I spoke about how to compare two different averages to see if they are statistically different or not. It was a good introduction to using statistics for hypothesis testing: it was fairly simple, yet it covered some of the fundamental ideas of statistics - that random disturbances can lead us to wrong conclusions, and that we must do our best to detect whether differences in results come from those disturbances or from the effect we believe is causing them.

This time we'll address something a bit more complex. In many cases we want to analyze a causal relationship between two variables - that is, variation in one variable is causing variation in another variable.

Note that this post is a bit more theory-oriented, the next one will apply the theory to something programming-related.

In econometrics (something that I've been studying a fair bit over the last year or so), a tool used for this type of analysis is the ordinary least squares (OLS) method. This method assumes that the relationship between the variables can be explained by an equation that looks like this:
y = β₀ + β₁x + ε
Here, y is the dependent variable, x is the independent variable, β₁ is the effect of a one unit change of x on y, β₀ is the intercept term (basically what y is predicted to be when x is zero), and ε is a random error term. This example only uses a single independent variable, but you can have many more. You can also have functions of independent variables in there, so you could put x² in if you like. For OLS, the restriction is that the equation is linear in the β terms (called the parameters) - that is, OLS is a linear estimator.
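
To make the equation concrete, here is a small sketch in Python (using numpy) that generates data from exactly this kind of model. The parameter values and the error distribution are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(42)

n = 100
beta0, beta1 = 2.0, 0.5            # "true" parameters, chosen arbitrarily
x = rng.uniform(0, 10, size=n)     # independent variable
eps = rng.normal(0, 1, size=n)     # random error term, mean zero

y = beta0 + beta1 * x + eps        # dependent variable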

Once we have formulated our model, we can use data to estimate what the β coefficients are. There are plenty of statistical packages out there that will do OLS on a dataset; R is one that is open-source and available for Ubuntu. It's not as user-friendly as some of the other ones I've used, but it's easier to get your hands on.
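
As a quick example of what using such a package looks like from a programmer's point of view, here is a sketch using statsmodels, another freely available option (this one for Python); x and y are the simulated data from the sketch above:

import statsmodels.api as sm

X = sm.add_constant(x)             # adds the column of 1s for the intercept term
results = sm.OLS(y, X).fit()       # run ordinary least squares

print(results.params)              # estimated coefficients, intercept first
print(results.summary())           # the full regression output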

This paragraph talks about how the estimates are computed; feel free to skip it if you don't care. Otherwise, if you want to be hardcore and calculate the coefficients yourself, the formula is:
est(β) = (X'X)⁻¹X'y
Here est(β) is the estimator of β, which is a column vector of the parameters in your equation (in the example above we had 2 parameters, so it would be a 2×1 vector). X is a matrix of your independent variables; for the intercept term you just set one of the columns of the matrix to all 1s. Each independent variable occupies one column, so the matrix is n×k for n observations and k independent variables (including an intercept term if you have one). X' is the transpose of X, and y is the n×1 vector of observations of the dependent variable.
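
The formula translates almost directly into numpy if you want to see it in action; this sketch reuses the simulated x and y from earlier:

import numpy as np

# Build the n x k matrix X: a column of 1s for the intercept, then the x values
X = np.column_stack((np.ones_like(x), x))

# est(beta) = (X'X)^-1 X'y; solving the linear system is numerically safer
# than explicitly inverting X'X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                    # should come out close to beta0 and beta1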

So now you know how to take a bunch of data and infer a causal relationship between the variables, right? Unfortunately it is not entirely that simple. OLS comes with a whole pile of assumptions that need to hold in order for you to properly analyze the results that come out:
  • The expected value of ε conditional on X is zero. That means that if you hold your independent variables fixed and just keep taking observations, the average error will be zero. It is common for this assumption to be violated, and when it is you usually end up with biased results. It is important to understand how it can be violated and what to do when it is. I'll talk about this more in a later post.
  • Random sampling - the different observations of a specific variable are not correlated with one another. This can be violated for dynamic systems where the observations are taken over time.
  • No perfect multicollinearity - this basically means that no independent variable is a perfect linear combination of the other independent variables (for example, if x₅ = 2x₂ - 3.14x₇ you would have a perfect linear combination; see the sketch after this list). You're pretty hosed if this is the case, and you'll probably need to remove a variable.
  • Homoskedasticity - given any value of X, the spread of your error term is constant. When this is violated you can still use OLS, it's just that there will be better estimators.
  • No autocorrelation - none of the error terms are correlated with the other error terms. With autocorrelation you can still use OLS, but as with heteroskedasticity, there will be better estimators.
  • Normal errors - your errors must be independently and normally distributed with a constant variance. Note that if this is the case then you will have homoskedasticity and no autocorrelation. If this assumption is violated, then you cannot use the hypothesis tests in the same way.
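
To illustrate the perfect multicollinearity point, here is a small sketch (with made-up data) showing how an exact linear combination makes X lose rank, which is what breaks the (X'X)⁻¹ part of the OLS formula:

import numpy as np

rng = np.random.default_rng(0)
n = 100

x2 = rng.uniform(0, 10, size=n)
x7 = rng.uniform(0, 10, size=n)
x5 = 2 * x2 - 3.14 * x7            # a perfect linear combination of x2 and x7

X = np.column_stack((np.ones(n), x2, x7, x5))

# X has 4 columns but only rank 3, so X'X is singular and can't be inverted
print(np.linalg.matrix_rank(X))    # prints 3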

Assuming all the assumptions are satisfied, we can do hypothesis testing on our estimates of the parameters. Fortunately the hypothesis testing here is even easier, since statistical packages usually spit out the p-value for each parameter. That p-value measures the probability that, if the actual value of the coefficient were zero, we would get an estimate at least as far from zero as the one we actually got. So if the p-value is very small, you can be fairly confident that the actual value of the parameter is not zero.
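
Continuing the statsmodels sketch from earlier, those p-values are available directly on the fitted results object:

# One p-value per coefficient, in the same order as the columns of X
print(results.pvalues)

# A common convention: call a coefficient statistically significant when its
# p-value falls below a chosen threshold such as 0.05
print(results.pvalues < 0.05)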

Anyway like I said before, this post was a bit dry and theoretical. I'll get to something more practical next time.
