In the fourth post in this series, I spoke briefly about the assumptions required for using regression to analyze your data. One of them was about the error term having zero mean conditional on your independent variables - this means that if you hold your independent variables fixed, then the average of your errors should be close to zero.
There are certain situations when this assumption can be violated. A common one, and the one I'll be talking about today (obviously, given the title of the post), is called omitted variable bias. This occurs when you leave out a variable that has an effect on the dependent variable and is also correlated with one or more of the independent variables you did include (such variables are often called confounding variables, or confounders). What happens when you use ordinary least squares is that your regression doesn't know that variation in the omitted variable (let's call it c) is causing changes in the dependent variable (call it y); c's effect ends up in the error term. If c is correlated with an independent variable x, the error term is then correlated with x too, and the regression cannot tell whether the variation in y is being caused by x or by c. It will therefore attribute to x the part of c's effect that moves together with x.
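
To make this concrete, here is a minimal simulation sketch in Python (using NumPy). Everything in it is made up for illustration: the function name `simulate_ovb`, the "true" coefficients, and the strength of the link between x and c are my own assumptions, not real data. It fits the regression twice, once leaving c out and once including it.

```python
import numpy as np

def simulate_ovb(n=10_000, beta_x=1.0, beta_c=2.0, rho=0.5, seed=0):
    """Simulate y = beta_x*x + beta_c*c + u and fit it with and without c.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)               # the variable we will omit
    x = rho * c + rng.normal(size=n)     # x is correlated with c
    u = rng.normal(size=n)               # well-behaved error term
    y = beta_x * x + beta_c * c + u

    # "Short" regression: y on a constant and x only (c is omitted).
    X_short = np.column_stack([np.ones(n), x])
    short_coef = np.linalg.lstsq(X_short, y, rcond=None)[0]

    # "Long" regression: y on a constant, x and c.
    X_long = np.column_stack([np.ones(n), x, c])
    long_coef = np.linalg.lstsq(X_long, y, rcond=None)[0]

    # Return the estimated coefficient on x from each regression.
    return short_coef[1], long_coef[1]

short_slope, long_slope = simulate_ovb()
print(f"coefficient on x, c omitted:  {short_slope:.2f}")
print(f"coefficient on x, c included: {long_slope:.2f}")
```

With these made-up numbers the true effect of x is 1.0. The regression that includes c should land close to that, while the one that omits c should come out noticeably higher, because it credits x with part of c's effect.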
Let's look at an example. Suppose you've been arguing with someone on reddit about agile vs. waterfall and you want to prove which one of these makes programmers more productive (while I don't know if one is actually better or not, let's assume for illustration purposes that one of them is). Now suppose you've read all my previous posts on statistics, and you've taken a course in statistics in university, so you feel fairly confident in your ability to do statistical analysis. You go out and start collecting some data from various companies (suppose they cooperate and give you their data, and that it is reliable). You collect various types of data: how big their team is, the experience of the team members, the language(s) and libraries they're using, and finally whether they are using agile or waterfall. Also suppose you have some reliable way of measuring productivity (this is another one we'll have to assume, since measuring programmer productivity isn't that easy). So you run a regression on all this and it turns out that agile has a significantly larger effect on programmer productivity than waterfall. You post your results on reddit and say, "Ha! The data says so!"
Unfortunately there is a problem here. What you haven't included is the average ability of the programmers at the various companies. It is fairly certain that the average ability of your programmers will impact productivity (of course that will require you to hold constant any synergistic effects between programmers - see, this stuff is hard!). Again for illustration purposes let's pretend that hot-shot programmers like to work in agile environments rather than waterfall environments, so the average ability for programmers will be higher for agile than for waterfall. So we have an omitted variable that is correlated with both the dependent variable (productivity) and the independent variable we are interested in (agile vs. waterfall). This will bias our estimate for the effect of agile vs. waterfall, and the size of the bias typically looks like this:
bias = (effect of average ability on productivity) * (correlation between average ability and agile vs. waterfall)
Since these are both assumed to be positive values, the bias will be positive (since I'm assuming ability has a large effect on productivity, this bias will also be large) - the effect of agile will be very much overstated in your analysis.
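
For readers who prefer symbols, the bias can be sketched for the simplest case of one included variable x and one omitted variable c. The symbol names below are my own choice. Strictly speaking, the second factor is the slope from regressing c on x rather than a correlation coefficient, but it always has the same sign as the correlation. With more than one included variable the same idea holds, with the slope taken after controlling for the other included variables.

```latex
\begin{align*}
  \text{true model:}\quad
    & y = \beta_0 + \beta_1 x + \beta_2 c + u \\
  \text{model actually estimated:}\quad
    & y = \tilde{\beta}_0 + \tilde{\beta}_1 x + v \\
  \text{link between } c \text{ and } x\text{:}\quad
    & c = \delta_0 + \delta_1 x + e,
      \qquad \delta_1 = \frac{\operatorname{Cov}(x, c)}{\operatorname{Var}(x)} \\
  \text{what the estimate converges to:}\quad
    & \operatorname{plim}\, \tilde{\beta}_1 = \beta_1 + \beta_2 \delta_1 \\
  \text{so:}\quad
    & \text{bias} = \beta_2 \, \delta_1
\end{align*}
```

In the agile example, \beta_2 is the effect of average ability on productivity and \delta_1 captures how strongly ability goes together with using agile. If both are positive, the estimated effect of agile is too high.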
Of course, you could always decide to lie with your statistics and hope that your target audience doesn't know enough about statistics to properly argue against your results (especially if you include math, they might not even read the analysis!). This is not a very respectable thing to do, but that doesn't stop people from doing it anyway - how many statistics do you see in newspapers, magazines, blogs etc. that do not include a standard error and confidence interval? When they say average/mean, do they say which type of average/mean? Do they tell you their sample selection methods? I could go on like this for a while.
Anyway let's assume that you decide to do everything legit and you want to control for programmer ability. The problem with this is how do you give an objective number to each programmer? Sometimes in labour economics worker ability is treated as an unobserved variable, which is a variable for which you cannot get a good quantitative value. So how do you go about measuring this? There are some techniques that have been developed like instrumental variables which can help, but typically you're going to have to make some trade-off between biased results and imprecise results.
In fact, productivity may also be considered an unobserved variable!
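
The trade-off between biased and imprecise results that comes with instrumental variables can be illustrated with another small simulation. This is only a sketch under strong assumptions: the data is simulated, all coefficients are made up, and `z` is a hypothetical instrument that affects x but has no effect on y other than through x (finding such a variable in practice is the hard part). It uses the simplest instrumental variables estimator for one regressor and one instrument, Cov(z, y) / Cov(z, x), and compares it with ordinary least squares over many simulated samples.

```python
import numpy as np

def one_sample(rng, n=500, beta_x=1.0, beta_c=2.0):
    """Draw one simulated data set and return (OLS, IV) estimates of beta_x.

    All coefficients are illustrative assumptions.
    """
    z = rng.normal(size=n)                        # hypothetical instrument
    c = rng.normal(size=n)                        # unobserved variable (e.g. ability)
    x = 0.5 * z + 0.5 * c + rng.normal(size=n)    # x depends on both z and c
    y = beta_x * x + beta_c * c + rng.normal(size=n)

    # OLS slope of y on x with c omitted: Cov(x, y) / Var(x).
    ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # Simple IV estimator with one instrument: Cov(z, y) / Cov(z, x).
    iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
    return ols, iv

rng = np.random.default_rng(0)
estimates = np.array([one_sample(rng) for _ in range(1000)])

ols_mean, iv_mean = estimates.mean(axis=0)   # average estimate across samples
ols_sd, iv_sd = estimates.std(axis=0)        # spread of the estimates

print(f"OLS: mean {ols_mean:.2f}, spread {ols_sd:.2f}")
print(f"IV:  mean {iv_mean:.2f}, spread {iv_sd:.2f}")
```

With a true effect of 1.0, the OLS estimates should cluster tightly around a wrong value, while the IV estimates should be centred near the true value but scattered much more widely from sample to sample.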
On another note, I'd like to get some feedback on these posts. Do people find them helpful, or am I just boring you with what you learned in your stats classes?