Jan 24, 2010

Statistics For Programmers I: The Problem

This entry is part of a larger series about statistics that explains how to do basic statistics.

After seeing Greg Wilson's talk at CUSEC, and reading Zed Shaw's rant about statistics, I think it is time to write up a little bit for all of you on how to do this stuff so that you don't make too many mistakes. If you've gone to university, then you probably took a class on statistics (from my very small sample of computer science programs, Bishop's and Concordia both require a statistics class, and McGill has one in the "choose one of the following 4 classes" list). If you didn't go to university, chances are you haven't seen much statistics other than really basic stuff like means, variances and z-scores.

In either case, I don't think the level of statistics given really shows you how to use and analyze them properly. It's kinda like how an intro Java class really doesn't show you how to use and analyze programs written in Java: you need to practice it on your own. The difference is that you tend to use your Java/other-random-language a lot more than you would statistics in a computer science program. Furthermore, statistics has a lot more pitfalls in it than Java programming does, and these pitfalls don't really jump up and smack you in the face like they do in programming. A program that crashes or becomes difficult to maintain once you begin to scale is a very obvious problem when you hit it. The problems with statistics instead lead you to wrong results, and it is quite difficult to tell when you are wrong unless you actually know how to analyze your results. Finally, even when you do know your statistics well, your human biases can kick in and tell you that you must be doing it wrong because the results don't match up with what you believe to be true (see cargo cult science).

Oh, and even if you do everything right, you might just be unlucky and get a crappy sample. Gotta love statistics.
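
To make the "unlucky sample" point a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my choosing: a fair six-sided die (true mean 3.5), samples of 10 rolls, and a hypothetical helper named sample_means that is not from any library.

```python
import random
import statistics


def sample_means(trials, sample_size, seed=0):
    """Roll a fair six-sided die `sample_size` times, repeat that
    `trials` times, and return the mean of each sample.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    means = []
    for _ in range(trials):
        rolls = [rng.randint(1, 6) for _ in range(sample_size)]
        means.append(statistics.mean(rolls))
    return means


means = sample_means(trials=1000, sample_size=10)
print("true mean:           3.5")
print("lowest sample mean: ", min(means))
print("highest sample mean:", max(means))
```

Every one of those samples is collected the same "correct" way, yet with only 10 rolls each the lowest and highest sample means will most likely sit well away from 3.5. If you only ever got to see one of those extreme samples, you would have no obvious way of knowing it was an unlucky one.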

So what is the problem? Well, there are actually two problems. The first one is that most programmers out there seem to have an insufficient education in statistics to use it properly. The second problem is that programmers don't seem to use statistics to properly evaluate their claims. Unfortunately I can really only do something about the first problem, and that is the goal of this mini-series. I can attempt to write some basic software that helps analyze statistics, spits out some numbers, etc., but there is only so much that I can do there. How does that old saying go, "you can lead a horse to water but you can't make it drink"? Something like that.

Did anybody spot the hypocritical aspect in the last paragraph? If you did, good job! Basically my conclusions on both problems are based on my own anecdotal evidence, and what I've heard smart people say at conferences or on their blogs. This isn't really a great sample from which I can draw conclusions, so my results may be quite incorrect. However, even if I am wrong and most programmers know and use statistics well, I figure it couldn't hurt to talk about it anyway and perhaps people will tell me where I am wrong.

Finally, I don't claim to be an expert in statistics. In fact, the more I learn about statistics, the more I realize how little I have learned! Basically my knowledge of statistics is based on the classes I have taken in my economics program: four undergrad classes to date, plus a grad class I am currently taking. Those classes are on econometrics, which I don't think is quite the same as statistics but still deals with analyzing messy data to extract information.

