This entry is part of a larger series of entries about statistics and an explanation of how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.
The whole point of statistics is that there are some things we don't know and can't really measure directly - maybe because it costs too much, maybe because there is a lot of measurement error, or maybe because it is simply impossible to measure accurately. With statistics we take a sample of something and attempt to draw conclusions from that sample about the population as a whole. For example, you might want to test how long it takes for your code to process something. Unfortunately, if you run it over and over, you're not going to get the same result each time. How do you know what the actual value is? And more importantly, how do you compare that actual value with other actual values? Another example: suppose you want to compare the processing time across two different commits - how do you know that one is faster than the other? You could run each one once and see which is faster, but then you might just be seeing a statistical blip and draw the wrong conclusion. Statistics will help you decide, to a certain probability, which one is actually faster.
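To see this run-to-run variation for yourself, here is a minimal sketch that times the same function repeatedly. The `work` function is a hypothetical stand-in for whatever code you want to benchmark:

```python
import statistics
import time

def work():
    # Hypothetical stand-in for the code being measured.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

# Time the same function many times; the results differ run to run.
samples = []
for _ in range(30):
    start = time.perf_counter()
    work()
    samples.append(time.perf_counter() - start)

print(f"mean:  {statistics.mean(samples):.6f} s")
print(f"stdev: {statistics.stdev(samples):.6f} s")
```

On any real machine the standard deviation will be nonzero - there is no single "the" running time, only a distribution of running times.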
Now that we've got these unknown values that we would like to know about, how do we go about getting information about them? We do this using things called estimators. An estimator is basically a function that takes your sample (and perhaps a few other values) as input and returns an estimate of the value you are trying to learn about. One of the main things I will be talking about over this series is how to obtain these estimators, and how to analyze them to find out how good an estimate they produce.
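The simplest example of an estimator is the sample mean: a function that takes a sample and returns an estimate of the population mean. A quick sketch, using a made-up population of uniform random numbers whose true mean is 0.5:

```python
import random

def sample_mean(sample):
    """An estimator: maps a sample to an estimate of the population mean."""
    return sum(sample) / len(sample)

random.seed(42)
# Hypothetical population: uniform on [0, 1], so the true mean is 0.5.
sample = [random.random() for _ in range(1000)]
estimate = sample_mean(sample)
print(estimate)  # close to 0.5, but not exactly
```

Note that the estimate is close to the true value but not equal to it; a different sample would give a different estimate.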
I've also mentioned this idea of a sample. For the most part I will be referring to a random sample, which means that each observation in the sample is drawn at random from the population.
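In Python, drawing a random sample from a known, finite population can be sketched like this (the population here is just a placeholder list):

```python
import random

random.seed(0)
population = list(range(100))           # a placeholder population
sample = random.sample(population, 10)  # 10 elements drawn at random, without replacement
print(sample)
```

In practice you rarely have the whole population sitting in a list (that's the entire problem), but the principle is the same: every member should have an equal chance of being picked.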
One of the major mistakes people make when doing statistics is to assume that estimators can be manipulated just like other functions. This is incorrect, because the estimator is usually a random variable, often because it is a function of a random error term (which, as has been pointed out before, is not always truly random). This random error term is the combination of all the unobserved aspects of a variable. For example, when you're measuring the execution speed of a program, the random error might be caused by how many other processes are running, how much CPU time those processes used, how hot your computer is running, the one-in-a-million time that something has to swap, etc. Basically, anything that you do not explicitly put into your equation is captured by the random error.
What is this random variable thing? A random variable is a variable that doesn't take a single value the way regular variables do. Instead, each time you look at it, it may have a different value. You analyze these by associating a probability distribution with the random variable, which lets you calculate certain properties of it. Examples include the binomial distribution (the number of sixes you get in a fixed number of rolls of a fair die follows this one) and the normal distribution. For a more precise definition of a random variable, look on Wikipedia.
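The dice example can be simulated directly. Here is a sketch that observes the random variable "number of sixes in 20 rolls of a fair die" many times and tallies how often each value occurs; the resulting shape follows a binomial distribution with n = 20 and p = 1/6:

```python
import random
from collections import Counter

random.seed(1)

def sixes_in_20_rolls():
    # One observation of the random variable: count the sixes
    # among 20 fair die rolls (binomial with n=20, p=1/6).
    return sum(1 for _ in range(20) if random.randint(1, 6) == 6)

# Observe the random variable many times and tally its distribution.
counts = Counter(sixes_in_20_rolls() for _ in range(10_000))
for value in sorted(counts):
    print(value, counts[value])
```

Each call to `sixes_in_20_rolls()` gives a different value - that is exactly what makes it a random variable - but the tally of many observations has a stable, predictable shape.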
There are certain hiccups when it comes to working with random variables, in that the standard mathematical operations don't quite work the same as with regular variables. For example, if you do a least squares regression (I'll explain this in detail in a later post) to obtain an estimate of ln(y), you can't just raise e to the power of that estimate to get an estimate of y. It doesn't work that way. This kind of thing also comes up when you have heteroskedasticity, which is when the variance of the error term is a function of something you are measuring - for example, they could be linearly correlated.
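You can see the ln(y) problem numerically. In the sketch below (the model ln(y) = 1 + e with normally distributed errors is made up for illustration), exponentiating the mean of ln(y) systematically underestimates the mean of y, because E[exp(e)] > 1 for a normal error even though E[e] = 0 (Jensen's inequality):

```python
import math
import random

random.seed(7)

# Hypothetical model: ln(y) = 1 + e, with e ~ Normal(0, 1).
# The mean of ln(y) is an unbiased estimate of 1, but exponentiating it
# does NOT give an unbiased estimate of the mean of y, because
# E[exp(e)] = exp(sigma^2 / 2), which is greater than 1.
log_y = [1 + random.gauss(0, 1) for _ in range(100_000)]
y = [math.exp(v) for v in log_y]

naive = math.exp(sum(log_y) / len(log_y))  # exp of the mean of ln(y)
actual = sum(y) / len(y)                   # the mean of y itself

print(naive)   # close to exp(1)   ~= 2.72
print(actual)  # close to exp(1.5) ~= 4.48
```

The gap between the two numbers is not sampling noise; it is a systematic bias that no amount of extra data will remove.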
This entry was a bit boring and very theory-heavy. The next entry will have some practical examples that you can use with the programs you write.