May 7, 2010

Computing P-values

A commenter named Blues asked me a question on my post about differences of means:
... could you explain how to do the math without R? ie, if I wanted to do the calculations on paper, or write this in javascript, what are the formulae I'd use, and what's the logic behind them?
This question is about calculating the p-value for a statistic.

Unfortunately this is not that simple a problem. If you're using the standard normal distribution, the equation looks something like this:
f(x) = 1/sqrt(2 * pi) * exp(-x*x/2)
It's not a terrible formula, however the main problem is that the p-value is an area under this curve, and this function has no anti-derivative. This means that in order to calculate the integral, you need to use numerical methods instead of the traditional way of solving the integral.

Single-variable integrals for nice continuous functions like this are not very difficult at all using numerical methods. It amounts to just adding up the area of boxes. The smaller the width of your boxes, the more precise you will be.

In our case, we're trying to calculate a p-value. The first thing you need to decide is if you're doing a one-tailed or a two-tailed test. In a two-tailed test, you're saying in your null hypothesis that the thing you are trying to test is equal to some certain value, call it c. Therefore if your statistic is either much lower or much higher than c, you reject the null hypothesis. In a one-tailed test, you're using an inequality in your null hypothesis. You'd say something like, "the value I'm trying to estimate is less-than or equal to c." Therefore you'd only reject the null hypothesis if the test statistic was much higher than c.

In the end, there isn't a huge difference in the math. You compute some value, and in a two-tailed test you'll multiply this value by 2.

So what is the code to do this? Well we need to first define our problem. What we're doing is the more general concept of integration of a function between two values. The Javascript version (I'll use Javascript because that was the language asked in the comment) looks like this:
function integrate(a, b, f){
  // what we are trying to do is find the area under f in between a and b
  // assume that b > a
  var dx = (b - a) / NUM_BOXES; // NUM_BOXES is some large constant
  var x = a;
  var sum = 0;

  while (x < b){
    // add on the size of the box
    sum += dx * f(x);
    x += dx;
  }
  return sum;
}
This algorithm is pretty good, but in the case of a decreasing function it will always over-estimate the area, and for an increasing function it will always under-estimate the area. You can get a better estimate by using a mid-point:
sum += dx * f(x + dx / 2);
It's pretty obvious that the larger NUM_BOXES is, the more precise your result will be. However it will come at a cost, the number of times the loop is executed depends on the value of NUM_BOXES. For our p-value, what we want to do is calculate the area under the normal distribution from the absolute value of the calculated statistic (from now on I'll call this value z) to positive infinity. Unfortunately since we can't go to positive infinity, we'll have to stick to some high number. Since the normal distribution drops very quickly, once you're past a value like 5 you have a pretty decent approximation. If you want more precision, go higher. All that said, you'd calculate your p-value like this:
pvalue = integrate(Math.abs(z), 5,
  function(x){
    return 1 / Math.sqrt(2 * Math.PI) * Math.exp(-x * x / 2);
  });
If it is a two-tailed test, just multiply this value by 2. So this is for the normal distribution, but it wouldn't be too hard to figure it out for more advanced distributions like Χ2 or t. You just need to plug in the correct formula for each distribution and then off you go.

2 comments:

Anonymous said...

f(x) = 1/sqrt(2 * pi) * exp(-x*x/2)
should be
f(x) = (1/sqrt(2 * pi)) * exp(-x*x/2)

Rob Britton said...

They're the same expression (at least in most programming languages). The * and / operators have the same precedence, so the left-most / gets evaluated before the * does.