While I was reading Knuth talks a bit about how to test to see if a sequence generated by a random number generator is actually random. I decided to stop reading there to see if I could think of a way to figure out this test myself. Here's my guess: if the correlation between the nth number generated and the (n + 1)th is zero for all n, then the sequence is random. Of course this is probably not an amazing test for randomness, but it is still fun to play around with.
How would you go about testing this? Well, I've been writing posts on statistics, so why not take a statistical approach? I'll generate a sequence of random numbers, and then do a regression on this equation:
ni = α + ρni - 1 + εWhat does this mean? Basically it measures the effect of a value produced by the random number generator on the next value that is produced. This effect is captured by the variable ρ. If it turns out that ρ is not statistically different from zero, then I can conclude that the sequence is random. The α is the average of the random number generator - for Ruby it is 0.5 since rand gives you a number between 0 and 1, with C the average should be RAND_MAX / 2.
Here's how we go about testing a random number generator. I'll be testing Ruby's built-in rand function. You can use a statistical tool like R to run this regression, however I will be shamelessly advertising my StatsAn project for this post.
First, the data. This code will put 27982 random numbers into a CSV file, which is what StatsAn uses at the moment (I just picked 27982 out of thin air, use whatever you feel like).
File.open("output.csv", "w") do |f|You can load this file into StatsAn by clicking "Import File" at the bottom and uploading the file. It will take a sec to upload.
When it's done, you'll see on the left bar under "Datasets" there is an n there. This is the dataset. You can run the regression equation that I mentioned before by clicking the command field at the bottom and running this command:
regress n lag(n)This spits out a whole bunch of numbers related to this regression (looking at them real quick I realized not all of them are totally correct, but the ones related to this example are). The ones we're interested in are in the big table at the bottom. There is a row named int which is the information about the intercept. The estimate for this is under the Coeff. column, which for me is 0.50159; and the confidence interval at 95% confidence is the Lower 95% and Upper 95% columns, which for me is [0.494806, 0.508382]. Since 0.5 is in this confidence interval, the estimated intercept (aka the α from our equation) is what we would expect.
Next, the coefficient estimate for the lagged value for my sample is 0.001113, which is pretty close to zero. This number reflects the correlation between the two generated random numbers. If we look at the p-value for this number it is 0.852255, which means that if the actual value of ρ is zero, then we have an 85% chance of getting a sample that gives us the estimate that we got. That's a pretty high chance, so we can't say that the actual value of ρ is different from zero. Therefore our estimate of ρ is not statistically different from zero, so we conclude that the generator is random.
Anyway as I said before this probably isn't a perfect test - I stopped reading the book to see if I could figure things out for myself, since that's how programmers roll! There are probably better ways to test the randomness of a random number generator. On top of that, there are probably a number of statistical problems with this that I haven't thought of. However it is a fun little example, and it is an easy way for anybody interested to fiddle around with a statistical tool.
UPDATE: There could indeed be problems with this. If it is the case that the system is actually random then we are fine. The problem is when the system is not random, then our test here might be a fair bit broken. Unfortunately the problems with it are a bit more advanced than anything I've talked about, so I won't go into detail on any but one: suppose the nth element is correlated with the (n + 1)th element, and it is correlated with the (n + 2)th element. Then we have an omitted variable bias here which might mess up our results.
By the way, if you notice any problems with StatsAn, let me know. It is still beta software!