Apr 30, 2010

Book Review: How to Lie with Statistics

I recently finished reading a book called "How To Lie With Statistics" by Darrell Huff. The name says it all: it is a book about a lot of the basic misconceptions that people have about statistics, and how some content writers will abuse those misconceptions (knowingly or unknowingly) in order to get a message across.

It was published quite some time ago (1954) but it is still quite relevant. While statistics has advanced a fair amount in that time (well I'm not sure about statistics itself, I assume it has, I know that econometrics has) the way people perceive basic statistics hasn't changed all that much.

So anyway, if you have been reading my statistics posts and like them, I recommend you check this book out. It's not quite as technical as the stuff I've been posting - which may or may not be a good thing depending on how you look at it - so it is a nice and easy read while you're on the bus or metro/subway. Oh, and there are lots of pictures, which are sometimes kinda funny and usually quite helpful.

Apr 26, 2010

Discovering Fractals: Brownian Trees

After I got back into generating fractals yesterday, I decided to do up another interesting one. This one is called the Brownian Tree. The difference between this one and all the other ones that I've done is that this one is stochastically generated, where the other ones are all deterministic. This means that each image is random, so I can't tell you how to generate each one exactly the same - although technically since computers use pseudorandom numbers and not real random numbers, I could just give you the seed and you're set. However I didn't record the seeds.

The way this one works is based on something in reality. You start with a world. Within this world, you fix a seed particle. Then you repeatedly add new particles to the world and have them float around. When the new particle bumps into the seed it becomes part of the seed.
In pseudo-C, it would look something like this:
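bool world[SIZE][SIZE]; // assume it is pre-set to false
world[rand() % SIZE][rand() % SIZE] = true; // the seed particle

for (int i = 0; i < NUM_PARTICLES; i++){
    particle = [rand() % SIZE, rand() % SIZE];

    while (true){
        projection = particle + random direction
        if (projection out of bounds){
            // do something, e.g. move the particle to a random spot
            particle = [rand() % SIZE, rand() % SIZE];
            continue; // never index world[] with an out-of-bounds projection
        }
        if (world[projection] == true){
            // bumped into the structure: the particle sticks where it is
            world[particle] = true;
            break;
        }
        particle = projection;
    }
}
plot(world);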
What this ends up doing is generating structures that look really organic. Here's an example:

[Image: Brownian tree]

It looks like a shrubbery!

So two things to note about the algorithm. When the particle goes out of bounds, I just put it in some random other spot in the world. Also when the particle collides with the structure, I keep track of how long it took to get there (it resets to 0 when it goes out of bounds). Based on how long it takes, I give it a different colour. This leads to the nice layering effects that you see there.
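Here's a rough Ruby sketch of the whole thing, including the two details above (respawning on out-of-bounds and tracking how long each particle wandered). The grid size, particle count, fixed random seed, centred seed particle and text rendering are all assumptions made to keep the example small; they are not the settings used for the pictures in this post.

```ruby
# Minimal Brownian tree sketch. SIZE, NUM_PARTICLES and the RNG seed are
# illustrative choices, not the values used for the pictures.
SIZE = 40
NUM_PARTICLES = 150
rng = Random.new(42)

# age[y][x] is nil for an empty cell, otherwise the number of steps the
# particle wandered before it stuck (0 for the seed particle).
age = Array.new(SIZE) { Array.new(SIZE) }
age[SIZE / 2][SIZE / 2] = 0 # seed in the centre (the post uses a random spot)

# Picks a random cell that is not yet part of the structure.
random_empty_cell = lambda do
  loop do
    x, y = rng.rand(SIZE), rng.rand(SIZE)
    break [x, y] unless age[y][x]
  end
end

NUM_PARTICLES.times do
  x, y = random_empty_cell.call
  steps = 0

  loop do
    # Step to one of the 8 neighbours (or stay put).
    px = x + rng.rand(3) - 1
    py = y + rng.rand(3) - 1
    steps += 1

    unless px.between?(0, SIZE - 1) && py.between?(0, SIZE - 1)
      # Out of bounds: drop the particle somewhere else and reset the clock.
      x, y = random_empty_cell.call
      steps = 0
      next
    end

    if age[py][px]
      # Bumped into the structure: stick where we are and record the age.
      age[y][x] = steps
      break
    end

    x, y = px, py
  end
end

# Crude text rendering: three glyphs stand in for three colour bands.
glyphs = ["#", "+", "."]
max_age = age.flatten.compact.max
age.each do |row|
  puts row.map { |a| a ? glyphs[a * glyphs.size / (max_age + 1)] : " " }.join
end
```

Swapping the glyph lookup for a colour lookup gives the layering effect; passing a different number to Random.new gives a different tree.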

You can change this algorithm in quite a number of ways. Here's a modification where when it bumps into the side of the window, it just sticks there:

[Image: Brownian tree where particles stick to the window edges]

Or one with multiple seeds:

[Image: Brownian tree with multiple seeds]

Or one where instead of using a point for the seed, you use a collection of points in a ring (I stole this idea from the Wikipedia page):

[Image: Brownian tree grown from a ring of seed points]

One thing I read briefly in my searching is that you can also generate music using fractals. This would be quite interesting, and would force me to learn how audio files work! Maybe I will post something on it at some point.

Discovering Fractals: The Lorenz Attractor

It's been a long time since I posted anything about fractals. I've been looking at them a bit more recently and have decided to put up more pretty pictures for you all. Today we're looking at the Lorenz Attractor, which is one of the earliest discovered chaotic systems - technically many systems had been discovered before that, but this is one of the ones that was discovered after someone (that someone being Lorenz) had figured out what chaos was.

It's a fairly simple system, although less simple than the previous ones I've written about. It follows a set of differential equations (I just grabbed these from the Wikipedia page):
dx/dt = σ * (y - x)
dy/dt = x * (ρ - z) - y
dz/dt = x * y - β * z
If we translate these to C (I do all my fractal stuff in C, since it seems to be hopelessly slow in anything else) we get:
for (t = 0; t < NUM_ITERATIONS; t++){
    // one Euler step: compute all three new values from the old x, y, z
    x1 = x + dt * sigma * (y - x);
    y1 = y + dt * (x * (rho - z) - y);
    z1 = z + dt * (x * y - beta * z);

    x = x1;
    y = y1;
    z = z1;

    // plot point
}
You'll want some initial conditions; I used x = 0.1, y = 0.0, z = 0.0. You also have a time-step dt, which I set to 0.001. This determines how "smooth" your picture will look.

Here's a picture of what it looks like:

[Image: the Lorenz attractor]

This is with NUM_ITERATIONS = 100 000. I just used a mostly orthographic projection - the x just maps to x, and y maps to y when you plot it on the screen, and you ignore z. I did a bit of scaling and translating to the x and y so that it fit nicely into an 800x800 window.
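For anyone who wants to play with this outside of C, here's a small Ruby sketch of the same Euler loop plus one possible way to do the scaling and translating. The parameter values (σ = 10, ρ = 28, β = 8/3) are the classic textbook ones and are an assumption, since the values behind the picture aren't given here; the bounding-box scaling and the red-to-blue colour ramp are likewise just one reasonable guess at how to do it.

```ruby
# Classic Lorenz parameters; assumed, not taken from the post.
SIGMA = 10.0
RHO = 28.0
BETA = 8.0 / 3.0
DT = 0.001
NUM_ITERATIONS = 100_000
WINDOW = 800

x, y, z = 0.1, 0.0, 0.0
points = []

NUM_ITERATIONS.times do
  # One Euler step: all three new values are computed from the old x, y, z.
  x1 = x + DT * SIGMA * (y - x)
  y1 = y + DT * (x * (RHO - z) - y)
  z1 = z + DT * (x * y - BETA * z)
  x, y, z = x1, y1, z1
  points << [x, y] # orthographic projection: keep x and y, ignore z
end

# Scale and translate so the bounding box of the trace fills the window.
xs, ys = points.transpose
min_x, max_x = xs.minmax
min_y, max_y = ys.minmax

pixels = points.each_with_index.map do |(px, py), t|
  col = ((px - min_x) / (max_x - min_x) * (WINDOW - 1)).round
  row = ((py - min_y) / (max_y - min_y) * (WINDOW - 1)).round
  blue = 255 * t / (NUM_ITERATIONS - 1) # 0 at the start, 255 at the end
  [col, row, [255 - blue, 0, blue]]     # red early, purple mid, blue late
end

puts "first pixel: #{pixels.first.inspect}"
puts "last pixel:  #{pixels.last.inspect}"
```

Each entry in pixels is a column, a row and an RGB triple, so it can be handed to whatever drawing library you like. Drawing them in order means later (bluer) points overwrite earlier ones.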

The interesting thing is the colours (it may look ugly, but there is some science behind it!). This picture is effectively a trace of a particle moving through time. The more red it is in this picture, the earlier it is in the particle's path, and the more blue, the later (obviously purple ones are in the middle). It's interesting because except for the little red swirls at the beginning there, the red, purple and blue paths are all fairly mixed up (it might look like the blue ones are more clustered, but keep in mind that blue pixels are plotted later, and therefore will overwrite any pixels that might have been there before). This is evidence of something called topological mixing, which as I understand from my brief digging through Wikipedia basically means that the paths taken at different time ranges will inevitably overlap.

Anyway, that's enough about the actual math behind chaos theory. You can probably even take on Jeff Goldblum now! I'll work on other fractals sometime and give you guys more neat pictures.

Apr 20, 2010

Testing Random Numbers

After reading Cryptonomicon, where they spoke a lot about random numbers, I was a bit curious about how that kind of thing all works. I remember when I took a course on cryptography we very briefly scanned over the topic; it had to do with discrete modular logarithms or something like that (it was a few years ago and my memory is a bit foggy). More recently I became interested in looking into this a bit further, so I grabbed a copy of The Art of Computer Programming, Volume 2 by Donald Knuth from Concordia's library, which covers random number generation in a fair amount of detail. It's quite interesting and I encourage you to grab a copy for yourself and check it out. The book is a bit dated and heavy in the math (which I know some programmers hate) but it is still a good learning experience.

While I was reading, I got to the part where Knuth talks a bit about how to test whether a sequence generated by a random number generator is actually random. I decided to stop reading there to see if I could think of a way to figure out this test myself. Here's my guess: if the correlation between the nth number generated and the (n + 1)th is zero for all n, then the sequence is random. Of course this is probably not an amazing test for randomness, but it is still fun to play around with.

How would you go about testing this? Well, I've been writing posts on statistics, so why not take a statistical approach? I'll generate a sequence of random numbers, and then do a regression on this equation:
n_i = α + ρ n_{i-1} + ε
What does this mean? Basically it measures the effect of a value produced by the random number generator on the next value that is produced. This effect is captured by the variable ρ. If it turns out that ρ is not statistically different from zero, then I can conclude that the sequence is random. The α is the intercept; if ρ really is zero, it is just the average of the random number generator - for Ruby it is 0.5 since rand gives you a number between 0 and 1, and with C the average should be RAND_MAX / 2.

Here's how we go about testing a random number generator. I'll be testing Ruby's built-in rand function. You can use a statistical tool like R to run this regression, however I will be shamelessly advertising my StatsAn project for this post.

First, the data. This code will put 27982 random numbers into a CSV file, which is what StatsAn uses at the moment (I just picked 27982 out of thin air, use whatever you feel like).
File.open("output.csv", "w") do |f|
  f.write("n\n")

  1.upto(27982) do
    f.write("#{rand}\n")
  end
end
You can load this file into StatsAn by clicking "Import File" at the bottom and uploading the file. It will take a sec to upload.

When it's done, you'll see on the left bar under "Datasets" there is an n there. This is the dataset. You can run the regression equation that I mentioned before by clicking the command field at the bottom and running this command:
regress n lag(n)
This spits out a whole bunch of numbers related to this regression (looking at them real quick I realized not all of them are totally correct, but the ones related to this example are). The ones we're interested in are in the big table at the bottom. There is a row named int which is the information about the intercept. The estimate for this is under the Coeff. column, which for me is 0.50159; and the confidence interval at 95% confidence is the Lower 95% and Upper 95% columns, which for me is [0.494806, 0.508382]. Since 0.5 is in this confidence interval, the estimated intercept (aka the α from our equation) is what we would expect.
Next, the coefficient estimate for the lagged value for my sample is 0.001113, which is pretty close to zero. This number reflects the correlation between the two generated random numbers. If we look at the p-value for this number it is 0.852255, which means that if the actual value of ρ is zero, then we have an 85% chance of getting a sample whose estimate is at least as far from zero as the one we got. That's a pretty high chance, so we can't say that the actual value of ρ is different from zero. Therefore our estimate of ρ is not statistically different from zero, so we conclude that the generator is random (or at least, this test gives us no evidence that it isn't).
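If you'd rather see what the regression is doing without a stats tool, here's a sketch that computes the same kind of estimate by hand in plain Ruby. It is not how StatsAn works internally; it is just the textbook ordinary least squares formulas for one regressor, and the fixed seed passed to Random.new is an assumption added so the run is repeatable (your numbers will differ from the ones quoted above).

```ruby
rng = Random.new(27_982) # arbitrary seed so the run is repeatable
n = Array.new(27_982) { rng.rand }

ys = n.drop(1)  # n_i
xs = n[0...-1]  # n_{i-1}, i.e. lag(n)

count = xs.size
mean_x = xs.sum / count
mean_y = ys.sum / count
sxx = xs.sum { |v| (v - mean_x)**2 }
sxy = xs.zip(ys).sum { |a, b| (a - mean_x) * (b - mean_y) }

rho = sxy / sxx                 # slope on the lagged value
alpha = mean_y - rho * mean_x   # intercept

# Standard error of rho, from the residual sum of squares.
rss = xs.zip(ys).sum { |a, b| (b - alpha - rho * a)**2 }
se_rho = Math.sqrt(rss / (count - 2) / sxx)
t_stat = rho / se_rho

puts format("alpha = %.5f, rho = %.5f, se = %.5f, t = %.2f",
            alpha, rho, se_rho, t_stat)
```

As a rule of thumb, with this many observations a t value between roughly -2 and 2 means ρ is not statistically different from zero at the 95% level.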

Anyway as I said before this probably isn't a perfect test - I stopped reading the book to see if I could figure things out for myself, since that's how programmers roll! There are probably better ways to test the randomness of a random number generator. On top of that, there are probably a number of statistical problems with this that I haven't thought of. However it is a fun little example, and it is an easy way for anybody interested to fiddle around with a statistical tool.

UPDATE: There could indeed be problems with this. If it is the case that the system is actually random then we are fine. The problem is when the system is not random, then our test here might be a fair bit broken. Unfortunately the problems with it are a bit more advanced than anything I've talked about, so I won't go into detail on any but one: suppose the nth element is correlated with the (n + 1)th element, and it is correlated with the (n + 2)th element. Then we have an omitted variable bias here which might mess up our results.
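To make the "not random" case a bit more concrete, here's a toy Ruby example of a sequence that this test would happily pass even though it is obviously not random. This is a simpler failure than the omitted variable problem above: the made-up sequence emits every pair of random numbers twice (a, b, a, b, c, d, c, d, ...), so neighbouring values are uncorrelated but values two steps apart are strongly correlated. The sequence, the seed and the correlation helper are all invented for illustration.

```ruby
# Sample correlation between two equally long arrays of numbers.
def correlation(xs, ys)
  count = xs.size.to_f
  mean_x = xs.sum / count
  mean_y = ys.sum / count
  cov = xs.zip(ys).sum { |a, b| (a - mean_x) * (b - mean_y) }
  var_x = xs.sum { |a| (a - mean_x)**2 }
  var_y = ys.sum { |b| (b - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

rng = Random.new(1)
# Each pair of random numbers is emitted twice: a, b, a, b, c, d, c, d, ...
seq = Array.new(5_000) { [rng.rand, rng.rand] }.flat_map { |pair| pair + pair }

lag1 = correlation(seq[0...-1], seq.drop(1))
lag2 = correlation(seq[0...-2], seq.drop(2))
puts format("lag 1 correlation: %.3f", lag1)
puts format("lag 2 correlation: %.3f", lag2)
```

The lag 1 number comes out close to zero while the lag 2 number is nowhere near it, so checking only the (n + 1)th element is not enough on its own.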

By the way, if you notice any problems with StatsAn, let me know. It is still beta software!

Apr 19, 2010

Ruby Syntax Gem

I've been using a gem called syntax to transform Ruby code into something that looks nice in HTML. It is quite easy to install:
sudo gem install syntax
And easy to use:
require 'rubygems'
require 'syntax/convertors/html'

convertor = Syntax::Convertors::HTML.for_syntax "ruby"
html = convertor.convert(File.read(ARGV[0]))
By default it supports Ruby, XML and YAML, and there might be a bunch of other ones out there on the Internet somewhere.

Speaking of syntax, I'm wondering if anybody knows of a good colour scheme for Gvim. I've been using one called pablo that comes with it for a while since it is pretty much the only dark theme that they have that doesn't make my eyes bleed, but it still isn't amazing. Does anybody know of one that looks really good that they are willing to share?

Apr 13, 2010

Mac Fonts in Ubuntu

Back when I was working with a bunch of Mac guys, they would always complain about looking at my screen because the fonts were bad. I do believe in the importance of aesthetics, so I decided to just put the same ones they had on my computer. You can get these fairly easily: just open Applications->Accessories->Terminal and copy-paste this:
wget http://ubuntu-debs.googlecode.com/files/macfonts.tar.gz
tar zxvf macfonts.tar.gz
sudo mv macfonts /usr/share/fonts/
sudo fc-cache -f -v
One issue with this is that it messes up how Firefox renders input tags. Supposedly this is because the Lucida Grande font included in this macfonts bundle is not correct, so you can get the real one here. Just download the file there, and put the font files into your /usr/share/fonts/macfonts folder and all will be well!

Apr 9, 2010

Ubuntu for Web Development, Revisited

I was reading a post I made a couple of years ago, and it strikes me as amazing how much I've learned and changed in those years! So I'm revisiting this post with the way that I see things now.

The question is, "Can you use Ubuntu for web development?" The answer is yes, since that's what I've been using since I left Mansef in the summer of 2008. I've discovered it works really well, since everything is right there for you. It is quick and easy to set everything up, and there are a number of great tools available for you to use for coding.

A quick disclaimer before I get too into this: I am not trying to say that Ubuntu is better than other platforms for web development. I'm saying that if you want to use Ubuntu for development then it is quite possible and you're not really going to hurt yourself by trying - except for maybe the little hiccups in the beginning when you're still getting used to the system, but this is true for any software.

I'll talk about various platforms:
  • Java/JVM Languages - I haven't really done any Java development under Ubuntu, so I can't really give a good answer about it. I've done a fair bit of JRuby which uses a lot of the same tools and software, but they aren't all the same. Anyway I'm assuming that since Java stuff is all cross-platform, you can use all the same tools that you're used to using. Eclipse and Netbeans work fine under Linux, as do servers like Tomcat and Glassfish.
  • Ruby - I don't really need to say much here, since most Ruby developers seem to view Linux as a suitable development platform anyway. From my experience with Ruby development you usually just use a text editor and a command line for everything, which Linux is ideal for. The only reason not to use it is if you're married to Textmate.
  • Others - I'm not too well versed with other development like with .NET or Python, so I can't really give a good analysis here. I'm going to assume that if you're using .NET then you're pretty screwed with anything that isn't Windows, although Mono seems to be doing fairly well. As for Python I can't think of any reason why using Ubuntu would be bad, but since I haven't done too much Python I can't really say. Maybe if there is a Python expert in the audience they can shed some light on this one.
The big one I want to talk about is PHP, mainly due to its popularity and also because it's the one web language that I've worked with extensively under both Windows and Ubuntu. So let's dive in!

Setup


Setting up PHP under Windows is a pain. Where I worked we used a Linux server and mounted a network share as our dev drive. This is not a bad solution, however to me it seems a bit inefficient to have to use another computer to do your development when you can just do it all on your own machine. This gets even more problematic when you don't have admin rights to the dev server and can't fiddle with things (this gets really bad when files/folders are owned by www-data, so you can't modify them without executing some PHP code from the server).
If you want to install everything on your own computer, you have to go out and download all the software manually and install it. I'm not sure how much configuring you have to do for this either, but it is still a pain.
After a bit of research I discovered PHP Triad, which supposedly solves a lot of these problems. I'll have to check this out later on.

Setting up PHP under Ubuntu on the other hand is trivial. From a terminal (you can also install these via System->Administration->Synaptic Package Manager, it's just usually faster to use a terminal) you can just do:
sudo apt-get install apache2 php5 mysql-server php5-mysql
This will install the web server, PHP, and the DB server; it will also come pre-configured for Ubuntu so that you don't really have to tinker with anything. If you go to your browser after doing this and browse to http://localhost/, you should see something that says "It works!" This means that Apache is installed properly. Now you can do everything you need to do without worrying about network shares or any of that. The nice thing about this is that you don't have to worry about the network going down, permissions, other people messing with your stuff, etc.
The default folder for Apache is /var/www. Typically what I do since I'm usually working on different sites is make sub-folders in here for each project and then give myself write access to them:
sudo mkdir /var/www/my_project
sudo chown rob:rob /var/www/my_project
You would then access this project by going to http://localhost/my_project. Make sure to use your username instead of "rob".

Tools


Of course, the system wouldn't be very good without any development tools. Let's talk about a few of these:
  • IDEs/Text Editors - I've found that there are plenty more IDEs available for Windows than for Ubuntu. I've used Dreamweaver and PhpEd under Windows, and Quanta Plus for Linux. Of these, PhpEd is by far the best since it is an actual PHP editor rather than an HTML-editor-turned-PHP-editor and it comes with nice things for profiling and debugging. On the other hand it is not free in any sense of the word, so be prepared to shell out some cash if you want it.
    Ones that I haven't tried include PHP Triad which is Windows only, Bluefish which is Linux only, and Eclipse and Netbeans which are cross-platform.
    From this you'd say that Windows has the better editors, and I'm not sure if I'll argue against that - although if you're one of those folks who swear by Eclipse or Netbeans (and there are a lot of them out there!), then you're really not constrained by the operating system.
    These days though, I do all my development using Vim. I'd argue that it is important to know how to use Vim at least at a basic level, since so many of the *nix/BSD servers out there might only have Vim installed on them - although once I used a Linux server that had pico as the only editor installed, which was odd.
    Since this is an article about the merits of Ubuntu, not Vim, I'll save the Vim argument for another day. However since it and its popular alternative, Emacs, are available for most desktop operating systems, there isn't really a reason to use one OS over another.
    So who wins here? I'd say Windows if you're in love with a specific Windows-only editor, however if you're willing to learn something else then there are plenty of good programs available in Ubuntu.
  • Databases - Most PHP apps use some kind of database, whether it is MySQL, PostgreSQL, etc. If you're using a database that doesn't work under Linux (like MS SQL) then you're probably better off not using Ubuntu, but pretty much everything else will work fine. If you're using MySQL or PostgreSQL it is nice and easy to install phpmyadmin or phppgadmin via Synaptic or apt-get to give you a database interface that is slightly easier to use than the command line. Who wins here? Nobody really, since databases other than MS SQL are available for most operating systems.
  • Command-line Tools - Here is where Linux wins. I know many people hate the command line (I used to as well) because it has a steeper learning curve and/or feels archaic, but it is an extremely efficient way to get things done sometimes. When you couple it with simple tools like grep, or bash scripting, or with specific programs like ffmpeg/mencoder/ImageMagick you can get your PHP app to do a lot of interesting things.
Anyway in conclusion I feel that Ubuntu is quite well suited to web development in most languages - the exception being any Microsoft languages. You have a large variety of tools available to you (probably not quite as large as with Windows, but the nice thing about Linux tools is they're usually free) many of which are well proven in the market. There is a bit of a learning curve to Ubuntu if you've never used it before (although I think that if you're reading this blog, chances are you know your way around Ubuntu) and that might deter a few folks, but if you're wanting to use Ubuntu for web development then you can go for it and you're not putting yourself at a disadvantage.

Apr 8, 2010

Statistics For Programmers VI: Omitted Variable Bias

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

In the fourth post in this series, I spoke briefly about the assumptions required for using regression to analyze your data. One of them was about the error term having zero mean conditional on your independent variables - this means that if you hold your independent variables fixed, then the average of your errors should be close to zero.

There are certain situations when this assumption can be violated. One common one that I'll be talking about today (obviously, given the title of the post) is called omitted variable bias. This occurs when you don't include an independent variable that has an effect on the dependent variable, and also is correlated with other independent variables (I believe omitted variables are also called confounding variables). What happens when you use ordinary least squares is that your regression doesn't know that the variation in the omitted variable (let's call it c) is causing changes in the dependent variable (call it y). Assuming that c is correlated with an independent variable x, the regression cannot tell whether the variation in y is being caused by x or by c since you haven't included c in the model. Therefore it will attribute all of that variation to x.

Let's look at an example. Suppose you've been arguing with someone on reddit about agile vs. waterfall and you want to prove which one of these makes programmers more productive (while I don't know if one is actually better or not, let's assume for illustration purposes that one of them is). Now suppose you've read all my previous posts on statistics, and you've taken a course in statistics in university, so you feel fairly confident in your ability to do statistical analysis. You go out and start collecting some data from various companies (suppose they cooperate and give you their data, and that it is reliable). You collect various types of data: how big their team is, the experience of the team members, the language(s) and libraries they're using, and finally whether they are using agile or waterfall. Also suppose you have some reliable way of measuring productivity (this is another one we'll have to assume, since measuring programmer productivity isn't that easy). So you run a regression on all this and it turns out that agile has a significantly larger effect on programmer productivity than waterfall. You post your results on reddit and say, "Ha! The data says so!"

Unfortunately there is a problem here. What you haven't included is the average ability of the programmers at the various companies. It is fairly certain that the average ability of your programmers will impact productivity (of course that will require you to hold constant any synergistic effects between programmers - see, this stuff is hard!). Again for illustration purposes let's pretend that hot-shot programmers like to work in agile environments rather than waterfall environments, so the average ability for programmers will be higher for agile than for waterfall. So we have an omitted variable that is correlated with both the dependent variable (productivity) and the independent variable we are interested in (agile vs. waterfall). This will bias our estimate for the effect of agile vs. waterfall, and the size of the bias typically looks like this:
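bias = (effect of average ability on productivity) *
       (correlation between average ability and agile vs. waterfall)
Since these are both assumed to be positive values, the bias will be positive (since I'm assuming ability has a large effect on productivity, this bias will also be large) - the effect of agile will be very much overstated in your analysis. Strictly speaking, the second factor is the slope you get from regressing average ability on the agile vs. waterfall variable rather than the correlation itself, but the two always have the same sign.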
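Here's a small simulation in Ruby that shows the bias showing up. Every number in it is made up: the true effect of agile (1.0), the effect of ability (5.0), the amount by which agile teams are assumed to be more able (0.5) and the noise are all hypothetical values picked so the result is easy to read. It leaves ability out of the regression on purpose and compares the resulting bias to what the formula predicts.

```ruby
# Slope of a one-variable least-squares regression of ys on xs.
def slope(xs, ys)
  mean_x = xs.sum.to_f / xs.size
  mean_y = ys.sum.to_f / ys.size
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  sxy / sxx
end

rng = Random.new(2010)
true_agile_effect = 1.0 # hypothetical
ability_effect = 5.0    # hypothetical

rows = Array.new(20_000) do
  agile = rng.rand(2)               # 0 = waterfall, 1 = agile
  ability = 0.5 * agile + rng.rand  # hot-shots lean towards agile
  noise = rng.rand - 0.5
  productivity = true_agile_effect * agile + ability_effect * ability + noise
  [agile, ability, productivity]
end

agile, ability, productivity = rows.transpose

naive = slope(agile, productivity) # regression with ability left out
delta = slope(agile, ability)      # how ability moves with agile
predicted_bias = ability_effect * delta

puts format("naive estimate: %.3f (true effect: %.3f)", naive, true_agile_effect)
puts format("actual bias:    %.3f", naive - true_agile_effect)
puts format("predicted bias: %.3f", predicted_bias)
```

The naive estimate lands far above the true effect, and the gap matches the predicted bias almost exactly, which is the formula above at work.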

Of course, you could always decide to lie with your statistics and hope that your target audience doesn't know enough about statistics to properly argue against your results (especially if you include math, they might not even read the analysis!). This is not a very respectable thing to do, but that doesn't stop people from doing it anyway - how many statistics do you see in the newspaper, magazines, blogs etc. that do not include a standard error and confidence interval? Or when they say average/mean, do they say which type of average/mean? Or do they tell you their sample selection methods? I could go on like this for a while.

Anyway let's assume that you decide to do everything legit and you want to control for programmer ability. The problem with this is how do you give an objective number to each programmer? Sometimes in labour economics worker ability is treated as an unobserved variable, which is a variable for which you cannot get a good quantitative value. So how do you go about measuring this? There are some techniques that have been developed like instrumental variables which can help, but typically you're going to have to make some trade-off between biased results and imprecise results.
In fact, productivity may also be considered an unobserved variable!

On another note, I'd like to get some feedback on these posts. Do people find them helpful, or am I just boring you with what you learned in your stats classes?
