Apr 23, 2011

ProjectDrinks, Take 2

Just a reminder to anybody who is interested, this Monday is the second ProjectDrinks meetup. It will be at 6:30pm at Trois-Brasseurs, at the corner of St. Catherine and Crescent, downtown Montreal.

Last time people had some trouble finding the table, so it will be the one with a laptop on it.

I was contacted last time about Notman House, which is supposedly a meeting place for web developers and tech entrepreneurs. It looks kinda neat, although the idea of the house seems to be very startup-oriented, which isn't quite in the same direction that I'd like to see ProjectDrinks go. I'll check it out at some point once I'm done exams and let people know how it goes.

Apr 21, 2011

Bad Statistics: My Mistake

I was watching this video today and one comment the speaker made was that men's incomes have not increased but women's incomes have. I thought this was interesting and decided to look into it.

However, while I was thinking about it, I realized that my last wage analysis only compared 1986 with 2006, which wasn't a great idea.

Here's why. What's the difference between this graph:

And this one:

Although these graphs look nothing alike, when you only have two data points you can't actually tell which of the two graphs produced them.
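To make this concrete, here is a minimal sketch with made-up numbers. The wage index, the 8-year cycle length and the function names are all assumptions for illustration, not real data. One series is a steady upward trend, the other has no long-run growth at all and just swings with the cycle, yet both pass through the same values in 1986 and 2006.

```python
import math

def steady_trend(year):
    """Hypothetical wage index that rises 1 point per year from 100 in 1986."""
    return 100.0 + 1.0 * (year - 1986)

def pure_cycle(year):
    """Hypothetical wage index with no long-run growth: it swings between
    100 and 120 on an 8-year cycle, with a trough in 1986 and a peak in 2006."""
    return 110.0 - 10.0 * math.cos(2 * math.pi * (year - 1986) / 8)

# The two series agree in 1986 and 2006 but disagree in between (e.g. 1990).
for year in (1986, 1990, 2006):
    print(year, round(steady_trend(year), 1), round(pure_cycle(year), 1))
```

If all you observe are the 1986 and 2006 values, nothing in the data distinguishes the first story from the second.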

That's the issue with my last analysis. We have these weird things in the economy called "business cycles", which you may have heard of, where the economy is doing "well" or "poorly". Unfortunately these cycles could screw up the analysis that I did before: it's possible that wages haven't gone up at all in a general sense, and that the economy was simply in different states in 1986 and 2006. In 2006, for example, the Canadian economy was doing very well due to high commodity prices, whereas 1986 wasn't much of a boom period.

This is an example of a very basic statistical problem: model misspecification. In many cases when you're doing statistics you have some sort of model that you are trying to fit the data to. If the model you propose is correct then cool, but if it is incorrect you might not know. Many statistical procedures, such as least-squares regression, will still run when you give them a bad model, but the results will be meaningless. Worse, the results might look reasonable while being completely incorrect. This is the most dangerous case, since without further analysis there is no real indication that your results are wrong. Fortunately there are tests, like the RESET test, that check for this kind of thing, but you have to know about them in order to actually use them (obviously). They also aren't foolproof: a rejection is evidence that you do have misspecification, but a non-rejection only tells you that you might not have it. As with most things in statistics, you don't get a crisp yes/no answer.
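Here is a small sketch of the "still runs, but is wrong" case, using made-up data. The data follow a curve (y = x^2), but the model fitted is a straight line. The helper name fit_line is just for this example, and looking at residual signs is an informal check, not the RESET test itself.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: the true relationship is curved, not linear.
xs = list(range(11))
ys = [x ** 2 for x in xs]

# The regression happily returns an intercept and a slope anyway.
intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("fitted line: y = %.1f + %.1f x" % (intercept, slope))
print("residual signs:", "".join("+" if r > 0 else "-" for r in residuals))
```

The fit produces a perfectly ordinary-looking slope, but the residuals are positive at both ends and negative in the middle. That kind of systematic pattern is a hint that the straight-line model is the wrong one.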

The moral here is that even if you have a representative sample and you have good intentions (i.e. not screwing with the results so things look the way you want them to) you can still get bad results by applying the wrong procedures.

Back to the original question: how have wages changed over the years with respect to men and women? Keeping in mind all the stuff that I just said, here are the results:
Median income (2010 dollars)
         Earlier year    Later year
Men      37472 (106)     38442 (182)
Women    19331 (72)      25628 (87)
The numbers in brackets are the standard errors, which is how statistics are reported in real papers. If you ever read a paper that compares statistics, check whether it reports these numbers. If not, you can't really be too sure about the comparisons - a good example is a survey on game piracy that I critiqued a while back.
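As a rough sketch of why the standard errors matter, here is one common way to use them when comparing two estimates. It assumes the two estimates come from independent samples and are roughly normally distributed, which the data here may or may not satisfy. The function name compare is made up for this example.

```python
import math

def compare(earlier, se_earlier, later, se_later):
    """Return the change between two estimates, the standard error of that
    change, and a z-score. Assumes the two estimates are independent."""
    diff = later - earlier
    se_diff = math.sqrt(se_earlier ** 2 + se_later ** 2)
    return diff, se_diff, diff / se_diff

# Figures from the table above: estimate and standard error for each column.
men = compare(37472, 106, 38442, 182)
women = compare(19331, 72, 25628, 87)

for label, (diff, se_diff, z) in (("Men", men), ("Women", women)):
    print("%s: change = %d, standard error = %.0f, z = %.1f"
          % (label, diff, se_diff, z))
```

A large z-score only says the change is unlikely to be sampling noise. It says nothing about whether the change is big in dollar terms, or whether the two years being compared sit at similar points in the business cycle.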

So women's wages have gone up a fair bit (about 33%), while men's wages have barely moved (under 3%) - consistent with the claim in the video. Whether this is a general trend or some blip due to the time periods chosen we can't say with the data here. I can probably find data if I dig around a bit if people are curious.

Apr 18, 2011

Fixing a Dead MySQL Server on Ubuntu

I'm doing some freelance work for a company and unfortunately the other day their DB server crapped out. On top of that, there were some issues with the backup script, so the last backup done was quite some time ago.

Fortunately though, the hard drive was still intact. Here is a little guide to restoring your MySQL DB in the case that your DB server dies and you have no backups. This is a guide for Ubuntu server 10.04, but it shouldn't be too different with other versions of Ubuntu or other Linux distributions.

There is one requirement here: the hard drive still needs to work. If the hard drive is toast, then you're in trouble.

1) We had another computer that wasn't being used that had been purchased to be set up as a backup server. It wasn't set up yet, so I just stuck Ubuntu Server on it and plugged the old hard drive in.
2) Make sure MySQL is installed: sudo apt-get install mysql-server
3) Stop the MySQL server: sudo /etc/init.d/mysql stop
4) MySQL keeps the database files in /var/lib/mysql. Copy these from the old hard drive to the new hard drive. It might be a good idea to not copy over /var/lib/mysql/mysql (it holds the user accounts and permissions), however I didn't have any problems after copying that over (yet). If you leave it out and the database doesn't work, copy that folder too. Also, don't copy over any .pid files.
5) Set permissions: sudo chown -R mysql:mysql /var/lib/mysql
6) Restart the MySQL server: sudo /etc/init.d/mysql start

Now the MySQL server should be just like it was before the system died. If you didn't copy over the /var/lib/mysql/mysql folder then you'll have to recreate any users that you had. If the whole thing just doesn't work then just copy that folder too.
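If you want to double-check what step 4 is going to copy before touching anything, here is a small Python sketch. The function name plan_copy and the skip_system_db flag are made up for this example. It only lists files, and does not copy or change anything.

```python
import os

def plan_copy(data_dir, skip_system_db=True):
    """List the files (relative to data_dir) that step 4 would copy.

    .pid files are always left out. If skip_system_db is True, the
    top-level "mysql" folder (users and permissions) is left out too."""
    planned = []
    for root, dirs, files in os.walk(data_dir):
        rel_root = os.path.relpath(root, data_dir)
        if skip_system_db and rel_root == ".":
            # Prune the system database so os.walk never descends into it.
            dirs[:] = [d for d in dirs if d != "mysql"]
        for name in files:
            if name.endswith(".pid"):
                continue  # stale .pid files from the dead server
            planned.append(os.path.normpath(os.path.join(rel_root, name)))
    return sorted(planned)
```

For example, if the old drive were mounted at /mnt/old (a hypothetical mount point), plan_copy("/mnt/old/var/lib/mysql") would list what to bring over, and passing skip_system_db=False would include the mysql folder as well.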

Hopefully this saves some people some pain! I was particularly annoyed since I forgot step 5, so the MySQL server was able to list the other databases but kept saying that there were no tables in any of them. Not good! In the end it just turned out to be a permissions issue which is no big deal.

Anyway the more important thing to take out of all this is to make sure that your backup scripts work.

Apr 17, 2011

Empirical SoEn is Hard: Cost

In economics, there is a small sub-branch called experimental economics, which focuses on applying the experimental method to better understand economic phenomena. It has had some interesting results in small cases - especially in the field of game theory with things like the ultimatum game - but most of the papers I've read are done at such a small or simple scale that they don't really apply to a real economy. In order to make the results more interesting, the number of people involved must be a bit higher, and the experiments need to be done with a longer timeline. Unfortunately, the cost of doing this is prohibitive (university departments are typically strapped for cash).

I've done a lot less reading of software engineering papers, but in my head it also seems like cost would pose a very large problem for empirical software engineering research. You could do some simple experiments with small groups of developers for small projects, however in reality many projects take large groups of developers working full-time for months or even years to complete. Having a researcher pay these developers to do experiments might be prohibitively expensive. On top of that, to be able to make any decent claims you'll need to do a rather large number of these experiments, which just blows the cost of everything through the roof.

At this point you can't even use experiments anymore to do detailed study of software engineering. In the business world, where you might have the resources to do basic experiments with large-scale projects, there is typically an aversion to risk, so businesses won't stray too far from "proven" techniques, since unproven ones might increase the cost of doing business.

In economics we typically rely on observational data to do our analysis: we can collect data from the actual economy and analyze it, however we can't reach out and change whatever parameter we like to see how it will affect things. We also can't "start over" to see what happens if we change a few of the initial conditions, unless somebody develops time travel and goes back to 1950 to fiddle with unemployment rates or government policy to answer "what-if" questions. You can't impose controls the same way that you can when experimenting.

I believe that software engineers are in the same boat. Due to cost concerns, you must rely more on observing actual software engineers at work than in the laboratory. This comes with all the problems with observational data: you can't go in and change the number of developers, or swap developers out with developers of different skill levels, or any of these other things that you might be able to try in an experiment. Also in observational data it is often the case that not just one thing changes at a time: sometimes one factor will change, but another will change at the same time since they are correlated - it is difficult to distinguish the effects of one variable from the effects of another when you aren't able to hold them fixed.
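Here is a toy simulation of that last problem. Everything in it is made up: the variable names, the effect sizes, and the assumption that bigger teams tend to have more experienced developers. In this pretend world only experience affects productivity, and team size has no effect at all. All quantities are in arbitrary standardized units.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

experience = [rng.gauss(0, 1) for _ in range(n)]

# Observational setting: team size moves together with experience.
team_size_observed = [e + rng.gauss(0, 0.5) for e in experience]

# Experimental setting: the researcher assigns team size at random.
team_size_assigned = [rng.gauss(0, 1) for _ in range(n)]

# Productivity depends on experience only, never on team size.
productivity = [2.0 * e + rng.gauss(0, 0.5) for e in experience]

obs_slope = ols_slope(team_size_observed, productivity)
exp_slope = ols_slope(team_size_assigned, productivity)
print("apparent effect of team size, observational: %.2f" % obs_slope)
print("apparent effect of team size, experimental:  %.2f" % exp_slope)
```

In the observational data team size looks like it matters a lot, purely because it travels together with experience. When team size is assigned at random the apparent effect mostly disappears, which is exactly the control you don't get when you can only watch real projects.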