May 26, 2010

The Plural of Anecdote is Not Data

One common logical fallacy that people use when arguing (especially online) is treating anecdotal evidence as actual evidence. This means that they find some example of something that corresponds with their argument, and use that example as proof of their argument.

You might think at first, "what's so bad about that? They're using evidence to support their argument, isn't that what you're supposed to do?" Yes, I suppose this type of argument is better than just pulling something out of your ass with absolutely no evidence behind it (which is probably just as common online as anecdotal evidence). However, that doesn't mean it is good. The issue is that for pretty much any side of an argument you can find an example that supports that side. A common one that is loved by the Canadian Pirate Party is, "so-and-so artist says that he/she likes piracy because it boosts concert sales, therefore piracy is good." This could be a legitimate argument, and economically it is possible: a reduction in the price of a good (the music) increases demand for a complementary good (the concerts), but without detailed statistics the effect on the musician's overall welfare is ambiguous. The problem is that one example alone does not justify the argument. Instead you would need a lot of data from a good sample to verify the actual effect.

From a more formal standpoint, the root of the problem is a sampling bias. You are taking a potentially non-representative subset of the population and claiming that it is representative. There is no guarantee that this particular anecdote is representative, so any conclusion you derive from the anecdote about the population isn't really valid. In the example in the last paragraph, they're using a subset of a subset of the population (a few musicians do not represent all artists).

This isn't to say anecdotes are all bad. You can derive useful insights about larger phenomena by analyzing anecdotes; for example, asking "why do certain artists like piracy?" can lead you to the basic economic analysis that I mentioned earlier. However, it ends at the insight: if you want to derive a conclusion about the population from that insight, you have to resort to more rigorous statistical methods.
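To make the sampling bias concrete, here's a little Javascript sketch with made-up numbers: suppose only 10% of musicians actually benefit from piracy, but the anecdotes you hear all come from that vocal minority.

```javascript
// Hypothetical population: 1000 musicians, only 100 of whom benefit from piracy.
var musicians = [];
for (var i = 0; i < 1000; i++) {
  musicians.push({ benefitsFromPiracy: i < 100 });
}

// True population proportion: 10%.
var trueShare = musicians.filter(function (m) {
  return m.benefitsFromPiracy;
}).length / musicians.length;

// Anecdotes aren't a random sample: suppose the only musicians you hear from
// are ones who benefit (they're the ones giving interviews about it).
var anecdotes = musicians.filter(function (m) {
  return m.benefitsFromPiracy;
}).slice(0, 5);

var anecdoteShare = anecdotes.filter(function (m) {
  return m.benefitsFromPiracy;
}).length / anecdotes.length;

console.log(trueShare);     // 0.1
console.log(anecdoteShare); // 1 - every anecdote "supports" the argument
```

Judging from the anecdotes alone, you'd conclude every musician loves piracy, when in this made-up population it's one in ten.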

May 15, 2010

Linux Schematic Software - gEDA

I recently had a need to do up a basic schematic on the computer and just save it. The first thing I attempted was Dia, which I've used before for basic software diagrams. It's half-decent, and when I did the software diagrams it pretty much did what I wanted it to. However, for schematics I found it isn't so great, and I thought there might be a better alternative.

I did a quick search in Synaptic for "schematic" to see what it would come up with. There are a whole bunch, and I went through a few of them. The best one that I found in there was gEDA, for the following reasons:
1) Massive library of components. I had a specific IC for the circuit I wanted to draw, and it was in the library. I was expecting to have to create some generic IC with a certain number of pins, but they had the one I wanted. Also, the library has a search feature, so you don't have to scan through to find the one you want.
2) Wires can be drawn easily, and as you want them. They also will snap to other wires and components. If the components have certain pins (like ICs) you can easily choose which pin you want to connect to.
3) Interface isn't too unwieldy. I would have a few suggestions (maybe I can add them as a little project) but other than that it is very functional and stays out of your way.

There might be other features you'd want, but these are the ones that I liked and needed for this specific case.

May 12, 2010

Why You Should or Shouldn't Use an Operating System

You should use Windows if:
- you respond more quickly to your WoW handle than to your given name
- you need to prove your 1337 h4xx0r skillz, yet can't figure out how to use Linux
- you want to use an operating system that doesn't look like it was designed by:
a) Fisher-Price, or
b) HAL 9000
- you work in a place where the manager buys the product from the salesman with the shiniest hair (or in the case of Steve Ballmer, the shiniest head)

You should not use Windows if:
- you compulsively click on every link you get on MSN or Facebook
- you ever use your credit card or something confidential online
- you want your computer to work right for over 6 months at a time without critical system files being somehow corrupted

You should use Mac if:
- you think a hard drive is something that involves at least 3 hours in a car
- your favourite letter is i, or your favourite colour is white
- you spend most of your time in a coffee shop

You should not use Mac if:
- you have a soul
- you want other people to like you after you talk to them about computers
- you're poor, like me

You should use Linux if:
- your first or second language is Klingon
- the idea of installing an operating system on a pacemaker gets you excited
- you want to stick it to the man

You should not use Linux if:
- you have friends
- you want your computer to work, period
- you think hexadecimal is a character from ReBoot

May 11, 2010

A Quick Look at Lucid

Alright, it's that time again! A new version of Ubuntu has been recently released, and I'm going to do a quick write-up about it.

I just installed it today and have been using it a bit to see how things go.

First thing to note is the new look. While I don't like the purple splash screen (purple isn't really my colour), I think everything else looks really good. Everything looks much more slick than before. I'm not really sure why they moved the window controls from the top-right corner to the top-left, but meh. If you want to change it back to the top-right, it's not hard.
The only issue with the new colour scheme is the interaction with Firefox's autocomplete - when you start typing a URL, I find the drop-down that appears a bit difficult to read, since it's light writing on a dark background sitting on top of a white page. I'd prefer a white background with dark writing, but that's just me.

Rhythmbox updates - I don't really like the new system-tray interface. I prefer the old one where you left-click and it opens the Rhythmbox window, and right-click gives you the context menu. Now it is left-click to get a context menu, and right-click doesn't do anything. My main problem is the inconsistency (right-click is context menu for everything else still) and that it now requires two clicks to do something that I do all the time. A little detail, but something that bugs me - I love efficiency.
Also the music toolbar applet (now known as Panflute) appears to be gone, which is a bit disappointing.
Finally, I think the Ubuntu Music Store is a good initiative, but I don't really have any other comments about it.

Other than that, I'm happy that it seems to be fast and stable. Good job Ubuntu devs!

May 10, 2010

Piracy Boosts Game Sales? Yeah right...

Some pirates in the Canadian Pirate Party are getting all excited about a certain paper, part of which says that file-sharers tend to buy more games more often than non-file-sharers.

First off, let's address some problems with the paper. The paper says that 61% of file-sharers have bought games in the last 12 months, vs. 57% of non-file-sharers, and takes this as evidence that file-sharers buy games more often. I say there isn't enough information here to tell who buys games more often. They don't report any measure of variance, so you can't actually say whether the difference between these two figures is statistically significant (this is an example of a difference of means).
The other numbers are 4.2 vs. 2.7, which represents the average number of games bought over the last year by file-sharers and non-file-sharers respectively. Again there is no mention of a variance measure, so you can't really take this difference seriously.
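For the curious, the check the paper should have reported is a simple difference-of-means test. Here's a sketch in Javascript - note that the standard deviations and sample sizes below are completely made up, since the paper doesn't provide them; only the means (4.2 and 2.7) come from the paper:

```javascript
// z-statistic for a difference of two sample means:
// z = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
function differenceOfMeans(mean1, s1, n1, mean2, s2, n2) {
  return (mean1 - mean2) / Math.sqrt(s1 * s1 / n1 + s2 * s2 / n2);
}

// The paper's means: 4.2 games vs. 2.7 games per year.
// These standard deviations and sample sizes are invented for illustration.
console.log(differenceOfMeans(4.2, 6.0, 300, 2.7, 4.0, 300));  // ≈ 3.6, significant

// With a larger spread and smaller samples, the exact same means
// are nowhere near significant:
console.log(differenceOfMeans(4.2, 15.0, 100, 2.7, 12.0, 100)); // ≈ 0.78
```

Same means, opposite conclusions - which is exactly why you can't take the difference seriously without the variances.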

Next, they claim that only 53% of their sample answered questions about games. If their sample was random before, it's probably not anymore! This is an example of a self-selection bias. It could be that all the people who pirate games and never buy them chose not to answer the question, or it could be that people who always buy them and never pirate also chose not to answer (this one I think is a bit more unlikely, but not impossible). Basically you no longer have a guarantee that the sample that actually answered the questions about games is a random sample.

The next problem is with the interpretation of the paper. Some people seem to think that this result shows that piracy causes more sales, and overall is a good thing for the gaming industry. This is a definite possibility, and the paper addresses it, attributing it to something called the sampling effect - like trying a free sample at Costco: trying the good might increase your demand for it. I completely agree with this; it's entirely true that piracy can increase the demand for a game because people get a chance to try the game before they buy it. I've personally experienced this: I pirated Half-Life 2 and Oblivion and then later bought them because they are awesome games.
However the effect can go the opposite way. A lot of games are only fun the first time you play them. What might happen is that the person will pirate the game, play it through, and then never want to play it again. They may have loved the game, but don't really want to shell out the cash to buy it and not play it anymore. An example for me was Spore - it was kinda fun the first time around, but after I realized that the game wasn't really interesting for very long I had no incentive to buy it (let's ignore DRM-related issues for the moment too). In this case I might also say piracy would be a good thing, since it gives game creators an incentive to make games that don't suck.

It is also possible that there is no causal effect at all between these two. It could be that those hardcore gamers out there who buy a lot of games also pirate a lot of games; while the casual gamer who maybe buys a game once or twice a year has no idea that he can pirate a game or has no desire to "break the law" for something he doesn't really care about.

Anyway none of these points prove the paper wrong, however they show that you shouldn't really trust the results. If I saw several more studies come up that actually show the data that points to the same result then I might be more convinced, but for now I'm very skeptical of whether or not this paper is valid.

May 9, 2010

Lego for Adults

Someone asked me recently, "why do you spend so much time programming random things that nobody will ever use?" While I didn't really have a good response at the time, the one that I like the most is, "why do kids spend so much time playing with Lego that nobody is ever going to use?" In short, programming random things is like Lego for adults. For me anyway.

The best part of the difference is that I don't have to take apart my software afterward in order to build something new. I can put it up somewhere just in case someone else comes along and wants to use it too.

May 8, 2010

Testing Javascript Apps with Chrome

While testing a Javascript app under Chrome today, I learned something interesting. If the HTML/JS file you're viewing in Chrome is located on the local drive (i.e. you're accessing it via file:///...) then you're not normally able to use AJAX to request other files on the local hard drive. I tried doing the following in jQuery:
$.getJSON("local/file/on/hard-drive", function (response){
  // blah blah blah
});
Every single time, the response variable would come out as null. I wasn't sure what happened, but after a quick Google search I learned that in order to make Chrome play nice with local files, you need to disable web security:
google-chrome filename.html --disable-web-security
After this it seems to work just fine! Now to fix the other bugs...

May 7, 2010

Computing P-values

A commenter named Blues asked me a question on my post about differences of means:
... could you explain how to do the math without R? ie, if I wanted to do the calculations on paper, or write this in javascript, what are the formulae I'd use, and what's the logic behind them?
This question is about calculating the p-value for a statistic.

Unfortunately this is not that simple a problem. If you're using the standard normal distribution, the equation looks something like this:
f(x) = 1/sqrt(2 * pi) * exp(-x*x/2)
It's not a terrible formula, however the main problem is that the p-value is an area under this curve, and this function has no anti-derivative. This means that in order to calculate the integral, you need to use numerical methods instead of the traditional way of solving the integral.

Single-variable integrals for nice continuous functions like this are not very difficult at all using numerical methods. It amounts to just adding up the area of boxes. The smaller the width of your boxes, the more precise you will be.

In our case, we're trying to calculate a p-value. The first thing you need to decide is if you're doing a one-tailed or a two-tailed test. In a two-tailed test, you're saying in your null hypothesis that the thing you are trying to test is equal to some certain value, call it c. Therefore if your statistic is either much lower or much higher than c, you reject the null hypothesis. In a one-tailed test, you're using an inequality in your null hypothesis. You'd say something like, "the value I'm trying to estimate is less-than or equal to c." Therefore you'd only reject the null hypothesis if the test statistic was much higher than c.

In the end, there isn't a huge difference in the math. You compute some value, and in a two-tailed test you'll multiply this value by 2.

So what is the code to do this? Well we need to first define our problem. What we're doing is the more general concept of integration of a function between two values. The Javascript version (I'll use Javascript because that was the language asked in the comment) looks like this:
function integrate(a, b, f){
  // what we are trying to do is find the area under f in between a and b
  // assume that b > a
  var dx = (b - a) / NUM_BOXES; // NUM_BOXES is some large constant
  var x = a;
  var sum = 0;

  while (x < b){
    // add on the area of the box
    sum += dx * f(x);
    x += dx;
  }
  return sum;
}
This algorithm is pretty good, but in the case of a decreasing function it will always over-estimate the area, and for an increasing function it will always under-estimate the area. You can get a better estimate by using a mid-point:
sum += dx * f(x + dx / 2);
It's pretty obvious that the larger NUM_BOXES is, the more precise your result will be. However it will come at a cost, the number of times the loop is executed depends on the value of NUM_BOXES. For our p-value, what we want to do is calculate the area under the normal distribution from the absolute value of the calculated statistic (from now on I'll call this value z) to positive infinity. Unfortunately since we can't go to positive infinity, we'll have to stick to some high number. Since the normal distribution drops very quickly, once you're past a value like 5 you have a pretty decent approximation. If you want more precision, go higher. All that said, you'd calculate your p-value like this:
pvalue = integrate(Math.abs(z), 5, function (x){
  return 1 / Math.sqrt(2 * Math.PI) * Math.exp(-x * x / 2);
});
If it is a two-tailed test, just multiply this value by 2. So this is for the normal distribution, but it wouldn't be too hard to figure it out for more advanced distributions like χ² or t. You just need to plug in the correct formula for each distribution and then off you go.
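To tie it all together, here is a self-contained version of the above (using the midpoint rule), checked against the classic result that z = 1.96 corresponds to a two-tailed p-value of about 0.05:

```javascript
var NUM_BOXES = 100000;

function integrate(a, b, f) {
  // area under f between a and b, using the midpoint rule
  var dx = (b - a) / NUM_BOXES;
  var sum = 0;
  for (var i = 0; i < NUM_BOXES; i++) {
    sum += dx * f(a + (i + 0.5) * dx);
  }
  return sum;
}

// standard normal density
function phi(x) {
  return 1 / Math.sqrt(2 * Math.PI) * Math.exp(-x * x / 2);
}

// p-value: area from |z| out to "infinity" (5 is far enough for decent precision)
function pValue(z, twoTailed) {
  var p = integrate(Math.abs(z), 5, phi);
  return twoTailed ? 2 * p : p;
}

console.log(pValue(1.96, true)); // ≈ 0.05, the classic 5% threshold
```

The same `integrate` function works unchanged for other distributions - you'd just swap `phi` for the appropriate density.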

May 6, 2010

Faster Background Rendering in HTML5 Canvas

One thing that is common in video games and other graphical applications is generating some sort of background image. Generating it may not be a fast task, so re-rendering it every frame is not a feasible option.

A common technique to use instead is to render to an off-screen surface that works as a kind of cache. If you're programming using Javascript and the HTML5 canvas element, an off-screen surface is just another canvas that is not being displayed.

To render to an off-screen canvas:
var canvas = document.createElement("canvas");
canvas.width = width;
canvas.height = height;

var context = canvas.getContext("2d");

// rendering code
How do you go about getting this onto your main canvas? It's really simple. As it turns out, the drawImage() function of the canvas context object can take not only image objects, but other canvas objects. So you would just go:
var context = mainCanvas.getContext("2d");

context.drawImage(offscreenSurface, ...);
You can get more details on the different ways the surface can be rendered by looking at the specs for the canvas. A great thing about some of the options is that you can have the background of the entire world (if the world is not gigantic) rendered to an off-screen surface, and only render a certain portion of it to your main canvas. This makes it nice and easy for you to implement some kind of scrolling mechanism.

This is very applicable for games, since many 2D games use some kind of tiled background, and need to render it 30+ times per second. Unfortunately if you have say 400 or so tiles visible at a time, using a basic drawImage() for each tile is too slow. You'd need to use some sort of off-screen surface to do this.
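The scrolling mechanism mostly comes down to choosing which rectangle of the off-screen world to copy each frame. Here's a sketch of that bookkeeping (the function name and the clamping behaviour are my own choices, not anything from the canvas spec):

```javascript
// Given a camera position in world coordinates, compute the source
// rectangle to pass to drawImage(), clamped so it stays inside the world.
function sourceRect(cameraX, cameraY, viewW, viewH, worldW, worldH) {
  var x = Math.max(0, Math.min(cameraX, worldW - viewW));
  var y = Math.max(0, Math.min(cameraY, worldH - viewH));
  return { x: x, y: y, w: viewW, h: viewH };
}

// Then each frame, the whole background is a single call using the
// 9-argument form of drawImage (source rect, then destination rect):
// context.drawImage(offscreenSurface, r.x, r.y, r.w, r.h, 0, 0, r.w, r.h);

var r = sourceRect(500, 300, 640, 480, 2048, 2048);
console.log(r); // { x: 500, y: 300, w: 640, h: 480 }

// A camera past the world edge gets clamped:
console.log(sourceRect(5000, 5000, 640, 480, 2048, 2048)); // x: 1408, y: 1568
```

One `drawImage` call per frame instead of 400 per-tile calls is the whole payoff of the off-screen cache.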

May 4, 2010

Fractal Trees

Fractal trees are a simple type of fractal, and illustrate a nice example of a recursive process.
You start off at a point moving in a certain direction (usually up, since that's the way trees go). After going a certain distance, you split into two and draw each branch. You repeat this a certain number of times until you have a large tree structure.
In this example, each time it branches I decrease that certain distance by a random amount, and the angle between the two new branches gets decreased each time.

Here's a picture of some fractal trees:

If you want to see how this works, you can see my submission on Rosetta code. It is written in C and uses some simple linear algebra techniques for rotation, so you might have to know a bit about rotation matrices in order to fully understand what is going on.
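My Rosetta Code submission is in C, but the branching process is easy to sketch in Javascript too. This version just computes the line segments instead of drawing them, and uses a fixed shrink factor rather than a random one so the output is predictable:

```javascript
// Recursively build the line segments of a fractal tree.
// Each branch splits into two shorter branches at +/- spread/2.
function tree(x, y, angle, length, spread, depth, segments) {
  var x2 = x + length * Math.cos(angle);
  var y2 = y + length * Math.sin(angle);
  segments.push({ x1: x, y1: y, x2: x2, y2: y2 });
  if (depth > 0) {
    // shrink the branches and narrow the split at each level
    tree(x2, y2, angle - spread / 2, length * 0.75, spread * 0.8, depth - 1, segments);
    tree(x2, y2, angle + spread / 2, length * 0.75, spread * 0.8, depth - 1, segments);
  }
  return segments;
}

// Trunk pointing up (-90 degrees in canvas coordinates), 3 levels of branching:
var segments = tree(0, 0, -Math.PI / 2, 100, Math.PI / 3, 3, []);
console.log(segments.length); // 15 segments: 1 + 2 + 4 + 8
```

To draw it, you'd just loop over `segments` and stroke each line on a canvas - no rotation matrices needed, since the angle arithmetic does the same job here.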

May 3, 2010

Introducing MinMVC

A long time ago I announced that I had built up a PHP framework for my job that was designed for high-traffic websites. It was modeled after Ruby on Rails, but let you scale a bit more easily using things like memcached and MySQL master-slave configurations (Rails probably supports these features nowadays, I just haven't paid much attention to Rails in about a year after I stopped working with it).

I finally decided to get off my ass and remake this framework (I couldn't use the old version because it was not my intellectual property, however there isn't anything stopping me from just using the same ideas). Unfortunately the current incarnation doesn't really have any of those nice scaling features, although I don't think they'd be terribly difficult to add. Adding master-slave configurations would just require a bit of tweaking on the database connection class, and the cached-model-identity-map system that it had would require making a subclass of Model that uses memcached in addition to the database.

You can grab the current version from Github here. I will warn you though, use at your own risk! It isn't very well tested yet and will probably have security holes and random bugs. Also there isn't much by way of docs yet, so you'll have to figure out how it works on your own. If you've worked with Rails before then it shouldn't be too bad, it's just a matter of figuring out what you can't do anymore (like associations, migrations, lots of helpers, support for anything other than MySQL, ...).

What's the selling point? There isn't really one. This framework is pretty simple. It doesn't have many features, and it doesn't really enforce too much. It has a basic MVC structure, pretty URLs, a simple ORM layer, and...well not really anything else. The focus is on getting out of your way so that you can do what you need to do.

Anyway if you want to fiddle around with it feel free, and if you find bugs/problems or you have a feature request or something, please post it on the Github issues page.

May 1, 2010

AJAX in 5 minutes using jQuery

One thing that really amazes me is how people can write an entire book on AJAX. It's a truly simple thing. It shouldn't even be a topic on its own; it should be a subchapter in a book on Javascript or something. Anyway, I'll show you how to use AJAX in 5 minutes using jQuery. You should usually use a library like jQuery (or others like Prototype or Mootools - use whichever you like the most), since using AJAX with plain old Javascript is a tedious process with lots of repetition.

As a prerequisite to actually using AJAX in practice, you should know how to do some server-side programming using something like PHP, Ruby on Rails, etc. You should also be a bit familiar with Javascript.

All AJAX does is make a little request to the server without reloading the page. Here's some code:
$.get("profile.php", { username: "rob" }, function (response){
  // put the HTML content of profile.php into a div with id = "profile_page"
  $("#profile_page").html(response);
});
This simple example makes a request to profile.php with the parameters username = "rob", and drops the text into a div on the current page. This is effectively the same as accessing profile.php?username=rob directly, except without refreshing the page.

You can also do POST requests for mutable actions:
$.post("login.php", { username: "rob", password: "secret" }, function (response){
  if (response == "good"){
    alert("You have logged in successfully.");
  } else {
    alert("Invalid username/password");
  }
});
The login action here would just output "good" if the login worked, or "bad" if the login didn't work.

There, you're now an AJAX master! Of course it would be better to learn by trying it out for yourself, but that's the basics of it all.

A Little Intro to Time Series

One thing you might have never heard about in an introductory statistics class is this thing called time series. It's a pretty simple idea, it's just a collection of measurements of a variable over time. Lots of economic variables are time series variables: GDP, nominal prices/wages, stock markets, etc. Also, lots of variables within the computer world are time series: CPU clock speeds, hard drive capacities, number of programmers in the marketplace, etc.

These types of variables are interesting because they have certain properties. The main one is that they are highly correlated with one another when the observations are close together - for example, the average clock speed for a CPU sold in a certain year is going to be pretty correlated with the average clock speed for a CPU sold in the year after.
Compare this to something that isn't a time series, like the time it takes you to run a piece of code. Comparing the first test to the second test isn't really any different than comparing the first test to the hundredth test. However comparing CPU speeds between say 2001 and 2000 is a lot different than comparing CPU speeds between 2001 and 1970.

It turns out that this really messes up your results when you want to use linear regression to analyze relationships. I went off to Statistics Canada to get some data for an example (I can't publish the data here, since it isn't mine and it isn't open - you can change this by helping out with the open data movement). I took two variables and checked out their relationship to one another over time, from 1980 to 2008. Using my trusty StatsAn tool, I was able to fit a line between the two variables with a very strong level of statistical significance. I know it was strong because the t-statistic for this variable was 18.8 - the probability of getting this t-statistic with this sample size, if there is no relationship, rounds to 0 at 6 decimal places. Yikes! What were the variables that could have had such a strong correlation?

The dependent variable in this example is the population of Canada, and the independent variable is the logarithm of the capacity of hard drives in gigabytes. According to my analysis, the capacity of hard drives over the last 30 years has been a very strong driver of the Canadian population growth.


I think something is wrong here.

And in fact, there is something quite wrong here. The thing that is very wrong is an omitted variable bias, and in this case the omitted variable is a rather unique one - time. Consider the following two models. The first one I did is this:
population_of_Canada = β0 + β1 ln(hard_drive_capacity)
and the second one is this:
population_of_Canada = β0 + β1 ln(hard_drive_capacity) + β2 year
The added variable here is the year of the measurement. When you add year to the mix, the t-statistic for β1 goes from 18.8 to 0.65. With this sample size the probability of getting a t-statistic like this is over 50% when the relationship between hard-drive capacity and the Canadian population is non-existent (assuming that our model is correct). That's a pretty high probability! So basically the portion of Canadian growth that was actually due to time was being attributed to the increase in hard drive capacities over the years - thus giving us a strong positive link between the two.

Anyway the moral of this story is a fairly obvious fact: it is easy to find a "relationship" between two variables over time when you're not compensating for the fact that those variables are changing over time.
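To see the problem in miniature, here's a Javascript sketch with fabricated data (not the Statistics Canada numbers, which I can't publish): two series that do nothing but trend upward over time come out almost perfectly correlated, even though neither has anything to do with the other.

```javascript
// Pearson correlation of two equal-length arrays
function correlation(xs, ys) {
  var n = xs.length;
  var mx = 0, my = 0;
  for (var i = 0; i < n; i++) { mx += xs[i] / n; my += ys[i] / n; }
  var sxy = 0, sxx = 0, syy = 0;
  for (var j = 0; j < n; j++) {
    sxy += (xs[j] - mx) * (ys[j] - my);
    sxx += (xs[j] - mx) * (xs[j] - mx);
    syy += (ys[j] - my) * (ys[j] - my);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Two made-up series from 1980 to 2008: a population growing roughly
// linearly, and log hard-drive capacity growing linearly (Moore's-law-ish).
var population = [], logCapacity = [];
for (var year = 1980; year <= 2008; year++) {
  population.push(24.5 + 0.3 * (year - 1980)); // millions, fabricated
  logCapacity.push(-3 + 0.4 * (year - 1980));  // log GB, fabricated
}

console.log(correlation(population, logCapacity)); // ≈ 1: both are pure time trends
```

Both series are just functions of the year, so of course they move together - which is the whole point: correlation over time tells you nothing until you control for time itself.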