May 1, 2010

A Little Intro to Time Series

One thing you might never have heard about in an introductory statistics class is this thing called a time series. It's a pretty simple idea: it's just a collection of measurements of a variable over time. Lots of economic variables are time series variables: GDP, nominal prices/wages, stock markets, etc. Also, lots of variables within the computer world are time series: CPU clock speeds, hard drive capacities, number of programmers in the marketplace, etc.

These types of variables are interesting because they have certain properties. The main one is that observations taken close together in time are highly correlated with each other - for example, the average clock speed for a CPU sold in a certain year is going to be pretty correlated with the average clock speed for a CPU sold in the year after.
Compare this to something that isn't a time series, like the time it takes you to run a piece of code. Comparing the first test to the second test isn't really any different from comparing the first test to the hundredth test. However, comparing CPU speeds between say 2001 and 2000 is a lot different than comparing CPU speeds between 2001 and 1970.
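
To make that concrete, here's a minimal sketch with simulated data (the series and the numbers are made up, not real CPU figures). It checks how correlated each observation is with the next one for a trending series, and for a batch of independent measurements where the ordering doesn't matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fake "CPU clock speed by year" series: an upward trend plus some noise.
years = np.arange(1970, 2010)
clock_speed = 0.1 * (years - 1970) + rng.normal(0, 0.2, size=years.size)

# Lag-1 correlation: each year's value against the next year's value.
lag1_trend = np.corrcoef(clock_speed[:-1], clock_speed[1:])[0, 1]
print(f"trending series, lag-1 correlation: {lag1_trend:.2f}")  # close to 1

# Independent measurements (e.g. repeated timings of the same piece of code):
timings = rng.normal(5.0, 0.3, size=years.size)
lag1_iid = np.corrcoef(timings[:-1], timings[1:])[0, 1]
print(f"independent timings, lag-1 correlation: {lag1_iid:.2f}")  # near 0
```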

It turns out that this really messes up your results when you want to use linear regression to analyze relationships. I went off to Statistics Canada to get some data for an example (I can't publish the data here, since it isn't mine and it isn't open - you can change this by helping out with the open data movement). I took two variables and checked out their relationship to one another over time, from 1980 to 2008. Using my trusty StatsAn tool, I was able to fit a line between the two variables with a very strong level of statistical significance. I know it was strong because the t-statistic for this variable was 18.8 - the probability of getting a t-statistic like that with this sample size, if there were no relationship, rounds to 0 at 6 decimal places. Yikes! What were the variables that could have had such a strong correlation?

The dependent variable in this example is the population of Canada, and the independent variable is the logarithm of the capacity of hard drives in gigabytes. According to my analysis, the capacity of hard drives over the last 30 years has been a very strong driver of Canadian population growth.

Hmm.

I think something is wrong here.

And in fact, there is something quite wrong here. The thing that is very wrong is omitted variable bias. However, in this case the omitted variable is a rather unusual one: time. Consider the following two models; the first is the one I originally fit:
population_of_Canada = β0 + β1 · ln(hard_drive_capacity)
population_of_Canada = β0 + β1 · ln(hard_drive_capacity) + β2 · year
The added variable in the second model is the year of the measurement. When you add year to the mix, the t-statistic for β1 goes from 18.8 to 0.65. With this sample size, the probability of getting a t-statistic like that is over 50% even when the relationship between hard drive capacity and the Canadian population is non-existent (assuming that our model is correct). That's a pretty high probability! So basically, the portion of Canadian population growth that was actually due to time was being attributed to the increase in hard drive capacities over the years, giving us a seemingly strong positive link between the two.
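
Since I can't share the Statistics Canada data, here is a rough sketch of the same effect using two simulated series that both trend upward but are otherwise completely unrelated (the numbers are invented, and it assumes the statsmodels library is available). Regressing one on the other gives a huge t-statistic; adding a year term makes it collapse:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
years = np.arange(1980, 2009)

# Two made-up series that both trend upward but have nothing to do with each other.
population = 24_000_000 + 300_000 * (years - 1980) + rng.normal(0, 200_000, size=years.size)
log_capacity = -5.0 + 0.45 * (years - 1980) + rng.normal(0, 0.5, size=years.size)

# Model 1: population on log(hard drive capacity) alone.
X1 = sm.add_constant(log_capacity)
fit1 = sm.OLS(population, X1).fit()
print("t-statistic, capacity only:", round(fit1.tvalues[1], 1))  # large and "significant"

# Model 2: add the year of the measurement as a regressor.
X2 = sm.add_constant(np.column_stack([log_capacity, years]))
fit2 = sm.OLS(population, X2).fit()
print("t-statistic, with year added:", round(fit2.tvalues[1], 2))  # drops toward 0
```

Once the time trend is in the model, the "relationship" between the two series is left to explain only the noise, which is why the t-statistic falls apart.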

Anyway, the moral of this story is a fairly obvious one: it is easy to find a "relationship" between two variables measured over time when you're not compensating for the fact that both of those variables are changing over time.
