Nov 22, 2010

Stack Overflow

Now is the perfect time to stop using Stack Overflow unless my reputation score changes:

Nov 17, 2010

Software Metrics and Instrumental Variables

I recently read this article about how software metrics are mostly useless and tend to cause more problems than they solve. This reminded me of a topic in stats which apparently has some application in software development.

The use of software metrics is an example of a statistical technique called instrumental variables. Often when you want to understand some phenomenon or relationship, you run into problems because many factors are unobservable. This means that they are not concrete things that you can stamp a number on to get a clear measurement. One example that constantly crops up in both economics and software development is ability. A person's ability can have extremely strong effects on other factors such as productivity, wage, etc. However, you can't really come up with a solid measurement for a person's ability: suppose some programmer you know has an ability of 10. What does that mean?
Compare that to a metric like lines of code per hour. A measurement of 10 has a very clear and concrete meaning: the changes they made in that hour added 10 newlines to the code.

This is where instrumental variables come in. An instrument is an observable variable that you use in place of the unobservable one, and it has to satisfy two conditions: it must be correlated with the variable that it is standing in for, and it can't be correlated with random errors or other omitted factors. The power of the instrument depends on the strength of the correlation between the instrument and the unobserved variable. This is why people use things like lines of code per hour, years of experience, etc. when attempting to measure the productivity of a programmer.

Unfortunately there are a number of shortcomings with the instrumental variables method. The biggest issue is finding a good instrument. We know the criteria required for a good instrumental variable (there are all sorts of mathematical proofs that you can look up if you like), but that doesn't mean that any instrument satisfying them actually exists. On top of that, people who know which metric you're using can attempt to cater to it - thus introducing a correlation between the instrument and other omitted variables, like their ability to cater to metrics.
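To make the "catering to the metric" problem a bit more concrete, here is a rough simulation sketch in Python. Everything in it is an assumption made up for illustration: the names `ability`, `gaming` and `noise`, the normal distributions, and the idea that the metric is simply a sum of those parts. It is not a real model of programmers, just a way to see the two conditions break.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical unobservable factors (assumed independent of each other).
ability = [rng.gauss(0, 1) for _ in range(n)]   # what we care about
gaming = [rng.gauss(0, 2) for _ in range(n)]    # skill at catering to metrics
noise = [rng.gauss(0, 0.5) for _ in range(n)]   # measurement noise

# Nobody games the metric: it only reflects ability plus noise.
honest_metric = [a + e for a, e in zip(ability, noise)]

# People know the metric and cater to it: gaming leaks into the number.
gamed_metric = [a + g + e for a, g, e in zip(ability, gaming, noise)]

corr_honest = pearson(honest_metric, ability)
corr_gamed = pearson(gamed_metric, ability)
corr_honest_gaming = pearson(honest_metric, gaming)
corr_gamed_gaming = pearson(gamed_metric, gaming)

print("honest metric vs ability:", round(corr_honest, 3))
print("gamed metric vs ability: ", round(corr_gamed, 3))
print("honest metric vs gaming: ", round(corr_honest_gaming, 3))
print("gamed metric vs gaming:  ", round(corr_gamed_gaming, 3))
```

Under these assumptions, the gamed metric tracks ability less closely and picks up a strong correlation with the omitted gaming factor, which is exactly the situation the second condition is supposed to rule out.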

In short, the problem is not entirely with the method, but more with finding good instruments. Unfortunately if you can't find a good instrument, you'll have to resort to a different method. According to a stats professor of mine, one method for dealing with unobservable factors is to use what's called a mixture model. It supposedly works, however the procedure appears to be much more complicated and can be less precise than having a good instrument. I'm still working on figuring out how to do this sort of thing, so perhaps I'll talk a bit more about it another day.

Nov 16, 2010

Parse vs. TryParse

I was having a problem not too long ago where a specific bit of code was abysmally slow. I scanned through it to understand theoretically why it might be slow, and then ran it through a profiler to see where it was spending most of its time. The results were very strange: most of the time was spent parsing strings into integers and decimals (for those who have never done .NET, it has a decimal type in addition to floats and doubles, which is better for processing financial data - with decimals, 0.1 + 0.7 actually does equal 0.8). This was very odd, so I figured I'd look at it in more detail.
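As a quick aside on the 0.1 + 0.7 remark, here is a small sketch using Python's `decimal.Decimal` as a stand-in. It is an analogue of the idea, not the .NET decimal type itself, and the variable names are just for illustration.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.7 exactly,
# so the sum drifts slightly away from 0.8.
float_sum = 0.1 + 0.7
print(float_sum == 0.8)   # False

# A base-10 decimal type stores these values exactly.
decimal_sum = Decimal("0.1") + Decimal("0.7")
print(decimal_sum == Decimal("0.8"))   # True
```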

It turns out that in .NET, the regular Parse() functions throw an exception (a FormatException) when the string is not well-formed. If the string is "N/A", then an exception is thrown. This is a common occurrence in the data I was receiving, so I was just catching the exception and setting things to null right away. The problem was that this code was being executed hundreds of times a second, so the overhead from exception throwing was adding up!

In high-performance situations, you're better off using a function like TryParse(), which takes the int/decimal variable as an out parameter and returns true if the parse was successful. It does not throw an exception on malformed input and thus is much faster - I saw a massive performance increase after switching!

While my example here is with .NET, it applies to pretty much any language where a string to int/float method throws an exception when the string is malformed. If your language has something like TryParse, it is definitely recommended over Parse for high-performance situations!

Or an even better moral: don't throw exceptions in code that needs to be fast!
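Since the same idea carries over to other languages, here is a rough Python sketch of the two styles. Python has no built-in TryParse, so `try_parse_int` is a hypothetical helper written for this example, and `parse_with_exception` mirrors the catch-and-set-to-null approach described above. Note that the helper is stricter than `int()`: it only accepts an optional sign followed by ASCII digits.

```python
import timeit
from typing import Optional


def parse_with_exception(text: str) -> Optional[int]:
    """Parse-style: let the conversion raise, then catch and return None."""
    try:
        return int(text)
    except ValueError:
        return None


def try_parse_int(text: str) -> Optional[int]:
    """TryParse-style (hypothetical helper): validate first, never raise."""
    stripped = text.strip()
    # Allow one leading sign, then require plain ASCII digits.
    digits = stripped[1:] if stripped[:1] in ("+", "-") else stripped
    if digits.isascii() and digits.isdigit():
        return int(stripped)
    return None


if __name__ == "__main__":
    # Mostly malformed input, like a feed full of "N/A" values.
    samples = ["N/A"] * 90 + ["42"] * 10

    for name, func in (("exception", parse_with_exception),
                       ("try-parse", try_parse_int)):
        seconds = timeit.timeit(lambda: [func(s) for s in samples], number=1000)
        print(name, round(seconds, 4), "seconds")
```

How big the gap is depends on the language, the runtime, and how often the input is malformed, so it's worth timing it on your own data rather than trusting the numbers from someone else's machine.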

Nov 11, 2010

Why I Prefer Linux Servers

Say what you will about the quality of Linux servers vs. Windows servers, etc. etc. but here is one huge reason why Linux servers are better (all prices in USD):

Windows Server 2008 Licence with 5 users (not accessible through Remote Desktop): $1,029
To enable 5 users with Remote Desktop: $749
To add 5 more users: $199
To allow those 5 more to use Remote Desktop: $749
Total for 10 users with Remote Desktop: $2,726

Ouch.

Nov 7, 2010

Writing Blog Posts

Supposedly this month is NaBloPoMo, which means "National Blog Posting Month", where you are supposed to post one article a day for a month. Besides already having failed at it, I'm not sure how I feel about this idea. You shouldn't write a post just for the sake of writing a post; you should have a reason to write it. So I've decided to write about some of the reasons why I post:

1) There's something that interests me. Typically this is something science-y or math-y or computer-y or whatever. This is the #1 reason to post, since you can actually write good articles this way - when you're interested in something, you tend to put a fair bit more effort into the post than you would otherwise. On top of that if something interests you then chances are it will interest other folks, so it's a good way to build up a readership which in turn can give you more ideas through comment feedback, link sharing, etc.

2) I've found some information that I want to share with people. This is usually little updates like FireSheep or little howtos like installing Rubygems on Ubuntu. For the latter it is usually, "I've figured out how to do X with technology Y (e.g. Ubuntu, Ruby), it was a pain in the ass so this is how to do it the easy way." Interestingly enough, these kinds of posts generate the majority of this blog's traffic. This type of post is good if you want a steady stream of traffic and if you feel all warm inside when people leave comments saying "thank you so much!"

3) Jokes I've thought of. These ones are rare since I'm not much of a funny guy, but maybe you can come up with this type of thing more often!

Anyway, that covers most of my reasons for posting. The final reason is rants, but I think the Internet is full enough of that kind of post, so I won't encourage it here - however I will probably still end up ranting now and then...

So for those of you wanting to do NaBloPoMo, I hope this helps!