Feb 24, 2010

The Many Axes of Scalability

Many times I've heard talk about scalability, and questions about whether technology X can scale, etc. For the most part I've taken the meaning of this word for granted, but seemingly like many things in the computer world it has multiple definitions and when people talk about scalability they are sometimes referring to different things. So this here is my attempt to document some of the definitions of scalability. This is by no means an exhaustive or perfect list, so feel free to comment with a definition of your own.

The first two are more technical in nature, as they refer to the system itself. The last two are more social in nature, but they are important nonetheless - I've heard people refer to the scalability of a system or language when they were really talking about one of these two human definitions.

The first is dataset size. This definition of scalability refers to how the system responds to an increase in the size of the dataset. If we increase the size of the dataset by an order of magnitude, will the time it takes to process things also increase by an order of magnitude? Or will it be higher, or lower? For example, a system with O(n) processing time will see its processing time grow by an order of magnitude, whereas a system with O(log n) processing time will only see a constant increase for each order of magnitude.
In order to improve this sort of scalability, you usually need to change the algorithm you're using. This requires analysis of both your code and the libraries you've built upon to determine the computational complexity relative to your dataset. When it comes to your code, the language you use usually isn't a huge issue compared to your algorithm (although if you're already using a good algorithm you would probably see some performance increase moving to a faster language). When it comes to your libraries, the most important aspect is the ability to replace the slow portions. Open-source libraries are great here; otherwise a good modular design where you can swap out components if necessary is a good bet.
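To make the order-of-magnitude idea concrete, here's a rough timing sketch in R (the sizes and repetition counts are made up purely for illustration) comparing a linear scan, which is O(n), with a binary search via findInterval(), which is O(log n), as the dataset grows by factors of ten:
for (n in 10^(4:7)) {
  xs <- sort(runif(n))
  target <- xs[n]   # worst case for the linear scan
  linear <- system.time(for (i in 1:50) which(xs == target)[1])["elapsed"]
  binary <- system.time(for (i in 1:50) findInterval(target, xs))["elapsed"]
  cat(sprintf("n = %8d   linear scan: %6.3fs   binary search: %6.3fs\n",
              as.integer(n), linear, binary))
}
The linear scan's time grows roughly tenfold at each step, while the binary search barely moves - that difference in growth, not the absolute numbers, is what this kind of scalability is about.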

The second is traffic. You don't need to be working with large datasets to see scalability problems; if your system is getting hit too quickly, other problems may occur. This was the case when I was working on Pornhub: we didn't have huge datasets, but the rate at which the database was being hit was the main cause of scalability problems.
In this case you basically have a lower bound, which is the processing time of the simplest action you can perform. If requests arrive faster than they can be processed at this lower bound, you will begin to notice traffic issues.
A few ways to alleviate this problem are:
1) Concurrency - set things up so that the system can handle multiple requests at the same time. This is a decent way to scale, but it has diminishing returns. Going from one to two concurrent processing units will double the amount of traffic you can handle (provided good load-balancing, of course), however going from two to three adds proportionally less, and going from 100 to 101 has little benefit at all. There will always be some benefit, but the effect gets much smaller as you increase the number of processing units.
A problem with increasing the number of processing units is managing the interactions between them. At some point it's possible that the cost of managing concurrent units will outweigh the benefit of having them.
2) Reducing the lower bound - the lower bound is determined by the slowest aspect of your system (commonly known as a bottleneck). There is no formula for reducing the lower bound other than identifying the bottlenecks and finding a way around them. Sometimes you can't completely get around a bottleneck, however you can reduce the amortized lower bound by avoiding that bottleneck as much as possible. An example of this is caching - your lower bound is usually a slower medium like a hard-drive, database, or the Internet, and you can reduce the amortized lower bound by storing a temporary version of your data in a faster medium like RAM (or in the case of a processor, in cache memory).
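As a tiny illustration of the caching idea, here's a minimal memoization sketch in R, where slow_lookup is a hypothetical stand-in for whatever your slow medium is (a database query, a disk read, an HTTP request):
cache <- new.env()
cached_lookup <- function(key) {
  if (!exists(key, envir = cache, inherits = FALSE)) {
    # cache miss: pay the cost of the slow medium (the hypothetical slow_lookup) once
    assign(key, slow_lookup(key), envir = cache)
  }
  # cache hit: served straight out of memory
  get(key, envir = cache, inherits = FALSE)
}
Each key hits the slow medium at most once; every later lookup is served from RAM, which is what pulls the amortized lower bound down.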

A third definition is team size. If you hold all else equal and increase the number of programmers on a team, each added programmer contributes less than the one before (in economics this is called diminishing marginal productivity of labour, and it's observed in many fields besides programming). In fact you can get to a point where adding programmers to a project produces a net drop in productivity, since the existing programmers have to spend time getting the new programmer up to speed - hence the programming proverb, "adding manpower to a late project makes it later".
The rate of decrease depends on many factors, including the nature of the project, the skill/experience of the programmers involved, the design of the system, and the language/framework being used (listed in what I believe, but cannot prove at the moment, to be the order of magnitude of their effect, from most important to least). However the rate of decrease tends to be specific to the project, so to improve this sort of scalability you will need to analyze your own project and determine what can be done.

A fourth definition, related to the third, is scalability with respect to programmer ability and experience. Keep the number of programmers constant but replace a good programmer with a worse one, and the productivity of the team will obviously change. As a side note, you don't usually fire good programmers and replace them with bad ones; what I mean by this sort of scalability is the effect that the ability/experience of a new member has on the productivity of the team when you add someone (hence why I said this is related to the third definition).
A system is more scalable in this regard if, when you decrease the ability/experience of a programmer on the team, the drop in the team's productivity (assuming, of course, that productivity and ability/experience are positively correlated) is smaller than it would be in a less scalable system.
What can you do to improve this sort of scalability? As before, it depends on a large number of factors, including the nature of the project, the development practices, the design of the system and the language/libraries used[1]. And as with the previous definition, the best way to improve it is to analyze your own project.

In order to improve any of these types of scalability, the most important thing is analytical skill. Being able to look at your system and work out what is limiting it is a necessary condition for improving each type of scalability. If you cannot analyze the system properly, then you don't have much hope of improving it. You can always just follow other people's advice and good practices (and it is a good idea to do this regardless), but they are not guaranteed to help you in all cases.
On the other hand, good analytical skills alone are not sufficient to improve the scalability of a system. It's possible that you have actually exhausted the possible improvements, and there is a fundamental limit to what you are trying to do.
Finally, you cannot always have the best of every type at once. Sometimes you have to reduce one type of scalability in order to improve another. It's possible that in order to make your project easier to manage for more people, it has to become slower and less memory-efficient. This is not guaranteed, but it is possible.

Anyway as I mentioned before this is not an exhaustive list of the definitions of scalability, but a list of a few of them that I've heard people talk about. Feel free to add your comments below.

[1] I know that some people will hate me for saying this, and I can't really back it up with data, but my suspicion is that, all else equal, dynamically typed languages like Ruby and Python do not scale in this regard as well as statically typed languages. I say this because the techniques for avoiding typing problems in dynamically typed languages rely on the programmer's discipline and experience with this style of programming instead of an automatic process, and an automatic process is much more reliable. Feel free to argue this, although if you don't have a good argument I'll probably ignore you.

Feb 18, 2010

Github Issue Closing

A little thing I discovered today: when you push something to Github and you're using their issue tracker, you can put "fixes #X" (X being the issue number) in the commit message and Github will close the issue. I know Redmine does this with SVN, and it's cool to see Github does it too.

Feb 15, 2010

Statistics For Programmers III: Differences of Means

This entry is part of a larger series of entries about statistics and an explanation on how to do basic statistics. You can see the first entry here. Again I will say that I am not an expert statistician, so feel free to pick apart this article and help bring it closer to correctness.

One thing that I've noticed programmers (including myself) do when comparing two versions of code is run each version once and compare the times. If one run is faster, they conclude that that version is faster and go with it.

It is possible that this conclusion is correct, that the faster one is indeed actually faster, however the evidence here is not very good. It is also highly possible that the faster one is only faster because some randomness in the execution time made it seem faster. Take a look at this picture:


Suppose this is the graph of the runtimes of two versions of the same function. The variance is the same for both at 16 (so the standard deviation is 4, meaning a typical run lands about 4 away from the mean). The green version of the function has an average runtime twice that of the red version, meaning it is twice as slow. However it is entirely possible that your single test of the green function lands in its left tail and your single test of the red function lands in its right tail, in which case you'll wrongly conclude that the green version is faster. (Note that another good idea here would be to try to reduce the variance of your function.)
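To put a rough number on that, here's a quick simulation in R - the means of 5 and 10 are made up, only the shared variance of 16 comes from the picture:
set.seed(1)
red   <- rnorm(100000, mean = 5,  sd = 4)   # hypothetical runtimes of the red (faster) version
green <- rnorm(100000, mean = 10, sd = 4)   # hypothetical runtimes of the green (slower) version
mean(green < red)   # fraction of one-off comparisons where the slower version looks faster
With these made-up numbers the slower version wins a single head-to-head run almost a fifth of the time, which is far too often to base a decision on.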

In order to get a good idea of which one is actually faster, you need to do some analysis, and for this analysis you need to know the variance of the running time. Unfortunately you cannot estimate the variance from a single test (mathematically it results in a divide by zero), so you need several tests of both versions in order to build a decent sample. How many tests? That depends on a great many factors, but if we assume that the randomness is a random variable that is normally distributed (I'll give a tiny bit of info at the end about what happens if it isn't) then the number of tests depends on what level of confidence you want in your results and how precise you want your estimates to be. If you want to be 95% sure you are correct instead of 90% sure, you'll either need to accept a lower level of precision or run more tests. Likewise, if you want to be more precise, you'll need to either accept a lower level of confidence in your results or run more tests.
Finally, the number of tests you'll need depends on the variance of the randomness in your tests: the higher the variance, the more tests you need to reach the same precision.
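For a rough answer to "how many tests?", R's power.t.test() will solve for the sample size once you pick a significance level, a power, a guess at the variability, and the smallest difference you care about (all the numbers below are invented for illustration):
# How many paired runs to reliably detect a 0.5 ms difference in mean runtime,
# if the run-to-run differences have a standard deviation of about 2 ms?
power.t.test(delta = 0.5, sd = 2, sig.level = 0.05, power = 0.9, type = "paired")
Treat the answer as a ballpark figure rather than a rule; it depends entirely on the guesses you feed in.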

So let's say we've got some data for our two versions of the code and we want to know which one is faster. Unfortunately it is not as simple as taking the average of one and subtracting the average of the other, since that doesn't take into account the variance of the two distributions.

One way to do this is to compute a new dataset containing the pairwise differences between the two samples. You would do this with the formula (assuming x_t is the t-th observation from one version of the code and y_t is the t-th observation from the second version):
d_t = x_t - y_t
You can then take the average and variance of the d_t's and make a conclusion. We want to test whether the mean of the differences is zero, because if it is then we cannot say that there is a difference between these two samples. We do this by computing a t-statistic, which uses the following formula (where d is the set of differences and n is the size of d):
t = average(d) / sqrt(variance(d) / n)
You then compare this value to the t-distribution with n - 1 degrees of freedom (I'll explain what the degrees of freedom are in a future post) to see the probability of getting a t-statistic at least this extreme if the actual mean of the differences were zero. If this probability is very low, then we can say with a certain level of confidence that our assumption of a zero mean is wrong. This probability is called the p-value, and using R you can compute it like this:
2 * pt( -abs(t), df = n - 1)
This will give you a value between 0 and 1: the probability of drawing a sample that produces a t-statistic at least this extreme if the mean of the differences really were zero. If this probability is really low, then it is more likely that our assumption of a zero mean is wrong. And if the mean of the differences is not zero, then one of the two versions of the code is faster - namely the one with the lower average runtime.
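Putting the pieces together, here's a small worked example in R; the runtimes are invented purely for illustration, with real data you'd plug in your own measurements:
x <- c(12.3, 11.8, 12.9, 12.1, 12.6, 11.9, 12.4, 12.2)   # runtimes (ms) of version one
y <- c(11.1, 11.5, 10.9, 11.3, 11.0, 11.6, 11.2, 11.4)   # runtimes (ms) of version two, same inputs
d <- x - y                        # the differences d_t = x_t - y_t
n <- length(d)
t <- mean(d) / sqrt(var(d) / n)   # the t-statistic from above
2 * pt(-abs(t), df = n - 1)       # the p-value
t.test(x, y, paired = TRUE)       # or let R do the whole thing in one call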

You can also do this using confidence intervals. This is where you compute a range, and if zero is in this range then you cannot say that the difference is not zero. For a confidence interval you need a t-critical value for your level of confidence. If you want a confidence level C (as a fraction, say 0.95), you can get the t-critical value in R with the following formula (you divide by 2 because qt() gives a one-tailed quantile, so the (1 - C) probability has to be split between the two tails):
abs(qt((1 - C) / 2, df = n - 1))
You can then create your confidence interval:
lower_bound = average(d) - tcrit * sqrt(variance(d) / n)
upper_bound = average(d) + tcrit * sqrt(variance(d) / n)
If zero is in this interval, then you cannot say that there is a difference between the runtimes of the two functions.
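Continuing the invented example from above, a 95% interval looks like this in R:
C <- 0.95
tcrit <- abs(qt((1 - C) / 2, df = n - 1))
mean(d) - tcrit * sqrt(var(d) / n)                       # lower bound
mean(d) + tcrit * sqrt(var(d) / n)                       # upper bound
t.test(x, y, paired = TRUE, conf.level = C)$conf.int     # same interval straight from t.test()
For those invented numbers the whole interval sits above zero, so we'd conclude that the second version really is faster.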

A quick note on normality: we assume here that the randomness in the runtime is normally distributed. If it isn't, then the "t-statistic" we calculate does not actually follow a t-distribution, and the t-statistics and p-values we compute are pretty meaningless. However this problem can be remedied by using a large sample: as the sample size goes to infinity, the distribution of our "t-statistic" converges to a normal distribution, and we can still use the p-value. This also requires a few assumptions, which I will talk about in later posts.


Feb 14, 2010

Jarm

Well, it's time to announce yet another pet project of mine. This time it is a little Javascript game using gameQuery where you wander around planting stuff. I guess it's kinda like that Farmville game that is making its way around Facebook (I personally have no clue what that game is about, I avoid Facebook applications like the plague), but I got the idea more from a board game called Agricola, although that game and mine are not all that similar.

It's still far from finished; all you do right now is go around finding seeds, planting them, watching them grow, and then picking them. I'm hoping to add some sort of money system so that you can buy stuff to advance your garden-growing skills, and also some random events that could hinder your garden's growth. Hopefully someday I'll add some sort of points system so that after X amount of time the game ends, your points are tallied up, and you have to somehow beat your old high score.

You can check out the source at GitHub if you're interested in playing it. It runs just fine in Firefox 3+ and Chrome (both tested on Linux only, not sure about anything else).

Feb 11, 2010

gameQuery

I've been working recently with a jQuery plugin called gameQuery, which is a basic 2D game development library for Javascript. It is designed to help you make simple browser games without the use of Flash or Java - which is nice, since Flash development doesn't work so well in Ubuntu (Adobe does provide an ActionScript compiler for Linux, but it doesn't really have any tools with it) and Javascript is a lot easier to program in than Java.

Right now it is still very basic, however it handles certain primitives for you quite well. Animations are implemented using tiled images (where you put all your frames into a row/column of a single image), which is very handy and much better than using animated .gifs. It will also preload those animations so the loading doesn't have to happen during gameplay. There are also mechanisms for grouping different animations into layers so that you can have different background/foreground groups, sprite groups, etc.
There are also some nice little add-ons for the game loop and input handling.

Other than that, there isn't really much to it. It has sound, although it is experimental and is not guaranteed to work (I haven't tested that part out yet). Supposedly it has collision detection, but it doesn't seem to work for me. Maybe I'll fix it and submit a patch.

I've been expanding on it a lot. I've written two view classes that handle a scrolling background: you can either scroll directly by calling scroll(dx, dy), or use a "locked view" that follows a particular sprite (it nicely handles the edge of the screen!).
I've also built in a more efficient object management system using quad-trees, which are a space-partitioning data structure. This means that when you do location-dependent things like collision detection you don't need to check every single object in the game, only the ones nearby. It might also help with animation rendering, since you wouldn't need to update animations that aren't visible.

Anyway give it a shot if you like, but remember that it is still pretty alpha! And don't bother with IE since it is really too slow for something like games.

Feb 9, 2010

On Converting People to Ubuntu

Personally I don't care too much if others like Ubuntu or not, so I don't really attempt to push it on any of my friends - back when I was still a Windows guy I absolutely hated Linux fanboys always trying to force it on me, citing irrelevant reasons like "it's free!" or something like that (my response was "yes, but does it work?", although that leads into a complicated discussion of the definition of "working" and we won't go there today). These days freedom is a lot more important, but I still don't think that it should be forced on them - "you have to understand, most of these people are not ready to be unplugged":

I guess this video isn't completely relevant to what I'm saying, I just like the message in that scene :) also the ending of that clip is good.

Anyway getting to my main point here, I think that many of the Linux geeks who preach about Linux would do well to read Ubuntu's guidelines on converting users: https://wiki.ubuntu.com/ConvertFriends.

Now I'll make a playful jab at my Mac user friends ;) scroll up a bit and read my sentence that says "freedom is a lot more important." That's one of my main reasons for not going with Apple - I feel bad enough owning an iPod Shuffle.

Back to Jaunty

I managed to brick my Ubuntu install today by switching the home partition from XFS to ext3. Don't ask why I had my home partition as XFS; it was one of those "it seemed like a good idea at the time" things and I really don't have much of a reason beyond that. Why did I switch it back? First, I like having the ability to shrink a partition, and since my Windows one is full of games, the home partition is really the only one left - I was thinking of sticking the Lucid alpha in there to see what happens. Second, I couldn't seem to find an easy-to-install XFS driver for Windows, and sometimes it is handy to be able to access your home folder from Windows - read-only of course, I have some kind of paranoia about Windows and I'm afraid that if I allow the Windows partition write-access to my Ubuntu partitions then all hell will break loose.

Then I switched back to Jaunty because my Karmic LiveCD wouldn't boot.

How do I feel now that I'm back to Jaunty? This is probably because it's a brand new install (or maybe ext3 is just that much faster than XFS), but it's speedy. It boots way faster than Karmic did. Things are much more snappy, Firefox starts right away, and things actually respond when I click on them and not 5-10 seconds later. It's nice. Oh, and my webcam works again since I'm back on a 2.6.28 kernel.
What's bad? Old versions of things - I'm back on Firefox 3.0 which is a bit slower than 3.5. I can probably upgrade, but that is something for later. I've also noticed some slight rendering differences on pages that I had open before and after the reinstall, which is interesting. I didn't think that there was that much of a difference. Other than that, I don't really have a problem with going back, and I might just hang out until Lucid comes out to give that a shot!

Feb 6, 2010

Programming the Wiimote in Ubuntu

This post has been a long time in coming, but I finally got up off my ass and decided to figure out how to code using a Wiimote in Ubuntu.

First, you need a Wiimote (obviously). Note that you don't actually need a Wii to do this, you can just head on down to Future Shop or whatever and grab a Wiimote. It should still work.
Second, you need a Bluetooth adapter. I've got a RocketFish micro-dongle thing which works just fine; I'll leave it to you to find something for your machine.
Third, I'm going to assume you have Bluetooth working on your computer, otherwise there isn't much hope for you since the Wiimote is a Bluetooth device.

Let's start by grabbing the required libraries. In Ubuntu you can go:
sudo apt-get install lswm libcwiid1-dev libbluetooth-dev wmgui
To test to see if your adapter can detect the Wiimote, just run this command and press the 1 and 2 buttons:
lswm
If it displays a bunch of hex characters, then your Wiimote has been detected and you're set to go. If it doesn't display it right away and the LEDs are still flashing on the Wiimote, run it again since lswm doesn't wait very long and you might have missed it.

You can also test it out using wmgui, which basically shows you all the various sensors the Wiimote has in it: the buttons, the accelerometer, and the IR sensor. It's pretty easy to get going; the only tricky part is when you want to start using the IR. Since the Wii sensor bar is actually just a bunch of infrared LEDs, all you need is an IR light source. Unfortunately the Wii's bar doesn't plug into my computer (it has some weird plug that connects to the Wii), so I went with something simpler - a candle. If you actually start programming with the IR it'd probably work better with two candles so that you can detect the orientation of the Wiimote, but it still works fine with one.

Now let's get to some code! I'll be using C with gcc. There's a Python version of this library but I haven't tested it and don't know how it works.
First, we include the needed libraries:
#include <stdio.h>     /* for printf() */
#include <stdlib.h>    /* for exit() */
#include <bluetooth/bluetooth.h>
#include <cwiid.h>
We then set up the various structs we need to interact with the Wiimote:
bdaddr_t bdaddr = *BDADDR_ANY;
cwiid_wiimote_t * wiimote = NULL;
We obviously need a cwiid_wiimote_t struct in order to call functions on it; we also need the bdaddr_t struct, which holds the Bluetooth address of the Wiimote (remember those hex digits that lswm printed out?). Using BDADDR_ANY means we'll connect to the first Wiimote found.

Next we set up the Wiimote. Normally you would wrap these up in if statements for error checking, but I'm going to leave those out for clarity:
/* Connect to the Wiimote - the CWIID_FLAG_MESG_IFC
   means that we want to use a callback method for
   interacting with the device */
wiimote = cwiid_open(&bdaddr, CWIID_FLAG_MESG_IFC);

/* Now we register a callback. I'll define this later. */
cwiid_set_mesg_callback(wiimote, &wii_callback);

/* Set the reporting mode. This says which messages we want
   to receive in our callback. Let's only get buttons for now. */
cwiid_set_rpt_mode(wiimote, CWIID_RPT_BTN);
Then we define our callback function. I'll just make it so that when you press B the program will quit:

void wii_callback(cwiid_wiimote_t * wiimote,    /* The Wiimote struct */
                  int msg_count,                /* The number of messages received */
                  union cwiid_mesg msgs[],      /* The messages themselves */
                  struct timespec * timestamp   /* The timestamp of the callback */)
{
    int i;
    for (i = 0; i < msg_count; i++){
        if (msgs[i].type == CWIID_MESG_BTN){
            if (msgs[i].btn_mesg.buttons & CWIID_BTN_B){
                printf("B button pushed!\n");

                /* Close the connection to the Wiimote and quit */
                cwiid_close(wiimote);
                exit(0);
            }
        }
    }
}
The way the callback works is that it sends you an array of the messages that have occurred. Each message is a union of the different message types, determined by the type field. The button message includes an integer in which each bit says whether a particular button is pushed. There are many other types of events; you can check them all out by browsing the cwiid header (in Ubuntu it installs to /usr/include/cwiid.h) and by looking at the cwiid docs. Have fun playing with the Wiimote!

If you want to enable other functions, you just use a bitwise OR for the reporting mode:
/* Enable the buttons, accelerometer, and nunchuk */
cwiid_set_rpt_mode(wiimote, CWIID_RPT_BTN | CWIID_RPT_ACC | CWIID_RPT_NUNCHUK);