Feb 24, 2010

The Many Axes of Scalability

Many times I've heard talk about scalability, and questions about whether technology X can scale, etc. For the most part I've taken the meaning of this word for granted, but like many things in the computer world it seems to have multiple definitions, and when people talk about scalability they are sometimes referring to different things. So this here is my attempt to document some of the definitions of scalability. This is by no means an exhaustive or perfect list, so feel free to comment with a definition of your own.

The first two are more technical in nature, as they refer to the system itself. The last two are more social in nature, but they are important nonetheless, and I've heard people refer to the scalability of a system/language when they were really talking about one of these two human definitions.

The first is when it comes to dataset size. This definition of scalability refers to how the system responds to an increase in the size of the dataset. If we increase the size of the dataset by an order of magnitude, will the time it takes to process things also increase by an order of magnitude? Or will it be higher, or lower? For example, a system with O(n) processing time will see its processing time grow by an order of magnitude as well, whereas a system with O(log n) processing time will only grow by a constant amount for each order of magnitude.
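As a rough illustration, here is a minimal Python sketch (the dataset sizes and the worst-case target are just assumptions for the example) that times a linear scan next to a binary search as the dataset grows by orders of magnitude:

```python
import bisect
import timeit

def linear_search(data, target):
    # O(n): scan every element until we find the target.
    for i, value in enumerate(data):
        if value == target:
            return i
    return -1

def binary_search(sorted_data, target):
    # O(log n): repeatedly halve the search range (via the bisect module).
    i = bisect.bisect_left(sorted_data, target)
    if i < len(sorted_data) and sorted_data[i] == target:
        return i
    return -1

for exponent in range(3, 7):       # 1,000 up to 1,000,000 items
    n = 10 ** exponent
    data = list(range(n))
    target = n - 1                 # worst case for the linear scan
    linear = timeit.timeit(lambda: linear_search(data, target), number=10)
    binary = timeit.timeit(lambda: binary_search(data, target), number=10)
    print(f"n={n:>9,}  linear={linear:.4f}s  binary={binary:.6f}s")
```

The linear version's time grows roughly tenfold with each order of magnitude, while the binary search barely moves.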
In order to improve this sort of scalability, you usually need to change the algorithm you're using. This requires analysis of both your code and the libraries you've built upon to determine the computational complexity relative to your dataset. When it comes to your code, the language you use usually isn't a huge issue compared to your algorithm (although if you're using a good algorithm you would probably see some performance increase moving to a faster language). When it comes to your libraries, the most important aspect is the ability to replace certain slow portions of them. Open-source libraries are great here; otherwise a good modular design where you can replace certain components if necessary is a good bet.
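As a hedged sketch of what "replaceable components" can look like in practice (the KeyValueStore interface and record_visit function are hypothetical, purely for illustration): have your code depend on a narrow interface of your own rather than directly on a concrete library, so a slow implementation can be swapped out later without touching the callers.

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """The narrow interface the rest of the application depends on."""

    @abstractmethod
    def get(self, key):
        """Return the stored value for key, or None if it is missing."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key."""

class InMemoryStore(KeyValueStore):
    """A dict-backed implementation; fine until the dataset outgrows RAM."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

# Callers only ever see KeyValueStore, so swapping InMemoryStore for a
# disk-backed or networked implementation is a local change, not a rewrite.
def record_visit(store, user):
    count = int(store.get(user) or 0)
    store.put(user, str(count + 1))
```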

The second one is with traffic. You don't need to be working with large datasets to see scalability problems; if your system is getting hit too quickly, other problems may occur. This was the case when I was working on Pornhub: we didn't have huge datasets, but the rate at which the database was being hit was the main cause of our scalability problems.
In this case you basically have a lower bound, which is the processing time of the simplest action you can do. If requests arrive faster than one per that lower bound, then you will begin to notice traffic issues.
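To make that concrete, here is a back-of-the-envelope sketch (the 5 ms figure is just an assumption for illustration): the cost of the cheapest request puts a hard ceiling on how much traffic one processing unit can absorb.

```python
# Hypothetical numbers, purely for illustration.
cost_per_request = 0.005          # 5 ms of processing for the simplest action
ceiling = 1 / cost_per_request    # maximum requests per second for one unit

print(f"One processing unit tops out around {ceiling:.0f} requests/second")
# Anything arriving faster than that queues up, and latency starts to climb.
```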
A few ways to alleviate this problem are:
1) Concurrency - set it up so that the system can handle multiple things at the same time if needed. This is a decent way to scale, but it has diminishing returns. Going from one to two concurrent processing units will double the amount of traffic you can handle (provided good load-balancing, of course), however going from two to three only adds another 50%, and going from 100 to 101 adds about 1%. There will always be some benefit, but the relative effect gets much smaller as you increase the number of processing units.
A problem with increasing the number of processing units is managing the interactions between them. At some point it's possible that the cost of managing concurrent units will outweigh the benefit of having them.
2) Reducing the lower bound - the lower bound is determined by the slowest aspect of your system (commonly known as a bottleneck). There is no formula for reducing the lower bound other than identifying the bottlenecks and finding a way around them. Sometimes you can't completely get around a bottleneck, but you can reduce the amortized lower bound by avoiding that bottleneck as much as possible. An example of this is caching - your lower bound is usually a slower medium like a hard-drive, database, or the Internet, and you can reduce the amortized lower bound by storing a temporary copy of your data in a faster medium like RAM (or, in the case of a processor, in cache memory).
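Here's that caching idea as a minimal sketch (the slow_lookup function and its 50 ms delay are hypothetical stand-ins for a database or network call): the first request for a key still pays the full cost, but repeat requests are served from memory, so the amortized cost per request drops.

```python
import time

def slow_lookup(key):
    # Stand-in for the real bottleneck: a database query, disk read, or HTTP call.
    time.sleep(0.05)                 # pretend this costs 50 ms
    return f"value-for-{key}"

_cache = {}

def cached_lookup(key):
    # Serve from RAM when we can; only fall through to the slow medium on a miss.
    if key not in _cache:
        _cache[key] = slow_lookup(key)
    return _cache[key]

start = time.time()
for _ in range(100):
    cached_lookup("front-page")      # 1 miss + 99 hits
elapsed = time.time() - start
print(f"100 lookups took {elapsed:.3f}s, amortized {elapsed / 100 * 1000:.1f} ms each")
```

A real cache also has to deal with invalidating stale entries and bounding its size, but the amortized-cost idea is the same.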

A third definition is with team size. If you hold all else equal and increase the number of programmers on a team, each added programmer will have a lower benefit than the one before (in economics this is called diminishing marginal productivity of labour; it's observed in many more fields than just programming). In fact you can get to a point where adding programmers to a project brings about a net decrease in productivity, as the other programmers have to spend time getting the new programmer up to speed - hence the programming proverb, "adding manpower to a late software project makes it later".
The rate of decrease depends on many factors, including the nature of the project, the skill/experience of the programmers involved, the design of the system, and the language/framework being used (listed in what I believe, but cannot prove at the moment, is the order of magnitude of their effect, from most important to least). However, the rate of decrease tends to be specific to the project, so in order to improve this sort of scalability you will need to analyze your project and determine what can be done.

A fourth definition, which is related to the third, is scalability with respect to programmer ability and experience. Keep the number of programmers constant but replace a good programmer with someone worse, and the effect on the system will change (obviously). As a side note, you don't usually fire good programmers and replace them with bad programmers; however, when you are adding a member to the team (which is why I said this is related to the third definition), the effect of that new member's ability/experience on the productivity of the team is what I mean by this sort of scalability.
A system is more scalable in this regard if, when you decrease the ability/experience of a programmer on the team, the drop in the team's productivity (assuming, of course, that productivity and ability/experience are positively correlated) is smaller than it would be in a less scalable system.
What can you do to improve this sort of scalability? As before, this depends on a large number of factors, including the nature of the project, the development practices, the design of the system, and the language/libraries used[1]. As with the previous definition, the best way to improve this sort of scalability is through analysis of your project.

In order to improve any of these types of scalability, the most important thing is analytical skills. Your ability to look at your system and analyze what is limiting it is a necessary condition for improving each type of scalability. If you cannot analyze the system properly, then you don't really have much hope of improving it. You can always just follow other people's advice and good practices (and it is a good idea to do this regardless), but they are not guaranteed to help you in all cases.
On the other hand, good analytical skills alone are not sufficient to improve the scalability of a system. It's possible that you have actually exhausted the possible improvements, and there is a fundamental limit to what you are trying to do.
Finally, you cannot always have the best of everything. Sometimes you have to reduce one type of scalability in order to improve another. It's possible that in order to make your project easier to manage for more people, it has to become slower and less memory-efficient. This is not guaranteed, but it is possible.

Anyway, as I mentioned before, this is not an exhaustive list of the definitions of scalability, just a few of the ones I've heard people talk about. Feel free to add your comments below.

[1] I know that some people will hate me for saying this, and I can't really back it up with data, but my suspicion is that, all else equal, dynamically typed languages like Ruby and Python do not scale in this regard as well as statically typed languages. I say this because the techniques for avoiding typing problems in dynamically typed languages rely on the programmer's discipline and experience with this type of programming instead of on an automatic process, and an automatic process is much more reliable. Feel free to argue this, although if you don't have a good argument I'll probably ignore you.
