May 29, 2008

Reduce

It seems like there are a lot of programmers out there who don't know or understand what reduce (in Ruby, inject; in Scheme, fold) is. In a computer world where parallelism is becoming increasingly important, I feel that this is a basic concept that should be understood.

Reduce is a higher-order function for converting a collection of data into a single value.
It takes three parameters: a function f, a list L, and an initial value s. f itself takes two parameters and returns one.
Reduce passes s and the first element of L to f, then takes the value returned by f together with the second element of L and passes those to f, and so on. When it hits the end of the list, the result of reduce is the last value returned by f. In pseudocode:
function reduce(f, L, s)
  if L is empty return s
  set v to s
  for each element i in L
    set v to f(v, i)
  return v

or in functional-ish style:

reduce(f, L, s) = if L == []
                  then s
                  else reduce(f, tail L, f(s, head L))
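
And here's a minimal runnable version in Ruby, just to make it concrete (a hand-rolled sketch for illustration; Ruby's built-in equivalent is Enumerable#inject):

# A hand-rolled reduce, mirroring the pseudocode above.
def my_reduce(list, start)
  value = start
  list.each { |element| value = yield(value, element) }
  value
end

my_reduce([1, 2, 3, 4], 0) { |a, b| a + b }   # => 10
# Same thing with Ruby's built-in inject:
[1, 2, 3, 4].inject(0) { |a, b| a + b }       # => 10
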
So what's the big deal with this? You can do this sort of thing just fine with a for/foreach loop and a variable. On a single processor or thread, using reduce is probably slower, and for certain functions it doesn't work very well at all. On the face of it, this is pretty useless.

Let's look at this mathematically for a second. A function f is associative if the following holds for all a, b and c:
f(a, f(b, c)) = f(f(a, b), c)
Examples of associative functions include addition, max/min, merging two sorted lists (the merge step of merge sort), and so on. On top of this, if f is commutative ( f(a, b) = f(b, a) ), then we don't even need to keep track of the order of the arguments.
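
Concretely, in Ruby terms:

(1 + 2) + 3 == 1 + (2 + 3)                     # => true, addition is associative
1 + 2 == 2 + 1                                 # => true, and commutative too
[[5, 2].max, 9].max == [5, [2, 9].max].max     # => true, max is associative as well
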
Assuming that f is associative, and s is the identity value of f (for all possible values a that can be passed to f, f(a, s) = f(s, a) = a), then the following holds:
reduce(f, L, s) = f(reduce(f, L1, s), reduce(f, L2, s))
where L1 and L2 are the two halves of L (and if f is commutative, the order of L1 and L2 doesn't even matter). The great thing about this is that it scales perfectly to two processors. One processor does the first reduce, the other processor does the second reduce. Voila, we have just (roughly) halved the time it takes to process this list!
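
A quick way to convince yourself of that identity, sketched in plain Ruby with addition as f and 0 as s:

list  = [3, 1, 4, 1, 5, 9, 2, 6]
left  = list[0...list.size / 2]
right = list[list.size / 2..-1]

whole = list.inject(0)  { |a, b| a + b }
split = left.inject(0)  { |a, b| a + b } + right.inject(0) { |a, b| a + b }
whole == split   # => true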

We can take this further. Suppose we have n machines and a big dataset. We can split our dataset into n equal parts and tell each machine to reduce a part, and then one machine puts all the reduce results into a list and reduces that list. The time it took to process our dataset has been divided by n!
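
There's no real cluster here, but threads in plain Ruby give the shape of it (note that MRI's threads won't actually run these chunks in parallel, so this is only an illustration of the structure, not a real speedup):

data   = (1..1_000_000).to_a
n      = 4
chunks = data.each_slice(data.size / n).to_a

# Each "machine" reduces its own chunk...
threads  = chunks.map { |chunk| Thread.new { chunk.inject(0) { |a, b| a + b } } }
partials = threads.map { |t| t.value }

# ...then one machine reduces the list of partial results.
partials.inject(0) { |a, b| a + b }   # => 500000500000, same as reducing data directly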

We can take this even further. Suppose we now have a dataset bigger than Sidney Crosby's ego (if that's possible). Designate one machine as a master (make sure it is a reliable super machine). This machine takes our dataset, splits it up into chunks and assigns them to a bunch (or maybe several thousand) of slave machines. Those slaves then reduce each little piece they were given, as before. This time, however, the master occasionally polls each slave to see if it has finished. If it has, the master grabs a new piece of the dataset and assigns it to that slave. If a slave doesn't respond, the master can assume it has failed and give the piece it was working on to a different slave. So now we don't have to worry about our slaves failing.
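
Here's a toy, single-process simulation of just the retry idea (the random "failures" and the queue are made up purely for illustration; this isn't what real MapReduce code looks like):

# Pretend each chunk is handed to a worker, and some workers randomly "die".
chunks  = (1..100).each_slice(10).to_a
results = []

until chunks.empty?
  chunk = chunks.shift
  if rand < 0.2
    chunks.push(chunk)    # the worker didn't respond; re-queue its chunk
  else
    results << chunk.inject(0) { |a, b| a + b }
  end
end

results.inject(0) { |a, b| a + b }   # => 5050, no matter how many workers "died"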

So what has reduce given us? A massively scalable, failure-tolerant processing system. Who uses things like this? Google immediately comes to mind (their algorithm is called MapReduce, after all).

Reduce isn't the only function that scales like this; it just seems to be the one that's hardest to understand. Reduce turns a list into a single value; two others that come to mind are map and filter. Map takes a list and a function and applies the function to each element, returning a new list of the results. Filter takes a boolean function and a list and returns all the elements of the list for which the function is true.
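
In Ruby those look like this (select being Ruby's name for filter):

[1, 2, 3, 4].map    { |x| x * x }     # => [1, 4, 9, 16]
[1, 2, 3, 4].select { |x| x.even? }   # => [2, 4]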

Concepts like these are a great way to write code that scales easily, and they're things that every programmer should be familiar with.

2 comments:

Edward said...

I love Enumerable#inject and only after reading this did I remember that it’s also called reduce in other parlance. Thanks Rob!

Hey, can I make a request? Cover Enumerable#zip and its equivalents. I realized the other day how useful this thing was and how the name finally makes sense to me while I was comparing files in two different directories by doing something like

Dir['something/*'].zip(Dir['something_else/*']).each {|first, second| puts `cmp #{first} #{second}` }

Rob Britton said...

Zip is handy, although I've never used it in Ruby (my experience with it was only in Haskell). A quick peek at the Ruby docs gives you enough info, though. In fact you don't even need the .each there, since zip can take a block as well and does the exact same thing.

What zip does is package one or more lists into a list of arrays, where each element of the result is an array containing the elements at that position from each of the parameter lists. In Ruby, if you pass a block to zip, then instead of returning an array it applies the block to each group of elements.

If the lists have different sizes, I believe what happens is language dependent. Ruby returns a list the same length as the receiver, padding with nil where the other lists are shorter and dropping their extra elements where they are longer. Haskell returns a list the length of the shorter list and discards the rest. So there's no single standard behaviour.
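
For example, as far as I can tell from the Array#zip docs:

[1, 2, 3].zip([:a, :b])                       # => [[1, :a], [2, :b], [3, nil]]
[1, 2].zip([:a, :b, :c])                      # => [[1, :a], [2, :b]]
[1, 2, 3].zip([4, 5, 6]) { |pair| p pair }    # prints each pair, returns nil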