Reduce is a higher order function for converting a collection of data into a single value.
It takes 3 parameters: a function f, a list L, and an initial value s. f takes two parameters and returns one.
Reduce passes s and the first element of L to f, then takes the return value from f and the second element of L and passes that to f, and so on. When it hits the end of the list, the return result of reduce is the return result of f. In pseudocode:
function reduce(f, L, s)So what's the big deal with this? I can do this type of thing just fine with a for/foreach loop and a variable. For a single processor or thread, it is probably slower if you use reduce. Also for certain functions, reduce doesn't work very well either. This is pretty useless.
if L is empty return s
set v to s
for each element i in L
set v to f(v, i)
or in functional-ish style:
reduce(f, L, s) = if f == 
else reduce(f, tail L, f(s, head L)))
Let's look at this mathematically for a second. If f is associative, then the following is true:
f(a, f(b, c)) = f(f(a, b), c)Examples of functions that are associative are: addition, max/min functions, merge sort, etc. On top of this, if f is commutative ( f(a, b) = f(b, a) ), then we don't even need to keep track of the order.
Assuming that f is associative, and s is the identity value of f (for all possible values a that can be passed to f, f(a, s) = f(s, a) = a), then the following holds:
reduce(f, L, s) = f(reduce(f, L1, s), reduce(f, L2, s))where L1 and L2 are the two halves of L (if f is commutative, it doesn't matter the order in which L1 and L2 come). The great thing about this is that this perfectly scales on two processors. One processor does the first reduce, the other processor does the second reduce. Voila, we have just halved the time it takes for us to process this list!
We can take this further. Suppose we have n machines and a big dataset. We can split our dataset into n equal parts and tell each machine to reduce a part, and then one machine puts all the reduce results into a list and reduces that list. The time it took to process our dataset has been divided by n!
We can take this even further. Suppose we now have a dataset bigger than Sidney Crosby's ego (if that's possible). Designate one machine as a master (make sure it is a super machine that is reliable). This machine then takes a piece of our dataset, splits it up into chunks and assigns it to a bunch (or maybe several thousand) of slave machines. Those slave machines then reduce each little piece that they were given, as before. However this time, the master occasionally polls each machine to see if that machine has finished. If it has, then it grabs a new piece of the data set and assigns it to that machine. However if the machine doesn't respond, the master can assume that machine has failed and give the piece that the slave was working on to a different slave. So now we don't have to worry about our slaves failing.
So what has reduce given us? A massively scalable, failure-tolerant processing system. Who uses things like this? Google immediately comes to mind (their algorithm is called MapReduce, after all).
Reduce isn't the only function that scales like this, just seems to be the one that is the most difficult to understand. Reduce is a method of converting a list into a single value. Two others that come to mind are map and filter. Map takes a list and a function and applies the function to each element, returning the list with the modified elements. Filter simply takes a boolean function and a list and finds all the elements in the list for which the function is true.
These types of concepts applied are great ways to make easily scalable code and are things that programmers should be familiar with.