Apr 16, 2013

Distributed Coding at sweetiQ

It's been a long time since I really wrote anything on this blog, so it is time for a little update.

Just over a year ago I joined another startup called sweetiQ. It's a step up from the other startups that I've worked for in that it has actually gone live and it is actually making money. It's also interesting because this is the longest amount of time I've spent working for a single company.

The system uses RabbitMQ to drive a distributed computing cluster that essentially implements the Actor model of computation, except that rather than working with threads within a single machine, we're using multiple processes on multiple machines (100+). Each process has a set of endpoints that listen on different queues, and RabbitMQ handles dispatching messages to the different worker nodes.
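
To give a feel for what one of these endpoints looks like, here is a minimal sketch of a worker consuming from a queue. It's purely illustrative - the choice of Python with the pika client, the queue name, and the JSON message format are assumptions made for the example, not a description of our actual stack:

import json
import pika

# Hypothetical endpoint: consume "fetch_twitter" messages and ack when the work is done.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="fetch_twitter", durable=True)

def handle_message(ch, method, properties, body):
    task = json.loads(body)        # assume messages are JSON task descriptions
    # ... do the actual work for this endpoint here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # only ack once the work succeeded

channel.basic_qos(prefetch_count=1)  # hand each worker one unacked message at a time
channel.basic_consume(queue="fetch_twitter", on_message_callback=handle_message)
channel.start_consuming()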

One of my major contributions is what I'm calling a "distributed job control system." There are many steps in each "job" that the system must do, and each step of the job may be handled by a different process across different machines. As a particular computing job becomes more complex, several problems arise:

  • managing the dependency structure between the various components of each job. In a simple sequential system the dependency structure is trivial: if x depends on y, do y before x. In a distributed system B may not depend on A, so you can run them in parallel, but if C depends on the output of both A and B then the system cannot start C until both A and B are done. It is not enough to have the node handling C wait for responses from A and B, because the messages from A and B may not be delivered to the same node that handles C. To complicate things further, sometimes C needs the computation from A and sometimes it doesn't - for example, if we're aggregating social data from different networks but the user hasn't linked their Twitter account, we don't need to fetch data from Twitter. (See the sketch of a job description after this list.)

  • the possibility of failure increases - sometimes a node will lose its database connection. Sometimes a node will die (example: we use Amazon spot instances, which can shut down at any time). Sometimes a node that attempts to fetch something on the Internet will fail for whatever reason (the API version of Twitter's fail whale is something that has happened to us relatively frequently). The process needs some sort of elegant failure handling - often the solution is just to try again later. Other times we need to send a message to the user informing them that they need to take some sort of action (if a user changes their Facebook password, it invalidates the stored OAuth token).

  • need to rate-limit certain jobs - some jobs can have many sub-components done in parallel, and we have run into problems when sending out too many messages at once. The first and most obvious problem we hit is that the database will choke; however, once we learned how to use MongoDB properly this became a non-issue (having scaled both MySQL and MongoDB databases, I can tell you I am fully sold on MongoDB at this point). The bigger issue was process starvation: at peak times jobs begin spawning at an enormous rate, and if we keep sending parallel messages for stage 1, the computation nodes spend most of their time processing messages for stage 1 (Rabbit has no concept of priority). The control system needs to detect starvation and alleviate it (there are a variety of ways we can do this).

  • recursive job types - our data is hierarchical. A user can categorize components of what they "own" in groups and sub-groups, and may want to be able to break down aggregated information at each level. Each job may need to perform itself separately on a child node, which may in turn need to perform itself on its own children, etc. Having some sort of way to easily specify recursive sub-jobs saves a ton of time.
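
To make these requirements concrete, here is a rough sketch of what a high-level job description could look like. The field names and structure are hypothetical - made up for illustration, not our actual format:

# A hypothetical job: fetch from two networks in parallel, then merge the results.
aggregate_social_job = {
    "name": "aggregate_social",
    "steps": {
        "fetch_facebook": {"depends_on": []},
        "fetch_twitter":  {"depends_on": [], "skip_unless": "user.twitter_linked"},
        "merge_metrics":  {"depends_on": ["fetch_facebook", "fetch_twitter"]},
    },
    "max_parallel": 20,           # crude cap on fan-out, for rate limiting
    "recurse_on_children": True,  # run the same job on each child group in the hierarchy
}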


What I ended up building is a system that takes a high-level description of a distributed job and controls that job across the different nodes. Rather than having the endpoints communicate directly with one another, they communicate with a central controller that tracks their progress and decides which control points to activate. The huge benefit is the separation of concerns: the endpoints can focus on doing their particular computation well, while the "bigger picture" concerns like starvation and node failure are handled by the control system.
The system can handle recursive job structures, and in fact it can handle any type of sub-job: any job can trigger any other job as a child, including itself. This makes it trivially easy to run just one component of a job without having to go through everything else. It also lets you stay DRY: you can abstract out commonly-used components and use them as a "library" of sorts to compose more complex jobs.
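
As a sketch of what the controller does with a description like the one above (again, hypothetical code, not our implementation): every time an endpoint reports that a step has finished, the controller works out which steps now have all of their dependencies satisfied and dispatches messages for them. (The skip_unless case is ignored here; skipped steps could simply be added to the finished set up front.)

def runnable_steps(job, finished, dispatched):
    # Steps whose dependencies have all finished and that haven't been sent out yet.
    return [
        name for name, step in job["steps"].items()
        if name not in dispatched
        and all(dep in finished for dep in step["depends_on"])
    ]

def on_step_finished(job, state, step_name):
    # Called whenever an endpoint reports that a step is done.
    state["finished"].add(step_name)
    for name in runnable_steps(job, state["finished"], state["dispatched"]):
        state["dispatched"].add(name)
        publish(queue_for(name), job)  # publish() and queue_for() are hypothetical helpers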

The code is not currently available, however we are trying to figure out the legal implications of open-sourcing the software. Ideally we'll figure all this out in the near future, and I'll be happy to release it for everyone to play with.

Shameless company promotion: If this type of work interests you, send me a message. Like most dev shops, we're always happy to bring in smart folks.

Feb 21, 2013

Typing as seen by Fanboys

Thought this was kind of funny:

Feb 27, 2012

Vim Relative Line Numbers

A little trick I learned the other day at Montreal.rb: relative line numbering (enabled with :set rnu). The current line shows as zero, and the left-hand column shows each line's distance from your cursor. This makes count-based commands easy: d5k deletes the current line plus the 5 lines above it (the one marked 5 in the margin), and d5j does the same going downward.

Feb 24, 2012

Asynchronous Function Looping in C#

It is often the case in code that you have to do several things in a sequence since each computation is dependent on the one(s) before it:
// ...
// stuff 1
// ...
// stuff 2
// ...
// stuff 3
// ... etc.
Good software practice tells you to break some of these up into methods:

stuff1();
stuff2();
stuff3();
If it gets big, you can even put it all in a collection and iterate (we're starting to get into weird coding now, I don't think anybody would actually do this):
var collection = new List<Action>() { stuff1, stuff2, stuff3 };
foreach (var func in collection) {
    func();
}
Now the part where this would actually be useful. What if some of these functions could potentially be asynchronous? That is, they depend on some value that may not be readily available - maybe user input, maybe some data from a network, etc. Blocking is not usually a great option - a modal dialog demands that the user pay attention to it even if there is something more important somewhere else. It would be better if this computation could "pause" and then resume later on when we get what we need. In some languages, including Scheme and Ruby, you can accomplish this using a construct called callcc:
var collection = new List<Action<Action>>() { stuff1, stuff2, stuff3 };
foreach (var func in collection) {
    // pseudo-code warning
    call_cc(func);
}
Here, call_cc() will call func and pass in a function which will start executing right after the call_cc() call: it is a continuation of the loop. When func is done (or when it receives the response it wants), it can call this function to nicely continue executing the loop.

Unfortunately, C# 4.0 and lower do not support anything like callcc. C# 5.0 will support the await and async keywords which will accomplish exactly what we want, but for the time being we'll have to make do with what we have. How can we do that without callcc?

Let's give it a shot using a recursive function:
void AsyncForeach(IEnumerator<Action<Action>> iter) {
    if (iter.MoveNext()) {
        iter.Current(() => {
            AsyncForeach(iter);
        });
    }
}

void OtherFunc() {
    // ...
    var collection = new List<Action<Action>>() { stuff1, stuff2, stuff3 };

    AsyncForeach(collection.GetEnumerator());
}
This requires that every function in collection adhere to the Action<Action> delegate, and when each one is done it must call the continuation manually in order to resume the computation. This is a bit annoying, and it's why BeginConnect, BeginSend, and the rest in System.Net require an AsyncCallback to invoke when they are done. The new async and await keywords will be extremely useful here, since the continuation gets wired up automatically:
// note: for await to work, the functions need to return something awaitable
var collection = new List<Func<Task>>() { stuff1, stuff2, stuff3 };
foreach (var func in collection) {
    // func doesn't even need to call anything to
    // keep this thing going!
    await func();
}
It is useful to learn from this approach, though. Say we want to halt the loop prematurely from within one of the functions. In that case, the function can simply not call the continuation. That ends the recursion and breaks us out of the loop - the equivalent of the break keyword. To do the same thing with await we'd need some sort of exception handling, a special return type, etc.

We could go even further and implement something similar to Python's for...else construct, where the else block runs only if the loop finishes without hitting a break:
for i in range(10):
    if i == 5:
        break
else:
    # break was hit, so this is not executed
    print "should not run"

for i in range(10):
    if i == 12:
        break
else:
    # the loop ran to completion, so this is executed
    print "should run"
We can get a similar hook by adding failure "continuations" to our functions - here the extra callback fires when the loop is halted early, which is roughly the inverse of Python's else:
void AsyncForeach(IEnumerator<Action<Action, Action>> iter, Action failure) {
    if (iter.MoveNext()) {
        iter.Current(() => {
            AsyncForeach(iter, failure);
        }, failure);
    }
}

void OtherFunc() {
    // ...
    var collection = new List<Action<Action, Action>>() { stuff1, stuff2, stuff3 };

    AsyncForeach(collection.GetEnumerator(), () => {
        // handle failure
    });
}
In this case the functions stuff1, stuff2, etc. will call the first function if they should continue looping, or call the second one in case of failure.

There's one final tweak to all of this. At the moment there are two problems with AsyncForeach: it depends on the type of the list we're iterating over (IEnumerator<Action<Action, Action>>), and it does not close over any variables that we may need for the loop. Can we do this using a closure?

In fact, we can:
var collection = new List<Action<Action, Action>>() { stuff1, stuff2, stuff3 };

// declare looper early so that it closes over itself
Action looper = null;
var iter = collection.GetEnumerator();
looper = () => {
    if (iter.MoveNext()) {
        iter.Current(looper, () => {
            // handle failure
        });
    }
};
// don't forget to start the loop
looper();
Since this isn't so DRY, we can top it all off with a function that returns a function:
Action GetAsyncForeach<T>(IEnumerable<T> collection, Action<T> body) {
    var iter = collection.GetEnumerator();
    return () => {
        if (iter.MoveNext()) {
            body(iter.Current);
        }
    };
}

void OtherFunc() {
    // ...
    var collection = new List<Action<Action, Action>>() { stuff1, stuff2, stuff3 };

    Action checker = null;
    checker = GetAsyncForeach(collection, (current) => {
        current(checker, () => {
            // handle failure
        });
    });
    checker(); // start the loop
}
We now have a DRY, reusable component for implementing an asynchronous foreach loop in our code. It's not the most elegant approach, but it works really well and doesn't require much extra boilerplate (if C# supported a let rec keyword, we could make it even shorter!).

This is a useful way of looping through asynchronous tasks that you may have to do. I found myself needing this sort of thing when calling ShowDialog would lock the entire GUI while the system waited for the user to input something; however, sometimes the user would have to attend to something else before responding to the dialog. Since later actions in the loop depended on the result of the dialog box, a more asynchronous method was necessary.

Ultimately, this is why I believe that all programmers should have some experience with functional programming; this is a technique that would be obvious to a programmer in Lisp or OCaml but might be a bit trickier to someone who just has OO experience. Having functional programming know-how in your toolbelt will make you a better C# programmer.

Feb 12, 2012

Non-deterministic Programming - Amb

I've been very slowly plowing through SICP and I've recently read through its chapter on non-deterministic programming. When you program this way, your variables no longer have just one value: they can take on all of their possible values until constrained otherwise. An example:
x = amb(1, 2, 3, 4)
In this case x is all of 1, 2, 3, and 4. If you try to print out x it will print out 1 because the act of printing it temporarily forces a value, but otherwise you can treat the variable as though it had all of those values.

You can then force certain subsets of the values with assertions:
assert x.odd?
In this case x would become just 1 and 3. If you then added a final assertion that x > 2 you would force a single value and x would be 3. If you instead added the assertion x > 3 then x would have no values: an exception would be thrown saying that x is basically "impossible".

This is useful when you are searching for something. Suppose you're trying to find numbers that satisfy the Pythagorean theorem:
a = amb(*1..10)
b = amb(*1..10)
c = amb(*1..10)

assert a**2 + b**2 == c**2

puts a, b, c
next_value
puts a, b, c
This code would print out 3, 4, and 5 on the first output, followed by 6, 8, and 10 on the second. The next_value function would tell the amb system to find another solution to the set of variables that satisfy the assertions we specified. If nothing else is found, an exception will be thrown.

Even more interesting, there is a library in Ruby called amb that implements this stuff. Unfortunately it doesn't work quite like the above example: you can only get either the first set of values that fits, or all of them:
require 'rubygems'
require 'amb'
include Amb::Operator

a = amb(*1..10)
b = amb(*1..10)
c = amb(*1..10)

# calling amb with no arguments causes it
# to backtrack until the criteria is met
amb unless a*a + b*b == c*c

# prints out the first match
puts "#{a}, #{b}, #{c}"

# prints out every match and then crashes
amb
puts "#{a}, #{b}, #{c}"
This is a pretty cool way to program, and it would be interesting to know when it might be practical to use. I tried it with a couple of problems on Project Euler, but since the backtracking method that amb uses isn't always the most efficient approach, it choked a bit. Perhaps if this gem gets some attention and some love, it might end up as something a bit more performant!

Feb 10, 2012

The Brilliant Design of Magic Ink

I've been plowing through Bret Victor's Magic Ink essay and I've noticed a set of very interesting UI elements that I really enjoy and hope to add to my style of writing (it might even be worth it to add these features to Wordpress' publishing code).

These are:
  • Anchors at every paragraph: the essay is long. When I sent it to some of my fellow programmers at work, the first thing they did was complain about how long it was. Since you don't usually read a 50-page essay in one sitting, you need a way to bookmark not only the page you are on (useless since everything is on one page) but where you are on that page. Victor takes advantage of HTML's ability to set anchor points within a page to anchor every paragraph, so that when you don't really want to read anymore, you can just click on the hash next to the paragraph (only visible when you mouse over the paragraph) and it will update your address bar to point at that paragraph. This is not exactly fancy modern technology; the entire feature just takes advantage of named anchors and the fact that when you click on one the browser doesn't reload the page. Extremely useful for long essays, yet I can't recall seeing this anywhere else.
  • Footnotes/endnotes are actually sidenotes. It's a bit annoying when you read an essay online that has footnotes/endnotes marked with a star or a cross or something and you have to scroll all the way to the bottom to see those notes (some posts make it easier by providing a link, but as vi users know, moving your hand to the mouse is a pain). It takes effort, and there is a disconnect between reading the text and navigating to wherever the note is. By putting the note in the margin next to the paragraph, it is much easier to just move your eyes to the note rather than scrolling or clicking. This comes at the cost of screen real estate, but if you make the actual content of your essay narrower you don't lose a whole lot, since there is still a natural flow when you're reading line-to-line.


I really like these UI tweaks because they are so incredibly simple, yet somehow are not very common practice. If ever I write something that is long and actually decent enough to slog through I hope to remember these little features to help people better read my stuff.

Jan 27, 2012

Thoughts on Stanford's Online Courses

Last year I enrolled in a couple of the extremely popular Stanford online classes: Intro to AI and Machine Learning. While many other people who took the courses also wrote on the topic I feel like none of the descriptions quite matched how I feel about the courses, so I think I will talk about them a little bit here.

First, the good things:
  • The connection - I feel like in a course like this you have a much closer connection to the professor than you do with recorded videos like the MIT OpenCourseWare lectures, or even in a real classroom (with the exception of smaller class sizes), since the professor is talking directly to the camera or the camera is filming some sort of drawing surface. The professors were all extremely passionate about their topics and talked about them with so much enthusiasm (especially Sebastian Thrun) that I couldn't help but feel more enthusiastic about the topics as well - although I didn't really need the help, I found the courses very interesting! Despite the fact that in reality the video method of teaching is extremely impersonal, it seems much more personal because it is like a one-on-one environment where the professor is talking just to you instead of to an entire classroom. Most of the time this makes you feel like you have to pay attention! Compare this to a class of 100+ students where if you fall asleep, zone out, or start doodling, it's not likely the professor will notice or care.

  • The flexibility - Picture scenario one: you're sitting in your living room in a comfy chair covered in a warm blanket sipping delicious beer. You're watching the online videos on your laptop while occasionally pausing the video to go get a snack, do some chores, or check on dinner.
    Now picture scenario two: you're sitting in a hard plastic chair in a windowless room dimly lit by the projection of PowerPoint slides onto the wall (you've all had a class like that, don't lie) at 8 in the morning. Some guy is chewing loudly in the seat behind you while someone, somewhere is rapidly tapping their pen against the desk.
    Which environment would you prefer? While these are two extreme examples, personally I would prefer the former.
  • The technology - I found the websites modern and very easy to navigate with no clutter distracting you from what you were looking for. The two main hiccups that I noticed with the technology:
    • With the AI class the videos wouldn't always skip ahead to the next video or to the quiz - probably because I was watching the videos under Linux. The first time it happened it took a little while to realize that you could click on the question mark to access the quiz.
    • With the ML class the video would appear in a "pop-up" div inside the window and the background would fade out, but if you clicked on the background it would close the video. This was annoying when you flipped to another window and then clicked on the browser to re-focus it, accidentally killing the video.
    These aren't huge problems, just little bugs or UI quirks that are expected with new software. I really enjoyed the submission system for the ML class where you could submit your assignments directly from within Octave by typing submit. I've had some programming classes do this type of thing and it really makes it easy to get your work done.



Some things I found so-so:
  • The content - I'm going to be a little harsh. The content was very interesting for both courses, however I found myself disappointed with the depth of the knowledge. It seemed like we were only scratching the surface of the topics without going into the detail or rigour that I would expect from a full university course. Then again, this is an introductory course and it has been a long time since I took an introductory course to anything. Maybe I'm just biased.



Inevitably, the downsides:
  • The assignments - I'm a person that learns by doing. I can see the things in the lecture and understand them fully, but everything tends to go in one ear and right out the other. When it comes to applying the concepts learned in the lecture I need to do it myself before the material really sinks in. In the AI class I found that the only assignment/quiz that I really connected with was the optional assignment for a simple decryption problem using NLP and statistics that was presented at the very end of the course. The rest of the assignments were just quizzes and while you do learn something by having to answer questions, I felt that the learning wasn't quite as deep on these topics as it could have been if we were actually supposed to program something.
    With the ML class they actually did have programming assignments in addition to quizzes, however with those I found that most of the time all you really had to do was translate the equations from the ones written in the assignment description to Octave and it would work. You weren't really learning and understanding what was going on, you were essentially just copying and pasting. I feel like I would have learned a lot more if the assignments had a bit more challenge to them.
  • Lack of interaction - Some of my favourite classes involved a good deal of interaction between the professor and/or the other members of the class (seminar or lab classes come to mind). In these types of classes the interaction component is a huge benefit to learning compared to the StackOverflow-style Q&A forums that they had for the AI class. Because of the online nature of the courses, there isn't really a solution to this problem that I'm aware of.


I will most certainly continue to spend free time taking these online classes; unfortunately, there are too many courses that I would like to take and not enough time to take them (I had this same problem back when I was in university). One thing that would be great to see is a Khan Academy style approach where you can take the class whenever you want instead of only during the semesters when they are offered.

In the very unlikely chance that any of the Stanford profs read this, thanks! I really have enjoyed the classes and will continue taking them for a long time yet.