Just over a year ago I joined another startup called sweetiQ. It's a step up from the other startups that I've worked for in that it has actually gone live and it is actually making money. It's also interesting because this is the longest amount of time I've spent working for a single company.
The system uses RabbitMQ to drive a distributed computing cluster that essentially implements the Actor model of computation, except that rather than working with threads within a single machine we're using multiple processes on multiple machines (100+). Each process has a set of endpoints that listen on different queues, and RabbitMQ manages dispatching messages to the different worker nodes.
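To make the setup concrete, here's a minimal sketch of what one such endpoint could look like, assuming the Python pika client and a made-up queue name and message format (our actual workers differ):

```python
import json

import pika  # assumes the Python pika client; illustration only

def handle_fetch_social_data(channel, method, properties, body):
    """Process one message, then ack it so RabbitMQ can dispatch the next."""
    payload = json.loads(body)
    print("fetching social data for account", payload.get("account_id"))
    # ... the actual work for this endpoint would happen here ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="fetch_social_data", durable=True)
channel.basic_qos(prefetch_count=1)  # hand this worker one message at a time
channel.basic_consume(queue="fetch_social_data",
                      on_message_callback=handle_fetch_social_data)
channel.start_consuming()
```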
One of my major contributions is what I'm calling a "distributed job control system." Each "job" the system runs consists of many steps, and each step may be handled by a different process on a different machine. As a particular computing job becomes more complex, several problems arise:
- managing the dependency structure between the various components of each job. In a simple sequential system you do A, then B, then C; the dependency structure is trivial: if x depends on y, do y before x. In a distributed system B may not depend on A, so you can run them in parallel, but if C depends on the output of both A and B then the system cannot start C until both A and B are done. It is not possible to simply have the node handling C wait for responses from A and B, because the messages from A and B may not be delivered to the same node that handles C. To complicate things further, in some cases C may not need the computation from A at all - for example, if we're aggregating social data from different networks but the user hasn't linked their Twitter account, we don't need to fetch data from Twitter. (The first sketch after this list shows the kind of bookkeeping this requires.)
- the possibility of failure increases - sometimes a node will lose its database connection. Sometimes a node will die (for example, we use Amazon spot instances that can shut down at any time). Sometimes a node that attempts to fetch something from the Internet will fail for whatever reason (the API version of Twitter's fail whale is something we've hit relatively frequently). The process needs some sort of graceful failure handling - often the solution is simply to try again later (see the second sketch after this list). Other times we need to send a message to the user informing them that they need to take some action (if a user changes their Facebook password, it invalidates the stored OAuth token).
- the need to rate-limit certain jobs - some jobs have many sub-components that can run in parallel, and we have run into problems when sending out too many messages at once. The first and most obvious problem we hit was that the database would choke, but once we learned how to use MongoDB properly this became a non-issue (having scaled both MySQL and MongoDB databases, I can tell you I am fully sold on MongoDB at this point). The bigger issue was process starvation: at peak times jobs spawn at an enormous rate, and if we keep sending parallel messages for stage 1, the computation nodes spend most of their time processing stage-1 messages (Rabbit has no concept of priority). The control system needs to detect starvation and alleviate it (there are a variety of ways to do this - the third sketch after this list shows one of them).
- recursive job types - our data is hierarchical. A user can categorize components of what they "own" into groups and sub-groups, and may want to break down aggregated information at each level. Each job may need to run itself separately on a child node, which may in turn need to run itself on its own children, and so on. Having an easy way to specify recursive sub-jobs saves a ton of time.
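To illustrate the first point: the dependency bookkeeping boils down to keeping a graph of which steps depend on which, and treating a step as ready once everything it depends on has either completed or been skipped. Here's a minimal sketch in Python; the step names and the shape of the graph are made up for illustration, not our actual format.

```python
def ready_steps(dependencies, completed, skipped):
    """Return the steps whose prerequisites have all completed or been skipped."""
    done = completed | skipped
    return [step for step, prereqs in dependencies.items()
            if step not in done and prereqs <= done]

# Hypothetical job: two fetches feeding one aggregation step.
dependencies = {
    "fetch_facebook": set(),
    "fetch_twitter": set(),
    "aggregate": {"fetch_facebook", "fetch_twitter"},
}

# The user never linked Twitter, so the controller marks that step skipped;
# "aggregate" becomes ready as soon as the Facebook fetch finishes.
print(ready_steps(dependencies,
                  completed={"fetch_facebook"},
                  skipped={"fetch_twitter"}))   # -> ['aggregate']
```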
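For the failure handling, the "try again later" strategy is essentially retry with exponential backoff plus a cap on attempts, after which the problem is surfaced to the user. A rough sketch, with invented limits:

```python
import random

MAX_ATTEMPTS = 5          # invented limit for illustration
BASE_DELAY_SECONDS = 30   # invented base delay

def handle_failure(step, attempt):
    """Decide what the controller does when a step reports a failure."""
    if attempt >= MAX_ATTEMPTS:
        # Give up and surface the problem to the user, e.g. an OAuth token
        # invalidated by a Facebook password change.
        return ("notify_user", None)
    # Exponential backoff with a little jitter so retries don't stampede.
    delay = BASE_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 10)
    return ("retry", delay)

print(handle_failure("fetch_twitter", attempt=2))  # ('retry', ~120-130 seconds)
print(handle_failure("fetch_twitter", attempt=5))  # ('notify_user', None)
```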
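And for starvation, one simple approach is for the controller to cap how many messages per stage are in flight and hold the rest back until results come in. A sketch of that idea (the cap and the interface are illustrative, not our actual code):

```python
from collections import deque

class StageThrottle:
    """Caps how many messages for one stage are in flight at once."""

    def __init__(self, max_in_flight=20):   # cap is an invented example value
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.backlog = deque()

    def submit(self, message, publish):
        """Publish immediately if under the cap, otherwise hold the message back."""
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            publish(message)
        else:
            self.backlog.append(message)

    def completed(self, publish):
        """Called when a worker reports a result; release one held-back message."""
        self.in_flight -= 1
        if self.backlog:
            self.in_flight += 1
            publish(self.backlog.popleft())
```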
What I ended up building is a system that takes a high-level description of a distributed job and controls that job across the different nodes. Rather than having the endpoints communicate directly with one another, they communicate with a central controller that tracks their progress and decides which control points to activate. The huge benefit of this is the separation of concerns: the endpoints can focus on doing their particular computation well, while "bigger picture" concerns like starvation and node failure are handled by the control system.
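To give a feel for what that high-level description might look like, here's a rough sketch: a declarative spec listing each step and what it comes after, plus a controller that dispatches steps as their prerequisites finish. The spec format and names are assumptions for illustration, not the actual format we use.

```python
# Hypothetical job description: step names and the "after" lists are made up.
JOB_SPEC = {
    "name": "refresh_location",
    "steps": {
        "fetch_facebook": {"after": []},
        "fetch_twitter":  {"after": []},
        "aggregate":      {"after": ["fetch_facebook", "fetch_twitter"]},
        "publish_report": {"after": ["aggregate"]},
    },
}

class JobController:
    """Tracks one running job and decides which steps to dispatch next."""

    def __init__(self, spec, dispatch):
        self.steps = spec["steps"]
        self.dispatch = dispatch   # e.g. publish a message onto a RabbitMQ queue
        self.done = set()
        self.dispatched = set()

    def start(self):
        self._dispatch_ready()

    def on_step_complete(self, step):
        """Called when a worker reports that a step has finished."""
        self.done.add(step)
        self._dispatch_ready()

    def _dispatch_ready(self):
        for step, cfg in self.steps.items():
            if step in self.dispatched or step in self.done:
                continue
            if set(cfg["after"]) <= self.done:
                self.dispatched.add(step)
                self.dispatch(step)

controller = JobController(JOB_SPEC, dispatch=print)
controller.start()                             # fetch_facebook, fetch_twitter
controller.on_step_complete("fetch_facebook")  # aggregate not ready yet
controller.on_step_complete("fetch_twitter")   # aggregate
```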
The system can handle recursive job structures - in fact, it can handle any type of sub-job: any job can trigger any other job as a child, including itself. This makes it trivially easy to run a single component of a job on its own, without going through the entire pipeline just to do the one thing that needs doing. It also helps you stay DRY: you can abstract out commonly-used components and use them as a "library" of sorts to compose more complex jobs.
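As a toy illustration of the recursive case, here's a sketch of a job that walks a group/sub-group tree by triggering itself as a child job on each sub-group. In the real system the child trigger would go through the controller as a message rather than a direct function call; the tree here is made up.

```python
def trigger_child_job(job_name, node):
    """Stand-in for asking the controller to start a child job for a sub-group."""
    run_job(job_name, node)

def run_job(job_name, node):
    print(f"[{job_name}] aggregating {node['name']}")
    for child in node.get("children", []):
        # The same job is triggered as a child on each sub-group; the parent
        # only counts as complete once all of its children have finished.
        trigger_child_job(job_name, child)

tree = {
    "name": "all locations",
    "children": [
        {"name": "east region", "children": [{"name": "store #12"}]},
        {"name": "west region"},
    ],
}

run_job("aggregate_group", tree)
```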
The code is not currently available; however, we are trying to work out the legal implications of open-sourcing the software. Ideally we'll figure all this out in the near future, and I'll be happy to release it for everyone to play with.
Shameless company promotion: If this type of work interests you, send me a message. Like most dev shops, we're always happy to bring in smart folks.