Jun 1, 2008

Frameworks Galore

It seems like one of the cool things to do these days is to roll a web framework. Now that Rails has popularized this notion, frameworks seem to be coming out of the woodwork.

My company built a framework to manage several sites at once. The framework was good, although it was rather complex.
I started work on a new high-traffic site, and the framework just didn't cut it. It didn't have great support for memcached or our distributed cluster system.

So I rolled my own framework designed for high-traffic websites. It's based on the MVC (Model-View-Controller) architecture and uses the active record pattern for working with the database. Much of this functionality was inspired by Rails. Oh, and unfortunately it's done with PHP.

I dropped some of Rails' "convention-over-configuration" idea. When you have a high-traffic site that is heavy on database usage, you can't just go "set my username to this, set my password to this, and choose this DB". You have to actually set up connections with some logic behind them. My system uses clustered MySQL databases in a master-slave architecture, with vertical partitioning: a table is split up by its columns into several tables, and sometimes those tables are put on different database clusters. This means a few things. You can't do writes (UPDATE/INSERT/DELETE) to a slave database, and you can't JOIN two tables that live on separate databases. So the framework has to pick a connection based on two things: whether the query needs a master or a slave, and which table you want to query.
Since choosing between these databases is rather complex, I set it up so that there is an abstract DB class that manages the basic things, and you inherit from it to get more specific functionality. I have a default derived class that uses a table map, which maps a table name to a connection number. When the base DB class requests a connection from the derived class (for a query or whatever), the derived class figures out which connection matches the table and the need for a master or slave. It lazily loads the connection resources, so a connection to a particular database is not made until the base DB class actually requests it.
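
Here is a rough sketch of what that routing could look like. I can't show the original code, so every name here (DB, TableMapDB, connectionFor, the cluster layout) is made up for illustration, and the idea of detecting writes by looking at the SQL verb is just an assumption. The connector is passed in as a callable so the example runs without a real MySQL server; in real use it would open an actual connection.

```php
<?php
// Hypothetical names throughout; the original framework's API is not shown in the post.

abstract class DB
{
    // Subclasses pick the connection for a table and a master/slave need.
    abstract protected function getConnection(string $table, bool $needsMaster);

    // Shows only the routing decision; a real class would also run the query.
    public function connectionFor(string $table, string $sql)
    {
        // Assumption: writes are detected from the SQL verb and sent to a master.
        $needsMaster = preg_match('/^\s*(INSERT|UPDATE|DELETE|REPLACE)\b/i', $sql) === 1;
        return $this->getConnection($table, $needsMaster);
    }
}

class TableMapDB extends DB
{
    private array $tableMap;   // table name => cluster number
    private array $clusters;   // cluster number => ['master' => dsn, 'slave' => dsn]
    private $connect;          // callable(string $dsn) that opens a connection
    private array $open = [];  // dsn => connection that has already been opened

    public function __construct(array $tableMap, array $clusters, callable $connect)
    {
        $this->tableMap = $tableMap;
        $this->clusters = $clusters;
        $this->connect  = $connect;
    }

    protected function getConnection(string $table, bool $needsMaster)
    {
        if (!isset($this->tableMap[$table])) {
            throw new InvalidArgumentException("No cluster mapped for table '$table'");
        }
        $cluster = $this->clusters[$this->tableMap[$table]];
        $dsn = $needsMaster ? $cluster['master'] : $cluster['slave'];

        // Lazy loading: connect only the first time this database is needed.
        if (!isset($this->open[$dsn])) {
            $this->open[$dsn] = ($this->connect)($dsn);
        }
        return $this->open[$dsn];
    }
}

$opened = [];  // records which databases were actually connected to
$db = new TableMapDB(
    ['users' => 0, 'photos' => 1],
    [
        0 => ['master' => 'users-master', 'slave' => 'users-slave'],
        1 => ['master' => 'photos-master', 'slave' => 'photos-slave'],
    ],
    function (string $dsn) use (&$opened) {
        $opened[] = $dsn;      // a real connector would open a MySQL connection here
        return "conn:$dsn";
    }
);

$read  = $db->connectionFor('users', 'SELECT * FROM users WHERE id = 1');
$write = $db->connectionFor('photos', "UPDATE photos SET title = 'x' WHERE id = 2");
$again = $db->connectionFor('users', 'SELECT * FROM users WHERE id = 3');
```

Only two connections get opened here: the second read on users reuses the slave connection, and neither the users master nor the photos slave is ever touched.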

The other major change is around using memcached. I set it up so that model objects can automatically use memcached as a data store alongside the database, by inheriting from the CachedModel class instead of the Model class (CachedModel inherits from Model, so you still get all of Model's functionality). The rows for such a model are stored in memcached when they are requested, and the memcached copy is updated whenever the row is modified. In fact, the CachedModel version of save() (which normally performs an INSERT/UPDATE) by default does not write to the database unless a particular heuristic decides that it is time to save (this can be overridden by passing true to save()). This means that very few writes actually go to the database; most go directly to memcached instead. This makes it far faster, and you notice live updates on the site. For example, if you have a profile_viewed field for a user, you can sit there refreshing and watch the count increase. Since the object is stored in a central location, any other place that displays the profile_viewed field will show the increase too.
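
To make the save() behaviour concrete, here is a minimal sketch. Plain arrays stand in for memcached and MySQL so it runs on its own, and the heuristic (flush to the database on every 10th save) is purely an assumption, since the real one isn't described here. Apart from Model, CachedModel and save(), the names are hypothetical.

```php
<?php
// Hypothetical sketch; arrays stand in for memcached and the database.

class Model
{
    public static array $db = [];      // stand-in for the table: id => row
    public static int $dbWrites = 0;   // counts writes that reach the "database"

    public int $id;
    public array $fields;

    public function __construct(int $id, array $fields)
    {
        $this->id = $id;
        $this->fields = $fields;
    }

    // Plain models always write to the database; $force is unused here.
    public function save(bool $force = false): void
    {
        self::$db[$this->id] = $this->fields;
        self::$dbWrites++;
    }
}

class CachedModel extends Model
{
    const FLUSH_EVERY = 10;              // assumed heuristic, not the original one
    public static array $cache = [];     // stand-in for memcached: key => row
    private static array $pending = [];  // key => saves since the last DB write

    protected function cacheKey(): string
    {
        return static::class . ':' . $this->id;
    }

    public function save(bool $force = false): void
    {
        $key = $this->cacheKey();
        self::$cache[$key] = $this->fields;   // the cache is always updated
        self::$pending[$key] = (self::$pending[$key] ?? 0) + 1;

        // Write to the database only when forced or when the heuristic fires.
        if ($force || self::$pending[$key] >= self::FLUSH_EVERY) {
            parent::save();
            self::$pending[$key] = 0;
        }
    }
}

$user = new CachedModel(1, ['profile_viewed' => 0]);
for ($i = 0; $i < 25; $i++) {
    $user->fields['profile_viewed']++;
    $user->save();
}
```

After the loop the cached row is fully up to date, while the database has only been written on the 10th and 20th save. Calling $user->save(true) would push the latest value through immediately.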
EDIT: As the site grew, an issue with race conditions came up. This is because there is an amount of time between when the object is taken out of memcached, processed by a script and put back in. If the script is being accessed often enough, then you run into problems as data can get clobbered. I'm not working for the same company any more so I'm not sure what they did to fix this, but I assume it had something to do with memcached's atomic increment function. You'd have to slightly tweak the CachedModel object to add this functionality.
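
The following sketch shows the kind of clobbering I mean, and how an atomic increment would avoid it. FakeCache is a made-up in-memory stand-in so the example runs without a memcached server; its get/set/increment methods mirror the method names of PHP's Memcached extension, but the interleaving is simulated by hand. Whether the company fixed it this way is only my guess.

```php
<?php
// In-memory stand-in for a memcached client (hypothetical class).
class FakeCache
{
    private array $data = [];

    public function get(string $key)
    {
        return $this->data[$key] ?? false;
    }

    public function set(string $key, $value): bool
    {
        $this->data[$key] = $value;
        return true;
    }

    // On a real memcached server the increment is applied by the server
    // itself, so no other client can slip in between the read and the write.
    public function increment(string $key, int $offset = 1)
    {
        if (!isset($this->data[$key])) {
            return false;
        }
        return $this->data[$key] += $offset;
    }
}

$cache = new FakeCache();
$key = 'user:1:profile_viewed';

// Two requests interleave a read-modify-write: both read 10, both write 11.
$cache->set($key, 10);
$a = $cache->get($key);      // request A reads
$b = $cache->get($key);      // request B reads before A has written
$cache->set($key, $a + 1);   // A writes
$cache->set($key, $b + 1);   // B overwrites A's update
$lost = $cache->get($key);

// The same two requests using increment: neither update is lost.
$cache->set($key, 10);
$cache->increment($key);
$cache->increment($key);
$kept = $cache->get($key);
```

Two profile views came in both times, but the read-modify-write version only counted one of them. This only helps for counters; whole objects that get read, changed and written back would need some other approach.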
Queries can also be stored in memcached. When you call findAll(), you have the option of passing a cache name. If you do, the query result is stored in memcached under that name. To save on space, the query only fetches the ids from the database, and then fetches the objects themselves from memcached (if they are there; otherwise it gets them from the database).
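
Roughly, the two-step lookup could work like this. Again the cache and the table are plain arrays so the sketch runs by itself, a callable stands in for the WHERE clause, and rows are plain arrays rather than model objects. The Photo class, the key format and the findAll() signature are all assumptions.

```php
<?php
// Hypothetical sketch with array stand-ins for memcached and the database.

class Photo
{
    public static array $cache = [];    // stand-in for memcached
    public static array $table = [      // stand-in for the photos table
        1 => ['id' => 1, 'title' => 'Beach'],
        2 => ['id' => 2, 'title' => 'Forest'],
        3 => ['id' => 3, 'title' => 'City'],
    ];
    public static int $rowFetches = 0;  // rows loaded from the "database"

    // $filter stands in for a WHERE clause; $cacheName is optional.
    public static function findAll(callable $filter, ?string $cacheName = null): array
    {
        // Step 1: get the list of matching ids, from the cache if possible.
        $ids = $cacheName !== null ? (self::$cache[$cacheName] ?? null) : null;
        if ($ids === null) {
            // Equivalent of a "SELECT id ..." query: only ids come back.
            $ids = array_keys(array_filter(self::$table, $filter));
            if ($cacheName !== null) {
                self::$cache[$cacheName] = $ids;
            }
        }

        // Step 2: load each row from the cache, falling back to the database.
        $rows = [];
        foreach ($ids as $id) {
            $key = "photo:$id";
            if (!isset(self::$cache[$key])) {
                self::$cache[$key] = self::$table[$id];
                self::$rowFetches++;
            }
            $rows[] = self::$cache[$key];
        }
        return $rows;
    }
}

$notCity = fn(array $row) => $row['title'] !== 'City';
$first  = Photo::findAll($notCity, 'photos:not_city');
$second = Photo::findAll($notCity, 'photos:not_city');
```

The second call never touches the table: the id list comes from the named cache entry and each row comes from its own cache entry.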
Note that all Model/CachedModel objects need an id field as the primary key. This makes the framework much simpler. If you have a table that does not follow this pattern (like a friends table, or favourite photos) then you'll have to use SQL. Fortunately each class inherited from Model gives you a findBySQL() function which returns objects of that type.

The views are very basic. The controller object has a data field, which is an associative array mapping the name of each view variable to its value. When you're in the view, you can access the variable directly by name:
Controller:
$this->data["myObj"] = "Hello";

View:
<?= $myObj ?>
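
One way this mapping could be implemented is with PHP's extract() and output buffering. This is just a sketch under that assumption, not the framework's actual code; render() is a hypothetical method name, and the view is written to a temporary file only so the example is self-contained.

```php
<?php
// Hypothetical sketch of turning the controller's data array into view variables.

class Controller
{
    public array $data = [];

    // Renders a view file and returns its output as a string.
    public function render(string $viewFile): string
    {
        // extract() creates a local variable for each key of $data, so
        // $this->data["myObj"] is visible to the view as $myObj.
        // EXTR_SKIP keeps existing locals such as $viewFile from being overwritten.
        extract($this->data, EXTR_SKIP);
        ob_start();
        include $viewFile;
        return ob_get_clean();
    }
}

// Write a one-line view to a temporary file.
$viewFile = tempnam(sys_get_temp_dir(), 'view');
file_put_contents($viewFile, '<p><?= htmlspecialchars($myObj) ?></p>');

$controller = new Controller();
$controller->data["myObj"] = "Hello";
$html = $controller->render($viewFile);
unlink($viewFile);
```
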
I also added a plugins feature, which is like a mini controller/view setup. It's used for small bits of functionality that you might use across your site. An example would be Facebook's commenting: you can comment on a profile (the wall), on photos, on videos, etc. This would be wrapped up in a plugin, so that in your controller you just go loadPlugin("pluginName", new MyPlugin()) and in the view: $this->plugins["pluginName"]->display(). It's handy.
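
A stripped-down sketch of the plugin idea follows. Only loadPlugin(), the plugins array and display() come from the description above; the Plugin base class, the CommentsPlugin example, and having display() return a string instead of printing are assumptions made to keep the example short.

```php
<?php
// Hypothetical sketch of a plugin registry on the controller.

abstract class Plugin
{
    // Each plugin knows how to render its own small piece of markup.
    abstract public function display(): string;
}

class CommentsPlugin extends Plugin
{
    private array $comments;

    public function __construct(array $comments)
    {
        $this->comments = $comments;
    }

    public function display(): string
    {
        // Escape each comment and wrap it in a list item.
        $items = array_map(
            fn(string $c) => '<li>' . htmlspecialchars($c) . '</li>',
            $this->comments
        );
        return '<ul>' . implode('', $items) . '</ul>';
    }
}

class Controller
{
    public array $plugins = [];

    // Registers a plugin under a name so the view can look it up.
    public function loadPlugin(string $name, Plugin $plugin): void
    {
        $this->plugins[$name] = $plugin;
    }
}

$controller = new Controller();
$controller->loadPlugin("comments", new CommentsPlugin(["Nice photo!", "1 < 2"]));

// What a view would do with the loaded plugin:
$html = $controller->plugins["comments"]->display();
```

The same CommentsPlugin could then be loaded by the profile, photo and video controllers without any of them knowing how comments are rendered.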

Finally, one other thing I think is cool about it is that it supports what I call shared applications. This means that you can have an application in one place, another application in another place, and have them share common things like model objects, plugins, etc. but have different controllers/views, connect using different database credentials, etc.

This framework is not really designed to "baby" the programmer. If you're afraid of SQL or the command line, then it's probably not for you. There are many cases in this framework where you have to use SQL, and other cases where you might have to go to the command line. However, if you're working on a very high-traffic site, you should be comfortable with these things anyway.

Unfortunately at the moment I can't release the framework as it is owned by my company and not by me. However I'm planning on rewriting it under an open-source license, probably MIT. If anybody has some suggestions, feel free to let me know.

2 comments:

Guillaume Theoret said...

Pretty cool. Basically you implemented an identity map =)

That was the next step for the framework Tim and I wrote. (Since ours really only does query caching)

You said that you store all the objects in a centralized location so do you have only one centralized memcache server that all web servers read and update or do you split the objects across multiple servers?

The other way is to run a memcache daemon on each local server and one (or more) centralized memcache servers to write to that will then broadcast all writes to individual servers.

So imagine you're using 10 servers. You might want 2 centralized memcache machines in case of failover. Someone connects and ends up on server 5. He does a write (page view) and server 5 updates its copy (so that the number increments immediately) and asynchronously sends a message to memcache1 to update its copy as well. memcache1 updates its copy and broadcasts the new value to all 10 servers and memcache2. This way all reads are local and all writes are distributed. If a write to memcache1 fails you write to memcache2.

Rob Britton said...

The framework doesn't really care about how the memcached servers are allocated, although it does require that all the pools are synchronized somehow.

Although I'm not 100% sure how it works - I didn't set it up - our site uses a distributed memcached system across the web servers. I think the server that each object is on depends on the key of that object.

The problem with having all the data on all the servers is that there is some synchronization delay (usually a few seconds but can be several minutes at peak times), and also there is a lot of redundancy. We ran into an issue where the memcached servers actually ran out of memory, which is why we switched over to the distributed system.

We have it set up so that every time vital info gets saved, it gets saved to the database in addition to memcached so that if there is a memcached failure, it doesn't really matter. Things like profile views or pages viewed aren't completely vital, so in the case that there is some data loss I'm not too concerned.