Ara Howard: Ruby Queue

Phil Windley // Fri Mar 16 09:58:00 2007 // ruby supercomputing

Ara T Howard
(click to enlarge)

Ara T Howard, a research associate at The Cooperative Institute for Research in Environmental Sciences, is speaking about Ruby Queue, a tool for distributing the workload to nodes in a Linux cluster. He wanted something lean and fast and considered the existing packages like openMosix too heavyweight.

The queue doesn't do scheduling--the type of work his group does processes long jobs that use lots of nodes. He determined to build something extremely simple, an NFS mounted priority queue that nodes could pull jobs from as needed. NFS has lots of cruftiness, but the truth was that they were already using it for other tasks, so one more probably wasn't a big increase in risk.

The first issue was whether Ruby could handle the NFS-safe locking for the shared storage. He wrote a "little C extension, posixlock, which extends Ruby's built-in File class with a method to apply fcntl, or POSIX-style, advisory locks to a File object." This is the standard way of making languages that aren't C do C-like things. I've done the same thing in Scheme.

Ara noticed that NFS locking wasn't fair. That means that you have to be smart about how you attempt to get the lock or processes can block. He implemented a leasing policy (that helps avoid starvation) and a method for polling to increasing back-off of requests. His studies show that over a long period (days) processes get good interleaving and the system achieves good throughput.

Ara used SQLite, along with the SQLite Ruby bindings as the back end storage system. He discovered that this combination is very robust and recovers from errors very well.

When a node accesses the queue and figures out what to do, the job needs to be run, but some technical difficulties with SQLite made forking a non-starter. Ara used Distributed Ruby to run the job, essentially forking, without forking. A simple JobRunnerDaemon object encapsulates the functionality.

One little piece of NFS-fu that Ara mentioned that I thought was pretty cool (nothing to do with Ruby) is his use of hard links to robustly pause the whole cluster when the NFS serve goes down. Anyone with NFS experience will know that you use soft links on your workstation so that it doesn't hang when the NFS server goes down. But that's the precise behavior that Ara wanted.

Ara's Linux Journal article has a good example showing how this is all used. It's easy to see how you could grab this could and in short order take a collection of Linux boxes, linked by NFS, to create a cluster that processes large, compute intensive jobs. Very cool.

The whole project is a good example of the elegance of Ruby. From the ability to extend the language with C to the way mixins make it easy to augment functionality of classes.