Monday, May 20, 2013

HotOS: Hot or not?

I didn't go to the biennial HotOS conference, which was held last week in New Mexico.

I am, however, extremely grateful to USENIX and to the sponsors and participants for making the technical session materials freely available to all.

There's a lot to dig through in the conference materials, and I'll share my thoughts on some of the work once I've had more of a chance to read it.

Meanwhile, Matt Welsh has stirred up a fair bit of discussion with his post-conference blog post: What I wish systems researchers would work on.

Of the 27 papers presented at the workshop, only about 2 or 3 would qualify as bold, unconventional, or truly novel research directions. The rest were basically extended abstracts of conference submissions that are either already in preparation or will be submitted in the next year or so. This is a perennial problem for HotOS, and when I chaired it in 2011 we had the same problem. So I can't fault the program committee on this one -- they have to work with the submissions they get, and often the "best" and most polished submissions represent the most mature (and hence less speculative) work.

Welsh has been involved with the conference for years; moreover, given his well-known move from a faculty position in Harvard's Computer Science department to lead researcher at Google, he has a fascinating background in both the academic and industrial research arenas.

So his thoughts are worth considering.

And yet, I must say that I find Welsh's criticisms to ring hollow.

It doesn't seem right to critique a topic by whether it is brand new, or whether it "has been a recurring theme" that involves "problems that are 25 years old." In some ways, I think it is a mark of Computer Science's maturity that we've stopped completely changing our entire perspective every 18 months, and are instead returning to deep problems of enduring interest.

For example, the observation by an IBM research team that the last two decades have seen dramatic progress in network I/O performance but a dramatic lack of progress in storage I/O performance seems like a great topic for the operating systems community to be discussing. Why, just last week I was digging into continued research in the network I/O world, while the disk storage world appears to be stuck with the bloated, layered, horrifically complex implementations of gigantic enterprise companies that ladle on the complications while ratcheting up the price: it's not uncommon for an enterprise SAN device to cost seven or eight figures, as much as 1000 times the cost of the servers that are trying to process that imprisoned data.

And when Welsh outlines the areas he considers important, he notes that:

A typical Google-scale system involves many interacting jobs running very different software packages each with their own different mechanisms for runtime configuration: whether they be command-line flags, some kind of special-purpose configuration file (often in a totally custom ASCII format of some kind), or a fancy dynamically updated key-value store.

and also that:

the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.

I couldn't agree more with Welsh's points, which makes me wonder how he reacted to the session from the RAMCloud team: Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code. The RAMCloud developers ran smack into these problems, and felt that pain:

For example, when a server failure notification is received by a server, several of its segments are affected, and each affected segment replica can be in a different state. Some replicas may have an RPC outstanding to the failed server and must abort the RPC and restart replication elsewhere instead of expecting a response. Other affected replicas may not be consistent with their counterparts; such replicas require contacting the cluster coordinator before starting recreation in order to prevent inconsistencies.

Now, what the RAMCloud team proposes is not so new or trendy; again, the developers look back two decades, to the dawn of object-oriented design and the "patterns" approaches that proved so successful in the 1990s:

Although the implementations were different in many respects, we eventually noticed a common theme; each of these modules contained a set of rules that could trigger in any order. We gradually developed a pattern for DCFT code based on three layers: rules, tasks, and pools. This particular pattern has worked for a variety of problems in RAMCloud. We believe that this pattern, or something like it, might provide a convenient way of structuring DCFT modules in general.
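
To make that concrete, here's a minimal sketch of how I read the rules/tasks/pools idea, applied to the segment-replica scenario quoted above. Everything here, the type names, the fields, the flags, is my own invention for illustration; the paper describes the pattern, not this particular API.

    // A sketch of the rules/tasks/pools pattern (illustrative only, not
    // RAMCloud's actual code). A "task" records the current and goal state,
    // "rules" are (condition, action) pairs that may fire in any order, and
    // a "pool" repeatedly applies whichever rules apply until the task is done.
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Task {
        // Invented fields modeling one affected segment replica.
        bool rpcOutstanding = true;   // an RPC is pending to the failed server
        bool consistent     = false;  // replica agrees with its counterparts
        bool rereplicated   = false;  // data safely copied to a new server
        bool done() const { return !rpcOutstanding && consistent && rereplicated; }
    };

    struct Rule {
        const char* name;
        std::function<bool(const Task&)> applies;  // condition
        std::function<void(Task&)> apply;          // action
    };

    // The pool: keep firing applicable rules until the goal state is reached.
    // A real system would be driven by events (RPC completions, failure
    // notifications) rather than by this busy loop.
    void runPool(Task& task, const std::vector<Rule>& rules) {
        while (!task.done()) {
            bool fired = false;
            for (const auto& rule : rules) {
                if (rule.applies(task)) {
                    std::printf("firing rule: %s\n", rule.name);
                    rule.apply(task);
                    fired = true;
                }
            }
            if (!fired) break;  // no rule applies; wait for more events
        }
    }

    int main() {
        std::vector<Rule> rules = {
            { "abort RPC to failed server",
              [](const Task& t) { return t.rpcOutstanding; },
              [](Task& t) { t.rpcOutstanding = false; } },
            { "contact coordinator to resolve inconsistency",
              [](const Task& t) { return !t.rpcOutstanding && !t.consistent; },
              [](Task& t) { t.consistent = true; } },
            { "re-replicate segment on a new server",
              [](const Task& t) { return t.consistent && !t.rereplicated; },
              [](Task& t) { t.rereplicated = true; } },
        };
        Task replica;
        runPool(replica, rules);
        std::printf("replica recovered: %s\n", replica.done() ? "yes" : "no");
    }

What appeals to me is that each rule pairs one condition with one action and doesn't care what order events arrive in, so the combinatorial explosion of replica states the RAMCloud quote describes collapses into a flat list of small, independently testable rules.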

It's important to look to the future, but it's also important to learn from and build on the past. Some old techniques are sound, and we shouldn't jettison them just because they're old (of course, remember you're getting this from a 52-year-old systems programmer who still codes 10 hours a day in C!).

So keep pushing the envelope, but let's not excoriate those who try to help us learn from proven decades-old approaches and bring that wisdom to the problems of modern software.
