All That . . . And a Pony!

Last night I gave a talk at BBLISA about GlusterFS, HekaFS, and cloud filesystems generally. It was a great time with a great group, and I thank everyone involved. One thing that did come up during the talk had to do with this slide about what I consider to be HekaFS’s most important future feature.

Someone in the audience asked, “Can I have a pony with that?” Everyone laughed, including me. I mentioned that I had actually proposed “PonyFS” as the name for CloudFS, even before the HekaFS debacle. Everyone laughed again. Here’s the thing, though: as ambitious as that agenda might seem, it’s entirely within the realm of possibility. The first three features are common across all of the Dynamo-derived data stores such as Riak, Voldemort, and Cassandra. In the filesystem world, the best-known examples of doing something like this are AFS and its descendants such as Coda and Intermezzo. Bayou and Ficus are less well known, and that’s kind of sad. If you’re at all interested in this area, you must read up on those; if you can’t see how Bayou inspired half of the “new” ideas in Dynamo (and thus in all of its derivatives), then read them again until you get it. Even the old-school database dullards have done this multiple times. Really, the techniques for doing that part are pretty well known.

The claim about caching as a special case of replication might be a bit more controversial, but I don’t think it’s hard to explain. With full “push mode” optimistic replication, one side assumes that the other wants some data, and won’t throw away that data once received. Both of those assumptions can be weakened. If you don’t know that a peer wants that data, don’t send it. They might have an out-of-date copy or they might not have it at all, but either way that’s explicitly what they want. If you don’t know that a peer will keep the data, don’t assume they’ve done so for purposes of ensuring your target replica count. Pulling an object into your cache thus becomes a simple matter of expressing an interest in it, which automatically triggers transmission of its current state. When you want to shut off the flow of updates, whether you’re keeping a copy or not, you tell your peers you’re no longer interested. It’s easy enough to “mix and match” full replicas with caches in this model, with only full replicas considered for data-protection purposes but updates also pushed to caches when that’s appropriate. Once you’re using a common framework, it’s even possible to do fancier things like express interest only if the last transmitted version is more than N seconds old.
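
To make that mix-and-match idea a bit more concrete, here’s a minimal Python sketch of the kind of framework I’m describing. None of this is HekaFS code; the names (ReplicationSource, express_interest, and so on) are made up for illustration. The point is just that a cache and a full replica differ only in how their interest is registered: both get pushed updates, but only full replicas count toward the data-protection target, and the “only if it’s more than N seconds old” trick falls out almost for free.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Interest:
    """One peer's subscription to one object's updates."""
    peer: "Peer"
    is_full_replica: bool      # counts toward the data-protection target
    min_age: float = 0.0       # only push if the last push was at least this long ago
    last_pushed: float = 0.0   # when we last sent this peer an update


@dataclass
class Peer:
    """A remote node, full replica or cache alike; it just receives pushes."""
    name: str
    store: dict = field(default_factory=dict)

    def receive(self, key, value, version):
        self.store[key] = (value, version)


class ReplicationSource:
    """One node's view of who is interested in what, and how updates get pushed."""

    def __init__(self, target_replicas):
        self.target_replicas = target_replicas
        self.interests = {}    # key -> list of Interest
        self.data = {}         # key -> (value, version)

    def express_interest(self, key, peer, full_replica=False, min_age=0.0):
        """Registering interest immediately triggers a push of the current state."""
        entry = Interest(peer, full_replica, min_age)
        self.interests.setdefault(key, []).append(entry)
        if key in self.data:
            value, version = self.data[key]
            peer.receive(key, value, version)
            entry.last_pushed = time.monotonic()

    def withdraw_interest(self, key, peer):
        """A peer that no longer cares (cache eviction, say) just stops the flow."""
        self.interests[key] = [i for i in self.interests.get(key, [])
                               if i.peer is not peer]

    def write(self, key, value):
        """Apply a local write, then push it to every interested peer that's due."""
        _, old_version = self.data.get(key, (None, 0))
        version = old_version + 1
        self.data[key] = (value, version)
        now = time.monotonic()
        for entry in self.interests.get(key, []):
            if now - entry.last_pushed >= entry.min_age:
                entry.peer.receive(key, value, version)
                entry.last_pushed = now

    def protected(self, key):
        """Only full replicas count when checking the data-protection target."""
        full = sum(1 for i in self.interests.get(key, []) if i.is_full_replica)
        return full >= self.target_replicas


if __name__ == "__main__":
    src = ReplicationSource(target_replicas=2)
    replica = Peer("replica-1")
    cache = Peer("nearby-cache")
    src.write("file-A", b"v1")    # nothing pushed yet; nobody has expressed interest
    src.express_interest("file-A", replica, full_replica=True)   # pulls current state
    src.express_interest("file-A", cache, min_age=5.0)           # ditto, but rate-limited
    src.write("file-A", b"v2")    # pushed to replica-1; cache skipped, pushed <5s ago
    print(src.protected("file-A"))    # False: one full replica, target is two
```

The asymmetry is deliberate: from the source’s point of view, a cache is just a replica whose copy it has promised nothing about.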

If it were all just this simple, though, people wouldn’t have laughed. In fact, I wouldn’t have been showing this slide because it wouldn’t be new functionality. The reason it’s not common already is that asynchronous write-anywhere replication doesn’t come for free. Some tradeoff has to be made, and anyone who has seen or heard me go on about the CAP Theorem has probably already guessed what that is: consistency. A system like this has lots of nice features, but strong or even predictable consistency is not one of them. Sites will be out of date with respect to one another, perhaps only for milliseconds or perhaps for days if there’s a major network disruption. For some applications that’s unacceptable; for others it’s well worth it because of the performance and availability advantages of doing things this way. Some people in the second group don’t even care if the replication maintains any semblance of the original operation order, but I’d say that the most common need is for replication that’s asynchronous but still ordered to some degree.

There’s a whole continuum here. At one extreme you have total ordering across an entire dataset (e.g. a filesystem volume). This is almost as strong a guarantee as full consistency. It’s also almost as difficult or expensive to implement. At the other extreme is ordering only within objects (e.g. files). In between you have various ways of grouping objects into multiple ordered replication streams, either “naturally” (e.g. an entire directory hierarchy becomes one stream) or by explicit individual assignment of objects to streams.

In HekaFS, the highly tentative plan is to support multiple named streams, with everything assigned by default to one stream per volume. Namespace operations would always remain within that default stream; if you want a completely separate stream including namespace operations, create a separate volume and mount it on top of the first. File operations can be directed into a different stream by tagging the file, or by inheriting a tag from an ancestor directory. This supports practically all of the useful ordering models, from very tight to very loose, though I wouldn’t promise that the implementation will really scale up to millions of files with an explicit separate stream for each.
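
Since that plan is explicitly tentative, here’s only a rough Python sketch of how the routing rules might look; the names (StreamRouter, tag, submit) are invented for illustration and don’t correspond to any HekaFS code or API. Namespace operations always land in the volume’s default stream, while data operations follow an explicit tag on the file or one inherited from the nearest tagged ancestor directory.

```python
from collections import defaultdict, deque
import posixpath


class StreamRouter:
    """Route filesystem operations into ordered replication streams.

    Rules, per the tentative plan: namespace operations always go to the
    volume's default stream; data operations follow the file's explicit
    stream tag, or a tag inherited from the nearest tagged ancestor
    directory, falling back to the volume default.  Order is preserved
    within each stream; nothing is implied about order across streams.
    """

    NAMESPACE_OPS = {"create", "mkdir", "rename", "unlink", "rmdir"}

    def __init__(self, volume):
        self.default_stream = f"{volume}:default"
        self.tags = {}                      # path -> stream name
        self.streams = defaultdict(deque)   # stream name -> ordered operations

    def tag(self, path, stream):
        """Explicitly assign a file or directory subtree to a named stream."""
        self.tags[path] = stream

    def stream_for(self, path):
        """Walk up the directory tree looking for an inherited tag."""
        while True:
            if path in self.tags:
                return self.tags[path]
            parent = posixpath.dirname(path)
            if parent == path:              # reached the root: no tag anywhere
                return self.default_stream
            path = parent

    def submit(self, op, path, **detail):
        """Append the operation to its stream and report which one it landed in."""
        if op in self.NAMESPACE_OPS:
            stream = self.default_stream    # namespace ops never leave the default
        else:
            stream = self.stream_for(path)
        self.streams[stream].append((op, path, detail))
        return stream


if __name__ == "__main__":
    router = StreamRouter("vol0")
    router.tag("/projects/big-db", "vol0:db")     # loosen ordering for one subtree
    print(router.submit("mkdir", "/projects/big-db/logs"))            # vol0:default
    print(router.submit("write", "/projects/big-db/logs/x.log",
                        data=b"..."))                                 # vol0:db (inherited)
    print(router.submit("write", "/home/me/notes.txt", data=b"..."))  # vol0:default
```

Tagging a directory rather than every file individually is what keeps the “millions of streams” case from being the default.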

The one thing I haven’t addressed yet is how conflicts are resolved, when clients in two places really do update the same file either simultaneously or during a single partition event. That’s a complex enough topic that it’s best left to a separate post. What I hope I’ve been able to do here is explain some of why I believe the feature set represented by that slide is both useful and achievable, and what tradeoffs or pitfalls people need to be aware of to take advantage of it (once it actually exists).