What Can a Paper Shredder Teach Us About Big Data?

by Irshad Raihan, Red Hat Storage – Big Data Product Marketing

The trusty paper shredder in my home office died last week. I’m in the market for a new one. Years ago, when I purchased “Shreddy” (of course, it had a name) after a brief conversation with a random store clerk, choices were few and information scarce. In fact, paper shredders weren’t really considered standard personal office equipment as they are today. Most good shredders were built for offices, not homes. Back in the market more than a decade later, it’s clear that the search for a new shredder is going to be trickier than I had imagined.

A paper shredder is a lot like big data.


Why Thomas Jefferson could predict the next big thing in storage

In 1814, Thomas Jefferson donated the contents of his vast personal library of books and correspondence to form the foundation of the Library of Congress. Some 200 years later, that library is one of the largest in the world. Yet, the text of all of its contents could fit on a stack of DVDs that would reach to the top of a two-story building.


Clouds Cannot Be Contained In A Box

A big part of the cloud’s value proposition is continuous access to your data, free of the physical limitations of a single box, a single data center, or a single geography. Moving to the cloud not only allows greater leverage of compute and storage resources; it also lets customers leave aging, monolithic servers and storage behind and reach their data regardless of where they are or what technical issues may be occurring behind the scenes. Cloud is supposed to be always on, with resources available on demand, 24×7. So how do you deliver all of this with a cloud that’s been imprisoned in a box? Clouds cannot be contained in a box.

Quorum Enforcement

As of yesterday, my most significant patch yet became a real part of GlusterFS. It’s not a big patch, but it’s significant because what it adds is enforcement of quorum for writes. In operational terms, what this means is that – if you turn quorum enforcement on – the probability of “split brain” problems is greatly reduced. It’s not eliminated entirely, because clients don’t see failures at the same time and might take actions that lead to split brain during that immediate post-failure interval. There are also some failure conditions that can cause clients to have persistently inconsistent models of who has quorum and who doesn’t. Still, for 99% of failures this will significantly reduce the number of files affected by split brain – often down to zero. What will happen instead is that clients attempting writes (actually any modifying operation) without quorum will get EROFS instead. That might cause the application to blow up; if that’s worse for you than split brain would be, then just don’t enable quorum enforcement. Otherwise, you have the option to avoid or reduce one of the more pernicious problems that affect GlusterFS deployments with replication.
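
To make the failure mode concrete, here’s a minimal sketch (mine, not anything from the patch) of what an application on a replicated GlusterFS mount might see once quorum is lost and enforcement is on. The mount path and file name are made up:

/* Sketch only: illustrates the EROFS behavior described above.
 * The mount point and file name are hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
        int fd = open ("/mnt/gluster/testfile", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }
        if (write (fd, "hello", 5) < 0 && errno == EROFS) {
                /* Without quorum, modifying operations fail with EROFS;
                 * the application has to decide whether to retry or bail. */
                fprintf (stderr, "write failed: %s\n", strerror (errno));
        }
        close (fd);
        return 0;
}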

There’s another significant implication that might be of interest to those who follow my other blog. As such readers would know, I’m an active participant in the endless debates about Brewer’s CAP Conjecture (I’ve decided that Gilbert and Lynch’s later Theorem is actively harmful to understanding of the issues involved). In the past, GlusterFS has been a bit of a mess in CAP terms. It’s basically AP, in that it preserves availability and partition tolerance as I apply those terms, but with very weak conflict resolution. If only one side wrote to a file, there’s not really a conflict. When there is a conflict within a file, GlusterFS doesn’t really have the information it needs to reconstruct a consistent sequence of events, so it has to fall back on things like sizes and modification times (it does a lot better for directory changes). In a word, ick. What quorum enforcement does is turn GlusterFS into a CP system. That’s not to say I like CP better than AP – on the contrary, my long-term plan is to implement the infrastructure needed for AP replication with proper conflict resolution – but I think many will prefer the predictable and well understood CP behavior with quorum enforcement to the AP behavior that’s there now. Since it was easy enough to implement, why not give people the choice?

The beauty of little boxes, and their role in the future of storage

If you’re into 1960s songs about middle-class conformity, you may not have a positive association with lots of interchangeable “little boxes.” In storage, however, those little boxes are not only beautiful but the wave of the future.


License Change

As of a few minutes ago, the license on HekaFS changed from AGPLv3+ to GPLv3 – not AGPL, not LGPL, not later versions. This only affects the git repository so far; packages with the change still need to be built, then those need to be pushed into yum repositories, and all of that will take some time.

Why the change? It actually had very little to do with the acquisition; Gluster themselves had already moved to GPLv3+ and the plan has always been for the HekaFS license to track that for GlusterFS. What the acquisition did was spur a general conversation about what license should apply to both GlusterFS and HekaFS (as long as it remains separate). After several rounds of this, I was told it should be GPLv3, and so it is. While I’ve personally gone from favoring BSD/MIT to favoring A/GPL, I actually believe they’re all fine. Even though I’ve argued on my own blog about why AGPL is what GPL should be, I’ve also seen actual cases where AGPL-aversion has threatened to kill projects. It doesn’t matter what I think about the AGPL’s effect on others’ code, or even what the legal outcome would be if/when there’s a proper test case to set precedent. The fact is that the engineers who are trying to use the code can’t change a no-AGPL policy, and the people who make such policies have their reasons. As far as I know, that’s why Gluster had already abandoned AGPL. As to why it’s GPL instead of LGPL, or v3 instead of v2 or v3+... well, I don’t know. The differences at that point are below my threshold of caring, so I didn’t even ask.

Translator 101 Class 1: Setting the Stage

This is the first post in a series that will explain some of the details of writing a GlusterFS translator, using some actual code to illustrate.

Before we begin, a word about environments. GlusterFS is over 300K lines of code spread across a few hundred files. That’s no Linux kernel or anything, but you’re still going to be navigating through a lot of code in every code-editing session, so some kind of cross-referencing is essential. I use cscope with the vim bindings, and if I couldn’t do “ctrl-\ g” and such to jump between definitions all the time my productivity would be cut in half. You may prefer different tools, but as I go through these examples you’ll need something functionally similar to follow along. OK, on with the show.

The first thing you need to know is that translators are not just bags of functions and variables. They need to have a very definite internal structure so that the translator-loading code can figure out where all the pieces are. The way it does this is to use dlsym to look for specific names within your shared-object file, as follows (from xlator.c):

        if (!(xl->fops = dlsym (handle, "fops"))) {
                gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s",
                        dlerror ());
                goto out;
        }
 
        if (!(xl->cbks = dlsym (handle, "cbks"))) {
                gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s",
                        dlerror ());
                goto out;
        }
 
        if (!(xl->init = dlsym (handle, "init"))) {
                gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s",
                        dlerror ());
                goto out;
        }
 
        if (!(xl->fini = dlsym (handle, "fini"))) {
                gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s",
                        dlerror ());
                goto out;
        }

In this example, xl is a pointer to the in-memory object for the translator we’re loading. As you can see, it’s looking up various symbols by name in the shared object it just loaded, and storing pointers to those symbols. Some of them (e.g. init) are functions, while others (e.g. fops) are dispatch tables containing pointers to many functions. Together, these make up the translator’s public interface.
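
Put another way, every translator shared object is expected to export at least these names. Here’s a minimal sketch; the init/fini signatures are my assumption for this vintage of the code, so check xlator.h in your own tree:

/* Sketch: the public symbols the loader looks up by name. */
int32_t init (xlator_t *this);          /* called once the symbols are found */
void    fini (xlator_t *this);          /* called when the translator unloads */
struct xlator_fops fops;                /* dispatch table: filesystem ops     */
struct xlator_cbks cbks;                /* dispatch table: cleanup callbacks  */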

Most of this glue or boilerplate can easily be found at the bottom of one of the source files that make up each translator. We’re going to use the rot-13 translator just for fun, so in this case you’d look in rot-13.c to see this:

struct xlator_fops fops = {
        .readv        = rot13_readv,
        .writev       = rot13_writev
};
 
struct xlator_cbks cbks = {
};
 
struct volume_options options[] = {
        { .key  = {"encrypt-write"},
          .type = GF_OPTION_TYPE_BOOL
        },
        { .key  = {"decrypt-read"},
          .type = GF_OPTION_TYPE_BOOL
        },
        { .key  = {NULL} },
};

The fops table, defined in xlator.h, is one of the most important pieces. This table contains a pointer to each of the filesystem functions that your translator might implement – open, read, stat, chmod, and so on. There are 82 such functions in all, but don’t worry; any that you don’t specify here will be seen as NULL and filled with defaults from defaults.c when your translator is loaded. In this particular example, since rot-13 is an exceptionally simple translator, we only fill in two entries, for readv and writev.
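
To make that concrete, here’s a hypothetical variation – not the real rot-13 source. If the translator also wanted to intercept open and stat, the table would simply grow two more entries, and everything still left out would keep the default behavior:

/* Hypothetical extension of the rot-13 fops table; rot13_open and rot13_stat
 * don't exist in the real source.  Anything not listed falls back to the
 * defaults filled in at load time. */
struct xlator_fops fops = {
        .open   = rot13_open,
        .stat   = rot13_stat,
        .readv  = rot13_readv,
        .writev = rot13_writev
};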

There are actually two other tables, also required to have predefined names, that are used to find translator functions: cbks (which is empty in this snippet) and dumpops (which is missing entirely). The first of these specifies entry points for when inodes are forgotten or file descriptors are released. In other words, they’re destructors for objects in which your translator might have an interest. Mostly you can ignore them, because the default behavior handles even the simpler cases of translator-specific inode/fd context automatically. However, if the context you attach is a complex structure requiring complex cleanup, you’ll need to supply these functions.

As for dumpops, that’s just used if you want to provide functions to pretty-print various structures in logs. I’ve never used it myself, though I probably should. What’s noteworthy here is that we don’t even define dumpops. That’s because any code that might use this dispatch table checks for xl->dumpops being NULL before calling through it. This is in sharp contrast to the behavior for fops and cbks, which must be present. If they’re not, translator loading will fail, because those pointers are not checked on every call; if they were NULL we’d segfault. That’s why we provide an empty definition for cbks; it’s OK for the individual function pointers to be NULL, but not for the whole table to be absent.
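
For illustration only, a cbks table that did supply cleanup hooks might look something like this. The handlers are hypothetical, and the member names and signatures are my reading of xlator.h for this vintage of the code, so verify against your tree:

/* Hypothetical cleanup hooks -- rot-13 doesn't need these.  forget runs when
 * an inode is dropped from the inode table, release when a file descriptor
 * goes away, so this is where per-inode/per-fd context would be freed. */
int32_t rot13_forget (xlator_t *this, inode_t *inode);
int32_t rot13_release (xlator_t *this, fd_t *fd);

struct xlator_cbks cbks = {
        .forget  = rot13_forget,
        .release = rot13_release
};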

The last piece I’ll cover today is options. As you can see, this is a table of translator-specific option names and some information about their types. GlusterFS actually provides a pretty rich set of types (volume_option_type_t in options.h) which includes paths, translator names, percentages, and times in addition to the obvious integers and strings. Also, the volume_option_t structure can include information about alternate names, min/max/default values, enumerated string values, and descriptions. We don’t see any of these here, so let’s take a quick look at some more complex examples from afr.c and then come back to rot-13.

        { .key  = {"data-self-heal-algorithm"},
          .type = GF_OPTION_TYPE_STR,
          .default_value = "",
          .description   = "Select between \"full\", \"diff\". The "
                           "\"full\" algorithm copies the entire file from "
                           "source to sink. The \"diff\" algorithm copies to "
                           "sink only those blocks whose checksums don't match "
                           "with those of source.",
          .value = { "diff", "full", "" }
        },
        { .key  = {"data-self-heal-window-size"},
          .type = GF_OPTION_TYPE_INT,
          .min  = 1,
          .max  = 1024,
          .default_value = "1",
          .description = "Maximum number blocks per file for which self-heal "
                         "process would be applied simultaneously."
        },

When your translator is loaded, all of this information is used to parse the options actually provided in the volfile, and then the result is turned into a dictionary and stored as xl->options. This dictionary is then processed by your init function, which you can see being looked up in the first code fragment above. We’re only going to look at a small part of rot-13’s init for now.

        priv->decrypt_read = 1;
        priv->encrypt_write = 1;
 
        data = dict_get (this->options, "encrypt-write");
        if (data) {
                if (gf_string2boolean (data->data, &priv->encrypt_write) == -1) {
                        gf_log (this->name, GF_LOG_ERROR,
                                "encrypt-write takes only boolean options");
                        return -1;
                }
        }

What we can see here is that we’re setting some defaults in our priv structure, then looking to see if an “encrypt-write” option was actually provided. If so, we convert and store it. This is a pretty classic use of dict_get to fetch a field from a dictionary, followed by one of the many conversion functions in common-utils.c to turn data->data into something we can use.
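
The real init handles “decrypt-read” the same way. Sketched here from the pattern above rather than pasted from the source:

        /* Same pattern for the other option (a sketch based on the fragment
         * above, not a verbatim copy): look it up, convert, store. */
        data = dict_get (this->options, "decrypt-read");
        if (data) {
                if (gf_string2boolean (data->data, &priv->decrypt_read) == -1) {
                        gf_log (this->name, GF_LOG_ERROR,
                                "decrypt-read takes only boolean options");
                        return -1;
                }
        }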

So far we’ve covered the basics of how a translator gets loaded, how we find its various parts, and how we process its options. In my next Translator 101 post, we’ll go a little deeper into other things that init and its companion fini might do, and how some other fields in our xlator_t structure (commonly referred to as this) are commonly used.

Who Cares About Storage?

Yesterday, a customer asked whether some of the future directions for GlusterFS might result in less need for storage or system administrators. As part of my reply I hit on a theme that seemed to resonate well enough that it’s worth expanding on it a bit more. By the way, it’s not an original theme. It’s just one that doesn’t seem to get enough attention.

The whole idea of a “storage administrator” is becoming outdated. “Storage” implies stasis. In common usage, when you store something it just sits there, not moving, often at some distance from its point of use, for some considerable time. In the technical world, “storage” is often used to mean moving data to/from one device, subject to the performance and other constraints of that one device. You might have many such devices, and they might use RAID or some other kind of redundancy internally, but this model of storage is still about data in repose and in isolation.

The trouble is that all of the interesting problems nowadays are about data as it moves from place to place. Storage as a specialty is about little islands of data; data as a specialty is about many heterogeneous islands and transport between them. Storage performance is about feeds and speeds in/out of a box; data performance is about speed to/from whichever box is most convenient, accounting for all kinds of replicas and caches within the data layer. As data need outpaces data speed by ever greater margins, a data administrator must make ever more sophisticated decisions about which data should move through which pipes to which locations, and when. Managing data integrity becomes inextricably entwined with managing consistency across multiple copies. That brings with it a whole host of difficult problems and tradeoffs, and I haven’t even gotten to security or format/protocol issues yet.

Just as there’s a difference between low-level networking (e.g. Ethernet/IP) knowledge and higher-level distributed systems knowledge, there’s also a difference between storage expertise and data expertise. We need to make the storage part simpler precisely because a data administrator has to understand all of this other stuff as well. Humans should be handling the policies and exceptions that machines can’t, not bogged down managing the mere mechanics of something that the data layer should be able to do autonomously.

Say goodbye to the storage administrator. Say hello to the data administrator. What, you say they look the same? Exactly.

Asteroids, nuclear war and data center outages: Surviving big disasters by being small

Imagine that it’s the height of the Cold War, and you are trying to design an approach to command, control, and communications that can survive a full-scale nuclear attack.

One approach is to build a small number of communications nodes that are highly resilient. For example, you can build communications bunkers a mile deep under mountains, or keep a pair of specially-outfitted jets continuously in the air above Nebraska. Let’s call this the Big Box approach.


The Future of HekaFS

A lot of people have asked me about what the acquisition of Gluster by Red Hat means for HekaFS. In the interests of transparency I’m going to share a bit of my thinking on that, but I should be extra-careful to note that none of this represents the official position or plans of Red Hat. These are just my own personal predictions, or even hopes. I can influence the official direction to some extent, but I’m like the tenth guy down the totem pole when it comes to making actual decisions.


Taming the BEAST

By now, a lot of people have heard of BEAST, an attack against the CBC-mode encryption (including AES-CBC) used in SSL/TLS. Some people might also have noticed that the HekaFS git sources include “aes” and “cbc” branches which represent two different implementations of a new at-rest encryption method to replace the weak AES-CTR version that we’re using as a placeholder, and those people might wonder whether we share the BEAST vulnerability. Short answer: we don’t. While Edward’s “aes” branch might implement real CBC, my “cbc” branch does not. Yeah, I know that’s confusing. Simply put, I use some of the “xxx_cbc” entry points for convenience, but only for one cipher block at a time, so there’s no actual chaining involved. One correspondent has already pointed out – correctly – that “cbc” is a misnomer for what’s really tweaked ECB. Our scheme is actually pretty similar to LRW, but it uses a hash and a unique (per file) salt instead of Galois-field multiplication. It was designed to defeat a completely different attack (modification of one ciphertext block leading to a predictable change in the next plaintext block), but it also avoids the guessable-IV flaw that is the basis of BEAST.
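
To make the “tweaked ECB” idea a little more concrete, here’s a schematic sketch in C. It is not the HekaFS code and the names are mine; it just illustrates deriving a per-block tweak from a per-file salt and the block number via a hash (standing in for LRW’s Galois-field multiply), so identical plaintext blocks encrypt differently and there is no chaining between blocks:

/* Schematic sketch only -- NOT the HekaFS implementation.  Each 16-byte block
 * gets its own tweak derived from hash(salt || block number), so there is no
 * chaining and no guessable IV of the kind BEAST exploits. */
#include <stdint.h>
#include <openssl/aes.h>
#include <openssl/sha.h>

void
encrypt_block (const AES_KEY *key, const unsigned char *salt, size_t salt_len,
               uint64_t block_num, const unsigned char in[16],
               unsigned char out[16])
{
        unsigned char tweak[SHA256_DIGEST_LENGTH];
        unsigned char tmp[16];
        SHA256_CTX    ctx;
        int           i;

        /* tweak = H(salt || block_num) */
        SHA256_Init (&ctx);
        SHA256_Update (&ctx, salt, salt_len);
        SHA256_Update (&ctx, &block_num, sizeof (block_num));
        SHA256_Final (tweak, &ctx);

        /* XOR the tweak into the plaintext, then encrypt a single block. */
        for (i = 0; i < 16; i++)
                tmp[i] = in[i] ^ tweak[i];
        AES_encrypt (tmp, out, key);
}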


Storage is a hard problem with a soft(ware) solution

My wife and I are both dog people, but we have mixed views regarding another contentious issue: hardware versus software.
