I was directed to a recent mailing list post by Linus Torvalds on linux-fsdevel in which he derided the concept of user-space filesystems. Not a particular implementation, mind you, but the very concept of it.
Jeff Darcy, of Red Hat and CloudFS fame, wrote a wonderful response, which you should read before continuing.
From my perspective, as the creator of GlusterFS, Linus is rather blinkered on this issue. The fact is, the advantages of user space far outweigh those of kernel space. You’ll notice that Linus pointed to no benchmarks or studies confirming his opinion; he merely presented his bias as if it were fact. It is not.
Hypervisors are the modern microkernels. A microkernel is not about size, but about what should run in kernel mode. Linus’s ideas about filesystems are rather old. He thinks that it is a bad idea to push filesystems to user space while leaving the memory manager in kernel mode: the bulk of the memory buffers hold filesystem contents, so the two need to work together. This is true for root filesystems with relatively small amounts of data, but not for scalable storage systems.
Don’t let the kernel manage the memory for you. In my opinion, kernel space does a poor job of handling large amounts of memory with 4k pages. If you see the bigger picture, disks and memory have grown much larger, and user requirements have grown 1000-fold. To handle today’s scalable, highly available storage needs, filesystems need to scale across multiple commodity systems, which is much easier to do in user space.
The real bottlenecks come from network and disk latencies, buffer copying, and chatty IPC/RPC communication. Kernel-user context switches are hardly visible in the broader picture, so whatever performance improvement kernel space offers is irrelevant. Better, then, to use the simpler, easier methods offered in user space to satisfy modern storage needs. Operating systems run in user space in virtualized and cloud environments, and kernel developers should overcome this mental barrier.
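To put the 4k-page point in numbers, here is a back-of-the-envelope sketch. It only counts how many pages it takes to map a buffer at a given page size. The 64 GiB cache size and the 2 MiB large-page size are assumptions chosen for illustration, not figures from any particular deployment.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map a buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


# Hypothetical 64 GiB of cached file data.
cache_bytes = 64 * GIB

small_pages = pages_needed(cache_bytes, 4 * KIB)  # 4k pages
large_pages = pages_needed(cache_bytes, 2 * MIB)  # assumed 2 MiB pages

print(f"4 KiB pages: {small_pages:,}")
print(f"2 MiB pages: {large_pages:,}")
print(f"ratio: {small_pages // large_pages}x")
```

With these assumed sizes, the same buffer needs 512 times as many mappings at 4k granularity.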
Once upon a time, Linus eschewed microkernels for a monolithic architecture for the sake of simplicity. One would hope that he would be able to grasp the reasons why simplicity wins in this case, too. Unfortunately, he seems to have learned the wrong lesson from the microkernel vs. monolithic kernel debates: the lesson should not have been that all the important stuff gets thrown into the kernel, but that simplicity outweighs insignificant improvements elsewhere. We have seen this in the growth of virtualization and cloud computing, where the performance cost of new features has proved to be irrelevant.
There are bigger issues to address. Simplicity is *the* key to scalability. Features like online self-healing, online upgrade, online node addition/removal, HTTP-based object protocol support, compression/encryption support, HDFS APIs, and certificate-based security are complex in their own right. Requiring that they live in kernel space only adds to the complexity, hampering progress and development. Kernel-mode programming is too complex, too restrictive, and unsustainable in many ways. It is hard to find kernel hackers, hard to write and debug code in kernel mode, and hard to handle hardware reliability when you scale out, because there are multiple points of failure.
GlusterFS got its inspiration from the GNU Hurd kernel. Many years earlier, GNU Hurd was able to mount tarballs as a filesystem, FTP as a filesystem, and POP3 as an mbox file. Users could extend the operating system in clever ways. A FUSE-like user-space architecture was an inherent part of the Hurd operating system design. Instead of treating filesystems as a module of the operating system, Hurd treated filesystems as the operating system. All parts of the operating system were developed as stackable modules, and Hurd handled hardware abstraction. Didn’t we see the benefits of converging the volume manager, software RAID, and filesystem in ZFS? GNU Hurd took it a step further, and GlusterFS brought it to the next level with Linux and other Unix kernels. It treats the Linux kernel as a microkernel that handles hardware abstraction, and it broaches the subject that everyone is thinking about, if not stating out loud: the cloud is the operating system. In this brave new world, stuffing filesystems into kernel space is counter-productive and hinders development. GlusterFS has inverted the stack, with many traditional kernel-space jobs now handled in user space.
In fact, when you begin to see the cloud and distributed computing as the future (and present), you realize that the entire nomenclature of user space vs. kernel space is anachronistic. In a world where entire operating systems sit inside virtualized containers in user space, what does it even mean to be kernel space any more? Looking at the broader trends, arguing against user space filesystems is like arguing against rising and falling tides. To suggest that nothing significant is accomplished in user space is to ignore all major computing advances of the last decade.
To solve 21st-century distributed computing problems, we needed 21st-century tools for the job, and we wrote them into GlusterFS. GlusterFS manages most of the operating system functionality within its own user space, from memory management, IO scheduling, volume management, NFS, and RDMA to RAID-like distribution. For memory management, it allocates large blocks for large files, resulting in far fewer page table entries, and it is easier to garbage collect in user space. Similarly, for IO scheduling, GlusterFS uses elastic hashing across nodes and IO threads within the nodes. It can scale threads on demand and group blocks belonging to the same inode together, eliminating disk contention. GlusterFS does a better job of managing its memory and scheduling, and the Linux kernel doesn’t have an integrated approach. It is user-space storage implementations that have scaled the GNU/Linux OS beyond petabytes seamlessly. That’s not my opinion, it’s a fact: the largest deployments in the world are all user-space. What’s wrong with FUSE simplifying filesystem development to the level of toy making? 🙂
Some toys are beautiful and work better than others.
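As a small illustration of how little code such a toy needs, here is a sketch of hash-based file placement across storage nodes. To be clear, this is plain modulo hashing, not the elastic hashing algorithm GlusterFS uses; the `pick_node` function and the node names are hypothetical. It only shows that a file's location can be computed from its path, with no central lookup table.

```python
import hashlib


def pick_node(path: str, nodes: list[str]) -> str:
    """Map a file path to one storage node by hashing the path.

    Assumption for illustration: simple modulo placement. A real system
    needs a scheme that limits data movement when nodes come and go.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical node names.
nodes = ["node-a", "node-b", "node-c", "node-d"]

for path in ["/vol/logs/app.log", "/vol/db/table.dat", "/vol/img/cat.png"]:
    print(path, "->", pick_node(path, nodes))
```

Every client that knows the node list computes the same answer independently, which is the property that lets placement scale without a metadata bottleneck.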