Linus Torvalds doesn't understand user-space filesystems

I was directed to a recent mailing list post by Linus Torvalds on linux-fsdevel in which he derided the concept of user-space filesystems. Not a particular implementation, mind you, but the very concept of it.

Jeff Darcy, of Red Hat and CloudFS fame, wrote a wonderful response, which you should read first before continuing further.

From my perspective as the creator of GlusterFS, Linus is rather blinkered on this issue. The fact is, the advantages of user space far outweigh those of kernel space. You'll notice that Linus pointed to no benchmarks or studies confirming his opinion; he merely presented his bias as if it were fact. It is not.

Hypervisors are the modern micro kernels. Microkernel is not about size, but about what should be in kernel mode. Linus's ideas about filesystems are rather old. He thinks it is a bad idea to push filesystems to user space while leaving the memory manager in kernel mode, because the bulk of the memory buffers are filesystem contents and you need the two to work together. This is true for root filesystems with relatively small amounts of data, but not when it comes to scalable storage systems. Don't let the kernel manage the memory for you. In my opinion, Kernel-space does a poor job of handling large amounts of memory with 4k pages. If you see the bigger picture, disks and memory have grown much larger, and user requirements have grown 1000-fold. To handle today's scalable, highly available storage needs, filesystems need to scale across multiple commodity systems, which is much easier to do in user space. The real bottlenecks come from network and disk latencies, buffer copying and chatty IPC/RPC communication. Kernel-user context switches are hardly visible in that broader picture, so whatever performance improvement kernel mode offers is irrelevant. Better, then, to use the simpler, easier methods offered in user space to satisfy modern storage needs. Operating systems themselves run in user space in virtualized and cloud environments, and kernel developers should overcome this mental barrier.

Once upon a time, Linus eschewed microkernels for a monolithic architecture for the sake of simplicity. One would hope that he would grasp why simplicity wins in this case, too. Unfortunately, he seems to have learned the wrong lesson from the microkernel vs. monolithic kernel debates: instead of the lesson being that all important stuff gets thrown into the kernel, it should have been that simplicity outweighs insignificant improvements elsewhere. We have seen this in the growth of virtualization and cloud computing, where the tradeoff between new features and performance loss has proved to be irrelevant.

There are bigger issues to address. Simplicity is *the* key to scalability. Features like online self-healing, online upgrade, online node addition/removal, HTTP based object protocol support, compression/encryption support, HDFS APIs, and certificate based security are complex in their own right. Necessitating that they be in kernel space only adds to the complexity, thus hampering progress and development. Kernel mode programming is too complex, too restrictive and unsustainable in many ways. It is hard to find kernel hackers, hard to write code and debug in kernel mode, and it is hard to handle hardware reliability when you scale out due to multiple points of failure.

GlusterFS got its inspiration from the GNU Hurd kernel. Many years before, GNU Hurd could already mount tarballs as a filesystem, FTP as a filesystem, and POP3 as an mbox file. Users could extend the operating system in clever ways. A FUSE-like user space architecture was an inherent part of the Hurd operating system design. Instead of treating filesystems as a module of the operating system, Hurd treated filesystems as the operating system. All parts of the operating system were developed as stackable modules, and Hurd handled hardware abstraction. Didn't we see the benefits of converging the volume manager, software RAID and filesystem in ZFS? GNU Hurd took it a step further, and GlusterFS brought it to the next level on Linux and other Unix kernels. It treats the Linux kernel as a microkernel that handles hardware abstraction, and it broaches the subject that everyone is thinking, if not stating out loud: the cloud is the operating system. In this brave new world, stuffing filesystems into kernel space is counter-productive and hinders development. GlusterFS has inverted the stack, with many traditional kernel-space jobs now handled in user space.

In fact, when you begin to see the cloud and distributed computing as the future (and present), you realize that the entire nomenclature of user space vs. kernel space is anachronistic. In a world where entire operating systems sit inside virtualized containers in user space, what does it even mean to be kernel space any more? Looking at the broader trends, arguing against user space filesystems is like arguing against rising and falling tides. To suggest that nothing significant is accomplished in user space is to ignore all major computing advances of the last decade.

To solve 21st-century distributed computing problems, we needed 21st-century tools for the job, and we wrote them into GlusterFS. GlusterFS manages most of the operating system functionality within its own user space, from memory management, IO scheduling, volume management, NFS, and RDMA to RAID-like distribution. For memory management, it allocates large blocks for large files, resulting in far fewer page table entries, and it is easier to garbage collect in user space. Similarly with IO scheduling, GlusterFS uses elastic hashing across nodes and IO-threads within the nodes. It can scale threads on demand and group blocks belonging to the same inodes together, eliminating disk contention. GlusterFS does a better job of managing its memory or scheduling, and the Linux kernel doesn't have an integrated approach. It is user-space storage implementations that have scaled the GNU/Linux OS beyond petabytes, seamlessly. That's not my opinion, it's a fact: the largest deployments in the world are all user-space. What's wrong with FUSE simplifying filesystem development to the level of toy-making? 🙂

Some toys are beautiful and work better than others.

    1. Similar architectures, if I recall, although I’ll defer to someone who’s actually knowledgeable. I would say that efforts like Plan9 inspired Hurd.

  1. hi. i have been watching the glusterFS project for more than a year. i like glusterFS and its design and also believe userspace is good. but this article has some funny stuff. the author is clearly not technically solid. the mistakes are so loud –

    “as the creator of GlusterFS,” – it is funny how the author claims this like marketing hype. there is not a single code commit from him in the entire project repository history.

    “For memory management, it allocates large blocks for large files, resulting in far fewer page table entries” – this is such a ridiculous statement, there is no way this could even have been a typo. this is such a type of error that it takes all the technical credibility away. i begin to doubt if this “creator” was ever a programmer, let alone system programmer (there is no code in his own creation! where else could it be then?). there is just no way the user space program can influence the number of page table entries for a size of memory. it is purely dependent on the hardware architecture. the author is making a joke out of himself here.

    “and it is easier to garbage collect in user space.” – on what grounds is the author saying this? glusterFS does not have garbage collection. if you are implementing your own garbage collector in a C program, it makes no difference if it is in user space or kernel space.

    “Similarly with IO scheduling, GlusterFS uses elastic hashing across nodes and IO-threads within the nodes. It can scale threads on demand and group blocks belonging to the same inodes together, eliminating disk contention.” – this sentence talks about two concepts, trying to convince the reader through confusion. elastic hashing is a nice concept. it is not an IO scheduler. glusterFS only does aggregation of small sequential io into 128kb chunks. it is wrong to claim this reduces disk contention. sitting in userspace it has no idea or control of disk heads or spin up/downs. it can only HOPE the kernel IO scheduler does a good job on its behalf. glusterFS is clueless about disk contention – there is no synchronization among its IO threads to minimize disk seeks. sitting in userspace it is always clueless about finding the least expensive path of IO operations to minimize disk contention.

    “GlusterFS does a better job of managing its memory or scheduling, and the Linux kernel doesn't have an integrated approach.” – these are plain wrong statements. glusterFS USES linux kernel’s memory manager like any other software. all it does is malloc and mmap like all user space programs. it is absurd to say glusterfs is doing any memory management (other than its own virtual memory, which is backed by the kernel memory manager)

    the rest of the article is more marketing fluff filled with buzzwords and very little technical points. it sounds like a poorly done marketing work of art, at best. the author just seems to have a good “approximate” understanding of technical terms sufficient to write an article which can convince and impress a non technical person with hard-to-argue false claims and buzzwords.

    novak

    1. I can assure you, AB Periasamy is no marketing dweeb, and is very much a systems programmer. But I’ll let him speak to that 🙂

      You should also read the Jeff Darcy piece he linked to.

      -JM, Gluster Community Guy

  2. If he isn’t, what about those technical mistakes in the article? Surely no one will argue they’re typos? Most of those did fly over my head, as I’m no systems programmer – but one thing didn’t, and that was the marketing speak. The cloud is the OS? The cloud is the future? Scratch that, the present? Uhm, nice that you have a product you’re proud of (if you have it, that is – what was that about no code commits?), but could you refrain from spewing crap like that when trying to debunk someone (even if that someone is the perpetually annoying I’m-so-fucking-sure-of-myself-and-my-attitudes-that-are-usually-based-on-nothing-but-software-developer-machismo Linus Torvalds)? Thanks.

  3. I have been using glusterfs for 3 years. It connects a 32-node cluster to five 12T storage nodes. It is simple, scalable and very stable.

    We tried Lustre before glusterfs; it is good, but not what we want. Here are the reasons:
    1. It is tied to a special kernel, which limits the storage's usability. We want to use the storage for HPC, storage, backup, etc.
    2. It requires dedicated metadata nodes; however, we always want to stretch each dollar.
    3. Now it is owned by Oracle.

    We also waited for pNFS, but my life is too short.

    Other than user-space file systems, if somebody knows an open source or free distributed kernel-space file system that can handle millions of small files well for HPC and file service, please let us know.

    In my opinion, GlusterFS is the best distributed filesystem as of today. No matter how much user-space file systems are disparaged, they will keep growing. Eventually they will put Linux on a bigger stage.

  4. Carter Novak, perhaps you ought to Google his name before you start making grand pronouncements about his claims to be the creator of GlusterFS and insult his technical background. You’re making yourself look like a fool.

  5. Anandbabu, I have heard your talks and liked them. Here, however, I am afraid that all of your argument points lack data to back them. I would expect more from a technical person who seems to have implemented a complex, distributed file system. IMHO, you seem to have completely missed the context in which Linus made his comments. Would you recommend GlusterFS or any user space file system for a root file system? Can you function without a kernel based file system?

    I can talk more about the technical ineptitude, beyond the points listed by me and others, but will refrain. Do not use jargon to your advantage and confound unsuspecting or less knowledgeable audiences.

  6. Plan 9 is a research toy, it has provided several nice insights on high level design and a well regarded network filesystem
    Hurd is a philosophical toy, off the top of my head I do not recall other systems taking from its design.
    Mach is the micro-kernel usually admired; e.g. Apple used it to underpin their BSDish Mac OS X system.
    There was also MkLinux, which ran Linux under the Mach micro-kernel.
    I had thought that Hurd was reimplementing the Mach system using only FSF-approved tools, and that it was still alpha code, though I have not looked at it in quite a while.

    Now I like the idea of user space filesystems, usually as a network server in the Plan 9 style rather than, say, NFS.
    I admit FUSE seems like an odd duck to me, but I am old school, from the days when K&R was C
    and we chipped our compilers out of flint.

    However, user space filesystems are almost always second-class systems, whether FUSE or network server.
    DMA, page size, memory allocation, the VFS, and the page cache are in general abstracted away by the kernel on purpose;
    this is both the advantage and the problem of user space.
    The advantage is that user space is not bound by kernel memory limits and the weird restrictions required in kernel mode.
    The problem is that some of the neat tricks cannot be performed, e.g. when page faults are hidden from you.

    The folding of LVM and RAID into filesystems has struck me as a step backwards; there already was an LVM and two RAID systems (md & dm) in the kernel. Adding the extras for ZFS so Sun disks could be directly mounted may have
    made some sense, but still struck me as a layering violation…
    Btrfs doing the same thing seemed even more so, as it was not dealing with a legacy disk format from another system.
    This may be another case of my old-fashioned opinions getting in the way.
    Now I will admit that the filesystem having knowledge of the RAID layout and LVM may allow for extra tricks,
    but I would have preferred the native LVM and RAID being enhanced with the requisite support for said tricks.
    Perhaps this will yet come to be.

  7. @johnsmith – i said no commits by the creator in the entire project repository. of course there are a huge number of other commits. i have used the software and like it.

    @vidar – googling a name means squat. in fact, following your suggestion, i did google his name. after many result pages, only references to blogs, tweets, talks, articles, conferences, no code. i see he is a free software activist. he is a member of many projects, but very little code in any of them from him; in many, zero code from him. with what hacker ethics can one claim to be a member (or even creator) of a project with zero code commits? i have no problem with him not coding – it is his personal choice. but it shows in his post: he is an average programmer with very poor system knowledge, yet making absurd claims with the pretense of system programming expertise. in what way do you explain the page table entry claim!!! how do you explain a person with such “expertise” making claims like “garbage collection is easier in userspace”? you don’t need google search results (loaded with marketing fluff) to claim expertise. technical expertise is purely evaluated by correct technical claims (not the case here) and good code (nowhere to be found).

    i am happy to be proved wrong. please show me good code and articles with technical rigor, without marketing fluff.

    novak

  8. Manhong Dai:
    GFS 2?

    I worry that in reality I'm being trolled by both Linus AND Anand. This isn't some inside joke, is it?

  9. I have to agree with Carter Novak. Some more gems:

    >>Hypervisors are the modern micro kernels. Microkernel is not about size, but about what should be in kernel mode.

    I am assuming you meant protected modes. Both the hypervisor and the guest kernels run in protected modes. But yes, the guest is on a higher ring. That has some similarity with micro kernels in that the various parts of the kernel operate in different rings (micro in ring 0). But that’s where the similarity ends.

    If you look at the trends in virtualization, you will notice that performance still trumps and that vendors are busy introducing technologies that allow guest direct access to the hardware. SRIOV is one example. That goes against the basic tenet of a micro kernel.

    The goal in virtualization is to allow flexibility in hardware without any loss of performance. Simplicity be damned.

    >> Kernel-space does a poor job of handling large amounts of memory with 4k pages. If you see the bigger picture, disks and memory have grown much larger, and user requirements have grown 1000-fold.

    And a userspace program can do no better if it too has to track 1TB in 4K chunks. The solution is taking advantage of large pages in the CPU. Most Unices support that. (Just google super-pages.) It was added to Linux within the past year.
    Also, while you don't say it, I am assuming you are exploiting the hugepages capability that Linux provides to create large pages. If so, you are exploiting a feature the kernel provides. A kernel fs can use it too, if it wants.

    >> Once upon a time, Linus eschewed microkernels for a monolithic architecture for the sake of simplicity…. instead of the lesson being that all important stuff gets thrown into the kernel, it should have been that simplicity outweighs insignificant improvements elsewhere.

    No. The Linux developers are loath to add code that is better served in user land. It should be noted that a kernel developer asked whether overlayfs should be moved to user space, and the person who wrote FUSE (Miklos) replied that it would be more efficient in kernel space.

    >> Features like online self-healing, online upgrade, online node addition/removal, HTTP based object protocol support, compression/encryption support, HDFS APIs, and certificate based security are complex in their own right.

    Really? You think compression and encryption are better served in userspace? Why? btrfs and zfs are already doing it in kernel space. And I believe more file systems will tackle at least encryption soon. Online self-healing is marketing jargon. Online upgrade, node removal, etc…. what makes userspace better than kernelspace? Most cluster file systems in the kernel support all that. And if they don't, it is not because of the kernel.

    >> Kernel mode programming is too complex, too restrictive and unsustainable in many ways.

    Complex? While it is not easy, there are thousands of people doing just that.
    Unsustainable? In what way? Are you worried the developers of ntfs, ext4, zfs, hfs are on their last legs?

    >> GlusterFS got its inspiration from the GNU Hurd kernel.

    How many users does Hurd have?

    This blog is full of the same tired arguments professed by the micro kernel crowd. These arguments are not new. They have been made for over 20 years.

    While userspace file systems make sense for certain workloads, use cases, etc., Linus is essentially correct that overlayfs will be better served in the kernel than in user space.

  10. carter novak wrote:

    > “For memory management, it allocates large blocks for large files, resulting in far fewer page
    > table entries” – this is such a ridiculous statement, there is no way this could even have been a
    > typo. this is such a type of error that it takes all the technical credibility away. i begin to doubt if
    > this “creator” was ever a programmer, let alone system programmer (there is no code in his
    > own creation! where else could it be then?). there is just no way the user space program can
    > influence the number of page table entries for a size of memory. it is purely dependent on the
    > hardware architecture. the author is making a joke out of himself here.

    Check the facts. Look at the man page of mmap(2), in particular the MAP_HUGETLB flag. If you’re too lazy, let me quote it for you:

    > MAP_HUGETLB (since Linux 2.6.32)
    > Allocate the mapping using “huge pages.” See the kernel source
    > file Documentation/vm/hugetlbpage.txt for further information.

    On i686 platforms, it allows you to use pages of 4MB instead of 4kB. Using huge pages has been possible for quite some time. It requires hardware support, which has been there for many years already, available in processors as old as the Pentium Pro:

    http://en.wikipedia.org/wiki/Page_%28computer_memory%29#Huge_pages
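    As a minimal sketch (not GlusterFS code, and assuming the administrator has reserved huge pages, e.g. via /proc/sys/vm/nr_hugepages), this is all it takes for a user space program to back a large cache buffer with 2MB pages instead of 4kB pages, and thus with far fewer page table entries:

    /* cache_hugepages.c - minimal sketch, not GlusterFS code.
     * Requires reserved huge pages, e.g.: echo 512 > /proc/sys/vm/nr_hugepages */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define CACHE_SIZE (512UL * 2 * 1024 * 1024)   /* 1 GB, a multiple of the 2 MB huge page size */

    int main(void)
    {
        void *buf = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* typically ENOMEM if no huge pages are reserved */
            return 1;
        }
        /* ... use buf as a file-content cache; one page-table entry per 2 MB instead of per 4 kB ... */
        munmap(buf, CACHE_SIZE);
        return 0;
    }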

    > “and it is easier to garbage collect in user space.” – on what grounds is the author saying
    > this? glusterFS does not have garbage collection. if you are implementing your own garbage
    > collector in a C program, it makes no difference if it is in user space or kernel space.

    Every language doing garbage collection is "doing it in user space". Modern garbage collection is a highly computationally intensive task. Doing it in user space means (1) you don't need to yield the CPU from time to time, the kernel will do it for you; (2) you don't need to worry about stack usage, the swap space is helping you; and (3) you can use any library you want, so those libraries can also benefit from (1) and (2). Makes quite a difference, I think.
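    To make point (3) concrete: a plain C program in user space can get garbage collection simply by linking against an off-the-shelf collector such as Boehm's libgc. A minimal sketch, assuming libgc is installed (build with cc gc_demo.c -lgc):

    /* gc_demo.c - user-space garbage collection via Boehm's libgc */
    #include <gc.h>
    #include <stdio.h>

    int main(void)
    {
        GC_INIT();                              /* initialize the collector */
        for (int i = 0; i < 1000000; i++) {
            char *p = GC_MALLOC(4096);          /* allocate; we never free explicitly */
            p[0] = (char)i;                     /* touch the block */
        }
        /* unreachable blocks have been reclaimed behind our back */
        printf("collector heap size: %lu bytes\n",
               (unsigned long)GC_get_heap_size());
        return 0;
    }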

    > “GlusterFS does a better job of managing its memory or scheduling, and the Linux kernel
    > doesn't have an integrated approach.” – these are plain wrong statements. glusterFS USES
    > linux kernel's memory manager like any other software. all it does is malloc and mmap like all
    > user space programs. it is absurd to say glusterfs is doing any memory management (other
    > than its own virtual memory, which is backed by the kernel memory manager)

    Many user programs are doing memory management. It starts with the C library: when you do malloc(3), you are exercising its memory manager. It does at least two things: (1) it decides whether sbrk(2) or mmap(2) should be used to fulfill any memory request, and (2) if sbrk(2) is used, it organizes data structures to keep track of memory allocated to the application and freed to the C library, so that freed memory not suitable for returning to the OS can be reused by a later malloc. That memory is user memory, so the OS will do everything it can to collect it, to put it in swap, etc. That doesn't mean the user program doesn't have a distinctive role in managing that memory. And, once you use huge pages, the OS is largely hands-off: it doesn't do much to try swapping them out. So the application will be solely responsible for its management until the application ends or until you do a munmap on it.
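    To make the same point concretely, even a toy user-space allocator is "doing memory management": it goes to the kernel once with mmap(2) and afterwards decides on its own which memory to hand out and which freed blocks to reuse. A sketch for illustration only (nothing like glibc malloc or GlusterFS's iobuf pools):

    /* pool.c - toy fixed-size pool allocator layered on mmap(2).
     * Illustration only; error handling kept minimal. */
    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define POOL_SIZE  (1 << 20)          /* one 1 MB arena from the kernel   */
    #define BLOCK_SIZE 256                /* fixed block size handed to users */

    static char *pool, *pool_end, *bump;
    static void *free_list;               /* singly linked list of freed blocks */

    static void pool_init(void)
    {
        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        bump = pool;
        pool_end = pool + POOL_SIZE;
    }

    static void *pool_alloc(void)
    {
        if (free_list) {                  /* reuse a freed block, no syscall  */
            void *p = free_list;
            free_list = *(void **)p;      /* next pointer stored in the block */
            return p;
        }
        if (bump + BLOCK_SIZE > pool_end) /* arena exhausted */
            return NULL;
        void *p = bump;
        bump += BLOCK_SIZE;
        return p;
    }

    static void pool_free(void *p)
    {
        *(void **)p = free_list;          /* push onto the free list */
        free_list = p;
    }

    int main(void)
    {
        pool_init();
        void *a = pool_alloc(), *b = pool_alloc();
        pool_free(a);
        void *c = pool_alloc();           /* reuses a's block without asking the kernel */
        printf("reused: %s\n", c == a ? "yes" : "no");
        pool_free(b); pool_free(c);
        return 0;
    }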

  11. @Isaac:

    > “For memory management, it allocates large blocks for large files, resulting in far fewer page
    > table entries”

    the statement made by the author is "it allocates large blocks". that is a definitive statement which can be true only if his software has any awareness of huge pages. there are no traces of that in the code. that is a clear lie.

    > Every language doing garbage collection is "doing it in user space". Modern garbage collection is a highly
    > computationally intensive task. Doing it in user space means (1) you don't need to yield the CPU from time to
    > time, the kernel will do it for you; (2) you don't need to worry about stack usage, the swap space is helping you;
    > and (3) you can use any library you want, so those libraries can also benefit from (1) and (2). Makes quite a
    > difference, I think.

    computationally intensive has no relation to userspace / kernel space. kernel threads don't need to yield themselves; for a very long time kernel threads have been preemptive. the standard garbage collection technique is to have garbage collection threads which are treated appropriately by the scheduler and swapped in/out as necessary. the more important point here is that the author claims his software does garbage collection – which it clearly does not. yet another false claim.

    >> “GlusterFS does a better job of managing its memory or scheduling, and the Linux kernel
    >> doesn't have an integrated approach.”

    > Many user programs are doing memory management. It starts with the C library: when you do malloc(3), you
    > are exercising its memory manager. It does at least two things: (1) it decides whether sbrk(2) or mmap(2)
    > should be used to fulfill any memory request, and (2) if sbrk(2) is used, it organizes data structures to keep track
    > of memory allocated to the application and freed to the C library, so that freed memory not suitable for returning
    > to the OS can be reused by a later malloc. That memory is user memory, so the OS will do everything it can
    > to collect it, to put it in swap, etc. That doesn't mean the user program doesn't have a distinctive role in
    > managing that memory. And, once you use huge pages, the OS is largely hands-off: it doesn't do much to try
    > swapping them out. So the application will be solely responsible for its management until the application ends or
    > until you do a munmap on it.

    what you say above is just theory. what the author claims is that his software does a better and “integrated” memory management job than the linux kernel. and i repeat, by doing just sbrk() and mmap() no process can do a better “integrated” job at memory management. by virtue of being in user space a program just cannot see (let alone manage) a whole bunch of memory regions. the linux kernel implements a very sophisticated vm with configurable pressure and timeout management of memory buffers between application requests, page cache, dcache etc. THAT is a truly “integrated” memory management. implementing your own malloc cannot be called a better and “integrated” memory manager – it only falls into the application bucket of the true integrated memory manager sitting below you in the kernel.

    you as a reader of the post may claim correct general theory, but the reason i write these posts is that i have been following the project repository for about a year and i know for a fact that the claims the author makes are NOT in the software. and the fact that the author has no commits in the entire history of the project makes me more suspicious that he is making clueless claims disconnected from reality.

    novak

  12. This looks more like a marketing stunt than an actual technical argument; I openly support Carter's argument.

    I have deployed and used GlusterFS; its design is elegant and it works for me. But what the author suggests here is not true, and I can openly say this with practical data if needed.

    I wonder why Gluster would write such an erroneous blog post in the first place; it already has a product which works fine but has bugs which get fixed in subsequent releases.

    Rather than focusing on documentation and fixing bugs, they put out this unworthy, childish article.
    I even stopped reading halfway, since most of it is just weird science fiction.

    What Linus said is wrong in general, but the context they are discussing is very different. It seems the author of this article did not read the entire email thread and hurried into writing something entirely out of context to get media attention for nothing.

  13. Y'all are being trolled by the author. He's clearly got no other intention than to stir up some controversy to bring traffic. There is no technical argument here.

  14. @jim
    “Plan 9 is a research toy, it has provided several nice insights on high level design and a well regarded network filesystem
    Hurd is a philosophical toy, off the top of my head I do not recall other systems taking from its design.”

    I noticed that you pointed to no benchmarks or studies confirming your opinion. You merely presented your bias as if it were fact. It is not. There is no shortage of people who use Plan 9 or Hurd for serious business. If you can't think of exotic operating systems (or user space file systems) as more than toys, you are lacking in engineering imagination.

  15. I apologize for the lack of detail and the vagueness in my post. It was no marketing stunt or secret agenda to promote GNU Hurd ;). All of the blame lies with my poor writing and public speaking skills. I wrote the above post as a comment on Jeff Darcy's blog. He articulated really well why Linus was wrong about user space filesystems. My disagreement is only with Linus's views on user space filesystems in general. I have tremendous respect for Linus otherwise.

    I have added specific pointers to my claims and some details. Responses below:

    @novak I am sorry for writing vaguely on a very sensitive topic, without sufficient clarification. I can see that you are really upset at me. If you are in the Bay Area, we can settle the issues over beer :).

    I never said the Linux kernel is not doing memory management and disk IO scheduling. I used the term "page table entries" loosely. I agree that pages and page tables normally refer to the hardware and the kernel's memory manager, but there is no reason to reserve these terms for hardware or kernel mode alone. All I am trying to say is that there are simpler and better ways to deal with a large memory cache and I/O scheduling from user space. Most scalable applications today do not depend on the kernel's abilities for high performance. There are sufficient libraries (obstacks, libgc, ...) and data structures (hash tables, B-trees) to make cache management simpler in user space. The key is knowing the context of the data and controlling the events. GlusterFS, being a filesystem in user space, is better equipped to handle memory and disk I/O well. It is OK for the Java-based Hadoop HDFS to call itself a filesystem, even though it is not POSIX. Some of the largest storage systems (OpenStack Swift, Amazon S3) ditched POSIX in favor of simple HTTP RESTful GET/PUT APIs. I am OK with root filesystems being in the kernel, but claiming that all user-space filesystems are toys and misguided is simply ridiculous.

    Will you argue that the Java VM doesn't do any memory management? In Mach, the virtual memory manager runs in user space. Some useful reading:
    Memory handling in GNU/Hurd: http://kilobug.free.fr/hurd/pres-en/abstract/html/node9.html
    External Memory Management: http://www.linuxselfhelp.com/gnu/machinfo/html_chapter/mach_6.html
    OS X Mach: http://docs.huihoo.com/darwin/kernel-programming-guide/Mach/chapter_6_section_5.html
    In the GlusterFS memory management code (libglusterfs/src/iobuf.c, xlators/performance/io-cache/src/page.c, xlators/performance/quick-read/src/quick-read.c), you will see the same terminology used freely. Check this link for variable-sized iobufs: http://patches.gluster.com/patch/7429/ . Just grep for “prune” across the source tree and you will find how resources are garbage collected depending on their context. Each glusterfs translator that requires caching decides on its own when to reclaim its memory. The GlusterFS RPM installation always tunes the Linux kernel not to act smart. Under normal circumstances the kernel probably helps more than it gets in the way. When it comes to solid state disks or high performance streaming applications, we in fact use O_DIRECT (“option o-direct enable” under storage/posix) to get the kernel completely out of the way (a rough sketch of the O_DIRECT pattern follows the sysctl settings below). On servers with 16GB memory (which is very common today), these settings help GlusterFS perform much better:

    # /etc/sysctl.conf
    vm.swappiness = 0 # Tell the kernel to avoid swapping application memory.
    vm.vfs_cache_pressure = 10000 # Tell the kernel to reclaim dentry and inode caches aggressively rather than retaining them.
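    For reference, this is roughly what the O_DIRECT pattern looks like at the system-call level. A minimal sketch, not the actual storage/posix translator; O_DIRECT requires the buffer, offset and length to be suitably aligned (4096 bytes is a safe common choice), hence posix_memalign:

    /* direct_read.c - read a file with O_DIRECT, bypassing the kernel page cache.
     * Sketch only; alignment requirements vary by filesystem and device. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define ALIGN 4096
    #define CHUNK (128 * 1024)              /* aligned 128 kB reads */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;
        if (posix_memalign(&buf, ALIGN, CHUNK)) { fprintf(stderr, "posix_memalign failed\n"); return 1; }

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK)) > 0)
            total += n;                     /* data lands directly in our buffer, no page cache copy */

        printf("read %lld bytes with O_DIRECT\n", total);
        free(buf);
        close(fd);
        return 0;
    }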

    Similarly, see this code for IO scheduling (xlators/performance/io-threads/src/io-threads.c). You will find inode-based queuing of operations in the 3.0 code base. Look for the function iot_create_inode_worker_assoc(..): https://github.com/gluster/glusterfs/blob/v3.0.8/xlators/performance/io-threads/src/io-threads.c . In the latest 3.2 code base, https://github.com/gluster/glusterfs/blob/release-3.2/xlators/performance/io-threads/src/io-threads.c , we switched to a better model. Since read-ahead, write-behind and quick-read like modules took care of block aggregation and prefetching, we noticed that the performance bottleneck came from metadata operations. They were choking because other, slower I/O operations were ahead of them in the queue. Queuing metadata calls that require low latency at a higher priority, even if they arrive later in the sequence, improved the overall responsiveness of the filesystem. So we moved to priority-based queues and to scheduling operations based on their context. Threads scale up and down depending on the load. Older versions used adaptive-least-usage scheduling, https://github.com/gluster/glusterfs/blob/release-2.0/scheduler/alu/src/alu.c , but we noticed users often gravitated towards round-robin scheduling for its simplicity. When we implemented elastic hashing for small files, we realized it was simpler and better for most use cases. We ended up standardizing on elastic hashing at the high level, instead of a user-selectable list of options (adaptive-least-usage, round-robin, pattern-based, non-uniform-file-access and random): https://github.com/gluster/glusterfs/tree/release-2.0/scheduler .
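    To illustrate the scheduling idea only (a toy sketch, not the io-threads translator): keep one queue per priority and have worker threads always drain the high-priority metadata queue before touching the bulk-data queue.

    /* prio_queue.c - toy priority work queue: metadata ops jump ahead of bulk data ops.
     * Build with: cc prio_queue.c -pthread */
    #include <pthread.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>

    enum prio { PRIO_META = 0, PRIO_DATA = 1, PRIO_MAX = 2 };

    struct task { void (*fn)(void *); void *arg; struct task *next; };

    static struct task *queue[PRIO_MAX];              /* one FIFO per priority */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;

    static void submit(enum prio p, void (*fn)(void *), void *arg)
    {
        struct task *t = malloc(sizeof(*t));
        t->fn = fn; t->arg = arg; t->next = NULL;
        pthread_mutex_lock(&lock);
        struct task **tail = &queue[p];               /* append to that priority's FIFO */
        while (*tail) tail = &(*tail)->next;
        *tail = t;
        pthread_cond_signal(&more);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            struct task *t = NULL;
            while (!t) {
                for (int p = 0; p < PRIO_MAX && !t; p++)   /* highest priority first */
                    if (queue[p]) { t = queue[p]; queue[p] = t->next; }
                if (!t) pthread_cond_wait(&more, &lock);
            }
            pthread_mutex_unlock(&lock);
            t->fn(t->arg);                            /* run the operation outside the lock */
            free(t);
        }
        return NULL;
    }

    static void do_op(void *name) { printf("running %s\n", (const char *)name); }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        submit(PRIO_DATA, do_op, "bulk write");       /* queued first...               */
        submit(PRIO_META, do_op, "stat");             /* ...but may be picked up first */
        sleep(1);                                     /* crude: let the worker drain   */
        return 0;                                     /* a real pool would scale and join threads */
    }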

    @johnsmith We can sit and argue micro-kernel vs monolithic-kernel endlessly. That is not the point I am trying to make. There is a lot more scope for performance improvement in GlusterFS (like write-caching, attribute-caching, flash-cache and some more optimizations to existing performance modules). We are not so concerned about these minor improvements. Look around today: Hadoop HDFS is written in Java. The Google filesystem (http://bit.ly/k78hlq) manages the world's largest storage pools with centralized metadata and large blocks (e.g. 64MB and 1MB). They work around their inefficiencies with smarter application code. GlusterFS tries to strike a balance between a Google-like philosophy and a NetApp-like interface. Google's common-sense approach is smarter. At the end of the day it solves real-world problems. POSIX sucks, but compatibility is necessary for a few more years. Modern operating systems will look like the Amazon cloud environment for server computing and like Android/iOS for desktop/mobile computing. Projects like OpenStack and Red Hat Enterprise Virtualization are implementing these features in Python and Java. The APIs are not POSIX, but RESTful. No concept of UID/GID, but certificates. They all see Linux as a mere hypervisor and hardware abstraction layer. We will always need a kernel, but how much it should do is changing. Even a system project like IPMI (Intelligent Platform Management Interface) I was able to implement entirely in user space, including its device drivers (@novak GNU FreeIPMI; FYI I gave up my maintainership to LLNL after we started Gluster). There is no reason for the OpenIPMI project to be in kernel mode. FreeIPMI powers some of the top supercomputers in the world. It is inevitable that the Linux kernel will have to adapt to the modern computing environment.

    @Manhong Thanks for the positive comment.

    @Vidar Thanks.

    @Stephan No, GlusterFS was designed to solve the large-scale unstructured data storage problem. For root filesystems (primarily holding operating system files), kernel-mode disk filesystems (like ext4 and XFS) are great. If you are running databases, which perform lots of small-block synchronous operations, kernel-mode filesystems are unbeatable. However, FUSE-based NTFS proved to be faster than kernel-based NTFS, which shows that for most use cases FUSE is good enough. There is no argument that user space programming is easier than kernel space programming. Linus has always believed that user space filesystems are a bad idea. This is not the first time he has vented. See his response when the FUSE patch was submitted for merging: http://lwn.net/Articles/112414/ . It is frustrating to hear that FUSE-based filesystems are only good for toys and that we are misguided.

    @jim I don't know much about Plan9, so I cannot comment on it. Hurd is not about reimplementing Mach using GNU tools. The FSF got the rights to Mach from CMU and derived GNU Mach from it. GNU Mach acted as GNU Hurd's microkernel. GNU Hurd is a multi-server OS on top of the GNU Mach microkernel; the bulk of the operating system functionality was handled at the Hurd layer. Linux's success is largely because of Linus's leadership. Hurd failed mostly because of GNU Mach and a lack of leadership. Mach is like a 98-year-old grandma with no health insurance. You can still learn valuable lessons from her experience. Success = Failure v3.0. Hurd would have succeeded if the FSF had forked a version of the Linux kernel into a microkernel. Actually, Thomas Bushnell (the Hurd architect) initially planned to use the 4.4BSD-Lite kernel as his microkernel. Because of a lack of cooperation from the Berkeley guys, RMS proposed Mach. Now we know it was a mistake. I realized Mach was going nowhere when I worked on an IPC interposing framework and e1000 driver porting. It looked as if it would be easier to fix a working Mach base than to move to L4-like microkernels. Mach's IPC layer was great for computer science textbooks, but not for the real world. Mach's abstractions were unnecessary. Hurd envisioned a microkernel object model portable across microkernels. It all sounds great only in theory. Hurd would have been great on multi-core, virtualization-friendly 64-bit CPUs. It is OK to be inspired by Hurd and bring its best to the GNU/Linux OS. If you are interested, read this paper: Towards a New Strategy of OS Design, an architectural overview by Thomas Bushnell: http://www.gnu.org/software/hurd/hurd-paper.html

    At the end of the day, GlusterFS has a lot of production deployments. It solves real-world problems without worrying too much about minor, irrelevant details. It is also free software.

  16. Way back in the dim mists of time, when microkernels were hot, Linus was having flame wars with Andrew Tanenbaum, and the HURD actually sounded like it might someday be usable, I worked on the IBM microkernel project writing filesystem benchmarks. The idea was to build OS/2 (!), AIX, and other OS personalities on top of Mach, as multiple user-space servers.

    There were a lot of things wrong with the implementation (getpid() required a trip through Mach to an OS server), but the one thing that was a continual problem was filesystem performance. At one point, reading a page off disk was CPU bound. The issue wasn’t context switches, really, which can themselves be made arbitrarily fast, but carrying pages of data along with the context switch. The original copy-on-write turned out to be a pessimization—a very-well-written memcpy took a little less time to copy a page than the vm system took to set up the page share and copy-on-write. Then, if you eventually had to copy the page….

    I never figured out what was going on with the virtual memory system, but I've seen the same issue over and over. The MkLinux paper at the (1st and only, AFAIK) FSF conference is one of my favorite performance papers ever. The table showing performance comparisons looks good, as long as you didn't notice that the top third was dhrystones; MkLinux didn't actually slow the processor down. The Scout OS, a good research uK OS, reported excellent performance, but didn't separate user-space processes into different memory domains. The numbers reported by the security-enhanced follow-on, whose name I can't remember but which did use multiple memory domains, were much worse. If you read between the lines of that paper comparing it to Linux, it would still have been a small multiple slower than Linux if it only used 2 protection domains instead of 3 for filesystem accesses.

    I’m as big a fan of user-level filesystems as anyone. They provide a neat and protected way of doing things that would otherwise not be as clean. On the other hand, to say “user space advantages far outweigh kernel space advantages” as a general rule is a pretty thoroughly blinkered view, too.

    Oh, and if you want the references to the MkLinux, Scout, and whatever-that-other-one-was papers, I’ll try to round them up.
