
Sage Weil – Red Hat’s chief architect of Ceph and co-creator of the project, among many other credentials – recently held an “ask me anything” session on Reddit. Though you can read the whole thing for yourself here, we’ve collected the top questions and answers for your edification. Read on!

Q. from /u/weetabeex: Being really bloody complex notwithstanding, are there any plans for proper geo-replication in the near future? I am assuming this has been discussed over and over again, so I wonder: do you think the consistency semantics will need to be relaxed to make this work?

How cool would it be to start a Jewel blueprint out of your reddit (hopefully super detailed) reply?

A: There are currently two geo-replication development projects underway: a v2 of the radosgw multisite federation, and RBD journaling for geo-replication. The former will be eventually consistent (across zones), while RBD obviously needs to be point-in-time consistent at the replica.
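
To make the RBD side of this a little more concrete, here's a rough sketch (our own illustration in Python, not librbd code, and with made-up names) of how journal-based replication yields a point-in-time-consistent replica: writes are recorded in a per-image journal before they're applied, and a mirror process on the remote cluster replays that journal in order.

```python
# Conceptual sketch of journal-based image replication (our illustration, not
# the librbd API; all names are made up): writes are appended to a per-image
# journal before being applied, and a mirror process on the remote cluster
# replays that journal in commit order, so the replica always reflects some
# consistent point in time of the primary image.

class ImageJournal:
    def __init__(self):
        self.entries = []                # ordered log of (offset, data) writes

    def append(self, offset, data):
        self.entries.append((offset, data))
        return len(self.entries) - 1     # journal position acts as a commit point

class PrimaryImage:
    def __init__(self, journal):
        self.journal = journal
        self.blocks = {}

    def write(self, offset, data):
        self.journal.append(offset, data)   # journal first...
        self.blocks[offset] = data          # ...then apply locally

class MirrorDaemon:
    """Replays the primary's journal against the remote copy of the image."""
    def __init__(self, journal, replica_blocks):
        self.journal = journal
        self.replica = replica_blocks
        self.replayed = 0

    def replay(self):
        # Applying entries strictly in order keeps the replica at a
        # point-in-time-consistent (if slightly stale) state.
        while self.replayed < len(self.journal.entries):
            offset, data = self.journal.entries[self.replayed]
            self.replica[offset] = data
            self.replayed += 1

journal = ImageJournal()
primary = PrimaryImage(journal)
replica = {}
mirror = MirrorDaemon(journal, replica)

primary.write(0, b"a")
primary.write(4096, b"b")
mirror.replay()
assert replica == primary.blocks         # the replica has caught up
```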

We have also done some preliminary work to do async replication at the rados pool level. Last year we worked with a set of students at HMC to build a model for clock synchronization, verifying that we can get periodic ordering consistency points across all OSDs that could be replicated to another cluster. The results were encouraging and we have an overall architecture in mind... but we still need to put it all together.
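
The clock-synchronization idea is easier to see with a toy model. The sketch below (our own, with assumed interval and skew values; it isn't the HMC model itself) shows how a bound on clock skew lets you decide when a time interval has closed on every OSD, so that interval forms a consistent cut that could be shipped to another cluster.

```python
import time

# Toy illustration (not Ceph code) of periodic ordering consistency points:
# if every OSD's clock stays within MAX_SKEW of true time, then once a time
# interval has been over for longer than MAX_SKEW, no OSD can still be tagging
# writes with it, so that interval forms a consistent cut that could be
# replicated to another cluster. INTERVAL and MAX_SKEW are assumed values.

INTERVAL = 5.0      # seconds between consistency points
MAX_SKEW = 0.050    # assumed bound on clock skew between OSDs

def current_interval(now):
    return int(now // INTERVAL)

class ToyOSD:
    def __init__(self):
        self.writes_by_interval = {}    # interval number -> list of writes

    def apply_write(self, obj, data):
        # Each write is tagged with the interval of the OSD's local clock.
        epoch = current_interval(time.time())
        self.writes_by_interval.setdefault(epoch, []).append((obj, data))

def newest_closed_interval(now):
    # Everything strictly older than the interval containing (now - MAX_SKEW)
    # is closed on every OSD, even one whose clock runs MAX_SKEW slow, and is
    # therefore safe to ship to the remote cluster.
    return current_interval(now - MAX_SKEW) - 1
```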

Q. from /u/emkt11: How do you think Ceph will benefit from btrfs and zfs? Also, can we use the journals from a journaled file system, e.g. ext4, rather than having Ceph keep its own journal? Does newstore enable Ceph to go from 2x the number of replica writes down to just the number of replica writes for every single write? Also, what is the target timeline for Jewel?

A: btrfs and zfs both give you two big things: checksumming of all data (yay!) and copy-on-write that we can use to efficiently clone objects (e.g., for rbd snapshots). The cost is fragmentation for small I/O workloads... which costs a lot on spinning disks. I'm eager to see how this changes with widely deployed SSDs.
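
If you haven't run into copy-on-write cloning before, here's a toy model (ours, not Ceph's FileStore code) of why it makes snapshots cheap: a clone shares the original object's data, and only blocks that are overwritten afterwards get their own copies.

```python
# Toy model (ours, not FileStore code) of why copy-on-write cloning matters
# for snapshots: a clone initially shares the original object's data, and only
# blocks that are overwritten afterwards diverge. On btrfs/zfs the backend can
# get this behaviour from the filesystem; elsewhere it has to fall back to
# copying the data.

class CowObject:
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})   # block index -> bytes

    def clone(self):
        # Cheap clone: the new object starts out sharing the same block data.
        return CowObject(self.blocks)

    def write(self, index, data):
        # Writing replaces only this object's reference to the block, so a
        # previously taken clone still sees the old contents.
        self.blocks[index] = data

head = CowObject({0: b"v1"})
snap = head.clone()          # e.g. an rbd snapshot
head.write(0, b"v2")         # head diverges; the snapshot is untouched
assert snap.blocks[0] == b"v1" and head.blocks[0] == b"v2"
```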

We can't make much use of existing fs journals because they're tightly bound to the POSIX semantics and data model the file system provides... which is not what Ceph wants. We work in terms of larger transactions over lots of objects, and after several years of pounding my head against it I've decided that trying to cram that down a file system's throat is a losing battle.

Instead, newstore manages its own metadata in a key/value database (rocksdb currently) and uses a bare minimum from the underlying fs (object/file fragments). It does avoid the 2x write for new objects, but we do still double-write for small IOs (where it is less painful).
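
To illustrate the write path Sage is describing, here's a simplified sketch (our own, with assumed names and thresholds; it isn't the real NewStore code and skips large overwrites, cleanup, and error handling): metadata goes through a key/value store, data for a new object is written once to a file fragment, and small overwrites are staged in the key/value store first and applied to the fragment afterwards, which is the remaining double write.

```python
import json, os

# Simplified sketch of the write path described above (assumed names and
# threshold, not the real NewStore code; large overwrites, cleanup of WAL
# records, and error handling are all ignored). Metadata lives in a key/value
# store, object data lives in plain file fragments.

SMALL_IO = 64 * 1024   # assumed threshold for a "small" overwrite

class KeyValueDB:
    """Stand-in for rocksdb: atomically commits a batch of key/value pairs."""
    def __init__(self):
        self.data = {}

    def commit(self, batch):
        self.data.update(batch)

class ToyNewStore:
    def __init__(self, root, db):
        self.root, self.db, self.seq = root, db, 0

    def write(self, obj, offset, data):
        meta_key = "meta/" + obj
        if meta_key not in self.db.data:
            # New object: write the data to a fresh fragment first, then make
            # it visible with a metadata commit. The data is written once.
            path = os.path.join(self.root, "%s.%d" % (obj, self.seq))
            self.seq += 1
            with open(path, "wb") as f:
                f.write(data)
            self.db.commit({meta_key: json.dumps({"fragment": path})})
        elif len(data) <= SMALL_IO:
            # Small overwrite: stage a write-ahead record in the kv store
            # (first write of the data), then apply it to the existing
            # fragment (second write); this is the remaining double write.
            self.db.commit({"wal/%s/%d" % (obj, offset): data.hex()})
            frag = json.loads(self.db.data[meta_key])["fragment"]
            with open(frag, "r+b") as f:
                f.seek(offset)
                f.write(data)
```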

Newstore will be in Jewel but still marked experimental--we likely won't have confidence by then that it won't eat your data.

Q. from /u/ivotron: Nowadays startups are doing great work that, in some cases, competes against research projects from universities (without the burden of having to write papers!). Would you advise people to go to grad school when they have a specific project/idea in mind that they want to develop? In your opinion, what are the pros and cons of the academic vs. startup route?

A: I'll start by saying I have a huge bias toward free software. If the choice is between research that will result in open publications (and hopefully open sourced code, or else IMO you're doing it wrong) and a startup writing proprietary code, there's no contest. If the startup is developing open source code, it's a trickier question.

I do get frustrated that a lot of research work is poorly applied: students build a prototype that works just well enough to generate the graphs but is a long way from being something that is useful or usable by the real world. The most common end result is that the student finishes their degree, the code is thrown away, and some proprietary software shop takes any useful ideas and incorporates them into their product line (and tries to hire the student). Working for a startup forces you to create something that is viable and useful to real customers, and if it's open source delivers real value to the industry.

This is probably a good time to plug CROSS, the new Center for Research in Open Source Systems at UCSC (https://cross.soe.ucsc.edu). One of the key ideas here is to bridge the gap between what students do for their graduate research and what is needed for an open source project to survive in the wild, via an incubation/fellowship program. It's a unique approach to bringing the fruits of investment in research into the open source community, and I'm really excited that the program is now officially off the ground!

Q. from /u/optimusC: What is the largest size of Ceph cluster you've seen so far in production today?

A: The largest I've worked with was ~1300 OSDs. The largest I've heard of was CERN's ~7000 OSD test they did a few months back.

Right now our scaling issues are around OSD count. You can build much larger clusters (by an order of magnitude) by putting OSDs on top of RAID groups instead of individual disks, but we mostly haven't needed to do this yet.

Q. from /u/nigwil: What needs to be added to Ceph to allow it to replace Lustre for HPC workloads?

A: Possibly RDMA? XioMessenger is coming along so maybe that will kickstart HPC interest.

The largest friction we've seen in the HPC space is that all of the hardware people own is bought with Lustre's architecture in mind: it's all big disk arrays with hardware RAID and very expensive. It's needed for Lustre because it is scale-out but not replicated--each array is fronted by a failover pair of OSTs.

Ceph is designed to use more commodity hardware and do its own replication.

Putting a 'production ready' stamp on CephFS will help, but for HPC that is a bit silly--the thing preventing us from doing that is the lack of an fsck tool, which Lustre has never had.

Q. from /u/bstillwell: What new storage technologies (NVMe, SMR, kinetic drives, ethernet drives, etc.) excite you most? Why?

A: NVMe will be big, but it's a bit scary because it's not obvious what we will need to change and rearchitect to use it most effectively.

SMR is annoying because we've been hearing about it for years but there's still nothing very good for dealing with it. The best idea I've heard so far would push the allocator partly into the drive, so that you'd say "write these blocks somewhere" and the ack would tell you where they landed. There are some libsmr-type projects out there that are promising, and I'd love to see these linked into a Ceph backend (like NewStore, where they'd fit pretty easily!).
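
To illustrate that "write these blocks somewhere" idea, here's a hypothetical interface sketch (not libsmr or any real drive command set): the host hands the drive some data, the drive appends it at its write pointer, and the acknowledgement carries the address, which the host then records in its own metadata.

```python
# Hypothetical "let the drive place the data" interface (not libsmr or any
# real SMR command set): the host asks the drive to append data somewhere,
# and the acknowledgement carries the address it landed at, which the host
# records in its own metadata.

class ToySMRDrive:
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.write_pointer = 0          # sequential-only write pointer

    def write_anywhere(self, data):
        """'Write these blocks somewhere'; the return value is the ack that
        says where they landed."""
        nblocks = (len(data) + self.block_size - 1) // self.block_size
        lba = self.write_pointer
        self.write_pointer += nblocks   # the drive picks the placement
        return lba, nblocks

class HostObjectMap:
    """Host-side metadata (e.g. the kv store in a NewStore-like backend)
    recording where the drive said each object ended up."""
    def __init__(self, drive):
        self.drive = drive
        self.extents = {}               # object name -> (lba, nblocks)

    def put(self, name, data):
        self.extents[name] = self.drive.write_anywhere(data)
```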

Ethernet drives are really exciting, as they are exactly what we had in mind when we designed and architected Ceph. There is a big gap between the prototype devices (which we've played with, and they work!) and being able to buy them in quantity, though, which still makes my brain hurt. There are a few things we can/should do in Ceph to make this story more compelling (aarch64 builds coming soon!) but mostly it's a waiting game, it seems.

Kinetic drives are cool in the same sense that ethernet drives are, except that they've fixed on an interface that Ceph must consume...which means we still need servers sitting in front of them. We have some prototype support in Ceph but the performance isn't great because the places we use key/value APIs assume lower latency...but I think we'll be able to plug them into NewStore more effectively. We'll see!