Ceph Deployment at Target: Best Practices and Lessons Learned


In October 2014, the first Ceph environment at Target, the U.S.-based international chain of department stores, went live. In this insightful slide show (embedded at the end of this post), Will Boege, Sr. Technical Architect at Target, walks through the process, highlighting challenges faced and lessons learned in Target’s first ‘official’ OpenStack release.

Ceph was selected to replace the traditional array-based approach that was implemented in a prototype Havana environment. Will outlines the criteria for the move in four succinct bullets:

  • The traditional storage model was problematic to integrate
  • Maintenance and purchase costs from array vendors could become prohibitive
  • Traditional storage area networks just didn’t “feel” right in his space
  • Ceph integrated tightly with OpenStack

Ceph was to be used for:

  • RBD for OpenStack instances and volumes
  • RADOS Gateway (RGW) for object storage
  • RBD backing Ceilometer MongoDB volumes

Initial deployment included three monitor nodes, twelve OSD nodes, two 10GbE links per host, and a basic LSI ‘MegaRAID’ controller.
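
For readers who haven’t touched RBD directly, the sketch below shows roughly what “RBD for OpenStack instances and volumes” means at the API level, using the standard python-rados and python-rbd bindings; the pool and image names are placeholders for illustration, not details from Target’s deployment.

    # Minimal sketch, using the standard python-rados and python-rbd bindings,
    # of the block-storage primitive behind OpenStack instances and volumes.
    # Assumes /etc/ceph/ceph.conf points at a reachable cluster; the pool name
    # 'volumes' and image name 'demo-vol' are placeholders, not Target's setup.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('volumes')            # pool backing volumes
        try:
            rbd.RBD().create(ioctx, 'demo-vol', 4 * 1024**3)  # 4 GiB image
            with rbd.Image(ioctx, 'demo-vol') as image:
                image.write(b'hello ceph', 0)            # block write at offset 0
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

In a real OpenStack deployment this plumbing is driven by Glance, Cinder, and Nova rather than called by hand; the point is simply that images, volumes, and instance disks end up as RBD images striped across the cluster’s OSDs.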

After the rollout it became evident that there were performance issues within the environment, issues that led to the first lesson learned: instrument your deployment. Compounding the performance issues were mysterious reliability issues, which led to the second lesson learned: do your research on the hardware your server vendor provides. And this was just the beginning.

To learn more about Ceph, or to take a free test drive, please visit the Red Hat Ceph Storage homepage.

Ceph’s on the move… and coming to a city near you!

What’s that saying: Ya can’t keep a good person down? Well, ya can’t keep a good technology contained—and that’s why Ceph’s been appearing at venues across the globe.

Ceph Day just hit Chicago

Most recently—this past August—Ceph made its way to Chicago, home of Chicago-style pizza and hot dogs, a place known worldwide for its Prohibition-era ruckus as well as its present-day spirits and brews. There, at Ceph Day Chicago, Chris Jones, Senior Cloud Infrastructure Architect (DevOps) at Bloomberg, explained how Ceph helps power storage at the financial giant.

After that, Paul von Stammwitz of Fujitsu led a session about bringing Ceph storage to the enterprise, and Red Hat’s director of community, Patrick McGarry, rounded out the day with an update on the Ceph ecosystem.

For the full list of the event’s attendees and topics, check out the agenda.

Our next stop is Cupertino


So what’s the next stop on the Ceph tour? Red Hat Storage Day Cupertino! Slated for October 15 at Seagate headquarters, the event promises a full slate of Ceph-related information, including:

  • Why software-defined storage is driving a shift in the industry
  • How Red Hat Storage solutions are suited for enterprise workloads
  • What deployment practices work best and where they’ve worked in the real world

To learn more about this event and to register, click here. Want to learn more about Red Hat Storage solutions? Click here to learn more about Red Hat Gluster Storage, and test drive it for free here. To learn more about Red Hat Ceph Storage, click here, and test drive it—also for free—here.

Happy storing! Hope to see you in Cupertino!

The Top 5 Q&As From Sage Weil’s Recent Reddit AMA


Sage Weil, Red Hat’s chief architect and co-creator of Ceph – among many other credentials – recently held an “ask me anything” session on Reddit. Though you can read the whole thing for yourself here, we’ve collected the top questions and answers for your edification. Read on!

Q. from /u/weetabeex: Being really bloody complex notwithstanding, are there any plans for proper geo-replication in the near future? I am assuming this has been discussed over and over again, so I wonder: do you think the consistency semantics will need to be relaxed to make this work?

How cool would it be to start a Jewel blueprint out of your reddit (hopefully super detailed) reply?

A: There are currently two geo-replication development projects underway: a v2 of the radosgw multisite federation, and RBD journaling for geo-replication. The former will be eventually consistent (across zones), while RBD obviously needs to be point-in-time consistent at the replica.

We have also done some preliminary work to do async replication at the rados pool level. Last year we worked with a set of students at HMC to build a model for clock synchronization, verifying that we can get periodic ordering consistency points across all OSDs that could be replicated to another cluster. The results were encouraging and we have an overall architecture in mind… but we still need to put it all together.
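
As a purely conceptual illustration of that pool-level idea (periodic, agreed-upon consistency points that a second cluster could be brought up to), here is a toy sketch built on pool snapshots from the python-rados bindings. It is not the prototype described above; the pool name, snapshot naming, and interval are all invented, and a real implementation would also need the clock-synchronization and shipping machinery Sage mentions.

    # Toy sketch only: periodically mark a pool-wide "consistency point" with a
    # timestamped pool snapshot that a separate (hypothetical) agent could diff
    # and ship to a remote cluster. Not the actual Ceph async-replication work.
    import time
    import rados

    INTERVAL = 300  # seconds between consistency points (arbitrary choice)

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')       # placeholder pool name

    try:
        while True:
            snap_name = 'geo-ckpt-%d' % int(time.time())
            ioctx.create_snap(snap_name)       # pool snapshot = ordering point
            print('created consistency point', snap_name)
            # A replication agent would diff against the previous checkpoint
            # and send the changes to the secondary cluster here.
            time.sleep(INTERVAL)
    finally:
        ioctx.close()
        cluster.shutdown()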

Q. from /u/emkt11: How do you think Ceph will benefit from btrfs and zfs? Also, can we use journals from a journaled file system (e.g., ext4), rather than having Ceph keep its own journal? Does newstore enable Ceph to go from 2x the number of replica writes down to just the number of replica writes for every single write? And what is the timeline aimed for Jewel?

A: btrfs and zfs both give you two big things: checksumming of all data (yay!) and copy-on-write that we can use to efficiently clone objects (e.g., for rbd snapshots). The cost is fragmentation for small IO workloads… which costs a lot on spinning disks. I’m eager to see how this changes with widely deployed SSDs.

We can’t make much use of existing fs journals because they’re tightly bound to the POSIX semantics and data model the file system provides… which is not what Ceph wants. We work in terms of larger transactions over lots of objects, and after several years of pounding my head against it I’ve decided trying to cram that down a file system’s throat is a losing battle.

Instead, newstore manages its own metadata in a key/value database (rocksdb currently) and uses a bare minimum from the underlying fs (object/file fragments). It does avoid the 2x write for new objects, but we do still double-write for small IOs (where it is less painful).

Newstore will be in Jewel but still marked experimental – we likely won’t have confidence by then that it won’t eat your data.
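
To make that write-path tradeoff concrete, here is a toy sketch of the general idea: object metadata lives in a key/value store, a brand-new object is written once as a file fragment, and a small overwrite is logged in the key/value store before being applied (the remaining double write). All names here are invented for illustration; this is the concept, not NewStore’s actual code.

    # Toy illustration of the NewStore-style split: a dict stands in for the
    # RocksDB metadata store, plain files stand in for object fragments.
    # New objects are written once; only small overwrites pay the double write
    # (a write-ahead record in the KV store, then the in-place update).
    import os

    KV = {}                                   # stands in for RocksDB
    DATA_DIR = '/tmp/newstore-demo'
    os.makedirs(DATA_DIR, exist_ok=True)

    def write_new_object(name, data):
        frag = os.path.join(DATA_DIR, name + '.frag')
        with open(frag, 'wb') as f:           # single write of the data itself
            f.write(data)
        KV['meta/' + name] = {'fragment': frag, 'length': len(data)}

    def small_overwrite(name, offset, data):
        KV['wal/' + name] = {'offset': offset, 'data': data}  # write 1: WAL entry
        meta = KV['meta/' + name]
        with open(meta['fragment'], 'r+b') as f:              # write 2: apply it
            f.seek(offset)
            f.write(data)
        del KV['wal/' + name]                                 # WAL entry retired

    write_new_object('obj1', b'x' * 4096)     # one write, no journal copy
    small_overwrite('obj1', 128, b'patch')    # two writes, but only for small IO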

Q. from /u/ivotron: Nowadays startups are doing great work that, in some cases, competes against research projects from universities (without the burden of having to write papers!). Would you advise people to go to grad school when they have a specific project/idea in mind they want to develop? In your opinion, what are the pros and cons of the academic vs. startup route?

A: I’ll start by saying I have a huge bias toward free software. If the choice is between research that will result in open publications (and hopefully open sourced code, or else IMO you’re doing it wrong) and a startup writing proprietary code, there’s no contest. If the startup is developing open source code, it’s a trickier question.

I do get frustrated that a lot of research work is poorly applied: students build a prototype that works just well enough to generate the graphs but is a long way from being something that is useful or usable by the real world. The most common end result is that the student finishes their degree, the code is thrown away, and some proprietary software shop takes any useful ideas and incorporates them into their product line (and tries to hire the student). Working for a startup forces you to create something that is viable and useful to real customers, and if it’s open source, delivers real value to the industry.

This is probably a good time to plug CROSS, the new Center for Research in Open Source Systems at UCSC (https://cross.soe.ucsc.edu). One of the key ideas here is to bridge the gap between what students do for their graduate research and what is needed for an open source project to survive in the wild with an incubation / fellowship. It’s a unique approach to bringing the fruits of investment in research into the open source community and I’m really excited that the program is now officially off the ground!

Q. from /u/optimusC: What is the largest size of Ceph cluster you’ve seen so far in production today?

A: The largest I’ve worked with was ~1300 OSDs. The largest I’ve heard of was CERN’s ~7000 OSD test they did a few months back.

Right now our scaling issues are around OSD count. You can build much larger clusters (by an order of magnitude) by putting OSDs on top of RAID groups instead of individual disks, but we mostly haven’t needed to do this yet.

Q. from /u/nigwil: What needs to be added to Ceph to allow it to replace Lustre for HPC workloads?

A: Possibly RDMA? XioMessenger is coming along so maybe that will kickstart HPC interest.

The largest friction we’ve seen in the HPC space is that all of the hardware people own is bought with Lustre’s architecture in mind: it’s all big disk arrays with hardware RAID, and very expensive. It’s needed for Lustre because it is scale-out but not replicated – each array is fronted by a failover pair of OSTs.

Ceph is designed to use more commodity hardware and do its own replication.

Putting a ‘production ready’ stamp on CephFS will help, though for HPC that concern is a bit silly – the thing preventing us from doing that is an fsck tool, which Lustre has never had.

Q. from /u/bstillwell: What new storage technologies (NVMe, SMR, kinetic drives, ethernet drives, etc.) excite you most? Why?

A: NVMe will be big, but it’s a bit scary because it’s not obvious what we will be changing and rearchitecting to use it most effectively.

SMR is annoying because we’ve been hearing about it for years but there’s still nothing very good for dealing with it. The best idea I’ve heard so far would push the allocator partly into the drive, so that you’d say “write these blocks somewhere” and the ack would tell you where they landed. There are some libsmr-type projects out there that are promising, and I’d love to see these linked into a Ceph backend (like NewStore, where they’d fit pretty easily!).

Ethernet drives are really exciting, as they are exactly what we had in mind when we designed and architected Ceph. There is a big gap, though, between the prototype devices (which we’ve played with, and they work!) and being able to buy them in quantity, and that still makes my brain hurt. There are a few things we can/should do in Ceph to make this story more compelling (aarch64 builds coming soon!) but mostly it’s a waiting game, it seems.

Kinetic drives are cool in the same sense that ethernet drives are, except that they’ve fixed on an interface that Ceph must consume…which means we still need servers sitting in front of them. We have some prototype support in Ceph but the performance isn’t great because the places we use key/value APIs assume lower latency…but I think we’ll be able to plug them into NewStore more effectively. We’ll see!
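
The SMR idea a few lines up (“write these blocks somewhere” and let the acknowledgement tell you where they landed) is easiest to see as an interface sketch. The class below is entirely hypothetical, invented only to show the shape such an allocator-in-the-drive API might take; it is not a real library.

    # Hypothetical interface sketch: the host hands the drive a batch of blocks,
    # the drive's own allocator picks the placement, and the acknowledgement
    # reports the LBAs used. Invented API, not a real SMR library.
    class SMRZoneDrive:
        def __init__(self):
            self.write_pointer = 0             # next free block, drive-chosen
            self.blocks = {}                   # lba -> data (in-memory stand-in)

        def write_anywhere(self, data_blocks):
            """Append blocks wherever the drive decides; return the LBAs used."""
            placed = []
            for block in data_blocks:
                lba = self.write_pointer       # the drive picks the location
                self.blocks[lba] = block
                placed.append(lba)
                self.write_pointer += 1
            return placed                      # the ack tells you where they landed

    # A backend like NewStore would record the returned LBAs in its own metadata:
    drive = SMRZoneDrive()
    lbas = drive.write_anywhere([b'a' * 4096, b'b' * 4096])
    object_index = {'obj1': lbas}              # host-side object -> location map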

Check out our rich media deep-dive webinar – on demand!

Want to know more about rich media? In a recent webinar, Red Hat Senior Solutions Architect Kyle Bader took a deep dive into rich media and the unique demands it places on storage systems. We recap some highlights from the webinar here, but please register for the on-demand version here to get the full experience.


What you’ll learn
When you register, you’ll learn that building Ceph clusters to address the challenges of rich media requires its own set of architectural considerations, such as providing high throughput and cost/capacity optimization. You’ll also learn about:

  • Selecting the right hardware
  • Configuring your network optimally
  • Tuning Linux and Ceph to help achieve your goals
  • Achieving a Ceph configuration optimal for your needs
  • Identifying a tuned configuration


About Kyle

And you’re in good, knowledgeable hands in this webinar: Your speaker, Kyle, who joined Red Hat as part of the 2014 Inktank acquisition, has expertise in the design and operation of petabyte-scale storage systems using Ceph. He’s designed and implemented Ceph-based storage systems for DreamHost’s DreamObjects and DreamCompute cloud products, among others.

Take our test drive

And of course, there’s even more. Did you know that you can test drive Red Hat Ceph Storage—for free? Check it out here.

GlusterFS among the elite!


Score one more for Red Hat Storage! In case you didn’t hear, GlusterFS is the proud recipient of a 2015 Bossie Award, InfoWorld’s top picks in open source datacenter and cloud software. Highly influential worldwide among technology and business decision makers alike, the IDG publication selected GlusterFS as one of its top picks for 2015.

An increasingly attractive option

GlusterFS is an open, software-defined storage system, able to distribute data across commodity hardware. It was acquired by Red Hat in 2011 and, with Ceph, serves as a critical component in the Red Hat Storage portfolio. Together, the two technologies are an attractive alternative to traditional NAS and SAN architectures, which tend to cost significantly more and lack the flexibility required by today’s petabyte-scale workloads–workloads like enterprise virtualization, cloud infrastructure, containers, and big data analytics.

All the benefits of open source

By decoupling from physical storage hardware, open, software-defined storage allows users to tap into a lower cost, standardized supply chain. Additional benefits include:

  • Increased operational efficiency of a scale-out architecture
  • Greater programmability and control from intelligence built into the software
  • Flexibility and quality focus of an open development process through the open source community

Try it for yourself

Red Hat continues to steadfastly support GlusterFS in the open source community, including the most recent GlusterFS 3.7 release, available here. Want to try it for yourself–and for free? Register here to test drive the fully supported Red Hat Gluster Storage 3.1.

Stay tuned for more, and happy storing!

