What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising—We’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze the advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of data transfer rate of 50 MB/s. To determine the available bisectional bandwidth per host, we’ll divide the cluster wide bisectional bandwidth by the number of hosts.

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained with contemporary hadoop nodes from today with 12 SATA disks and a 10GbE network interface. After we leave the host and arrive at the network bisection, the challenge facing Google engineers is immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone lead to the development of the MapReduce locality optimization.

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.


When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.


Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

One key example are large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that even with these abilities it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components, for whom an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest and greatest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload dedicated clusters, with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.


Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these modalities has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.

Why Spark on Ceph? (Part 3 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we in the Red Hat Storage solutions architecture team wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (this blog post)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series, providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Findings summary

We did Ceph vs. HDFS testing with a variety of workloads (see blog Part 2 of 3 for general workload descriptions). As expected, the price/performance comparison varied based on a number of factors, summarized below.

Clearly, many factors contribute to overall solution price. As storage capacity is frequently a major component of big data solution price, we chose it as a simple proxy for price in our price/performance comparison.

The primary factor affecting storage capacity price in our comparison was the data durability scheme used. With 3x replication data durability, a customer needs to buy 3PB of raw storage capacity to get 1PB of usable capacity. With erasure coding 4:2 data durability, a customer only needs to buy 1.5PB of raw storage capacity to get 1PB of usable capacity. The primary data durability scheme used by HDFS is 3x replication (support for HDFS erasure coding is emerging, but is still experimental in several distributions).  Ceph has supported either erasure coding or 3x replication data durability schemes for years. All Spark-on-Ceph early adopters we worked with are using erasure coding for cost efficiency reasons. As such, most of our tests were run with Ceph erasure coded clusters (we chose EC 4:2). We also ran some tests with Ceph 3x replicated clusters to provide apples-to-apples comparison for those tests.

Using the proxy for relative price noted above, Figure 1 provides an HDFS v. Ceph price/performance summary for the workloads indicated:

Figure 1: Relative price/performance comparison, based on results from eight different workloads

Figure 1 depicts price/performance comparisons based on eight different workloads. Each of the eight individual workloads was run with both HDFS and Ceph storage back-ends. The storage capacity price of the Ceph solution relative to the HDFS solution is either the same or 50% less. When the workload was run with Ceph 3x replicated clusters, the storage capacity price is shown as the same as HDFS. When the workload was run with Ceph erasure coded 4:2 clusters, the Ceph storage capacity price is shown as 50% less than HDFS. (See the previous discussion on how data durability schemes affect solution price.)

For example, workload 8 had similar performance with either Ceph or HDFS storage, but the Ceph storage capacity price was 50% of the HDFS storage capacity price, as Ceph was running an erasure coded 4:2 cluster. In other examples, workloads 1 and 2 had similar performance with either Ceph or HDFS storage and also had the same storage capacity price (workloads 1 and 2 were run with a Ceph 3x replicated cluster).

Findings details

A few details are provided here for the workloads tested with both Ceph and HDFS storage, as depicted in Figure 1.

  1. This workload was a simple test to compare aggregate read throughput via TestDFSIO. As shown in Figure 2, this workload performed comparably between HDFS and Ceph, when Ceph also used 3x replication. When Ceph used erasure coding 4:2, the workload performed better than either HDFS or Ceph 3x for lower numbers of concurrent clients (<300). With more client concurrency, however, the workload performance on Ceph 4:2 dropped due to spindle contention (a single read with erasure coded 4:2 storage requires 4 disk accesses, vs. a single disk access with 3x replicated storage.)

    Figure 2: TestDFSIO read results
  2. This workload compared the SparkSQL query performance of a single-user executing a series of queries (the 54 TPC-DS queries, as described blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2.

    Figure 3: Single-user Spark query set results
  3. This workload compared Impala query performance of 10-users each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was 57% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  4. This mixed workload featured concurrent execution of a single-user running SparkSQL queries (54), 10-users each running Impala queries (54 each), and a data set merge/join job enriching TPC-DS web sales data with synthetic clickstream logs. As illustrated in Figure 1, the aggregate execution time of this mixed workload on Ceph EC4:2 was 48% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  5. This workload was a simple test to compare aggregate write throughput via TestDFSIO. As depicted in Figure 1, this workload performed, on average, 50% slower on Ceph EC4:2 compared to HDFS, across a range of concurrent client/writers. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  6. This workload compared SparkSQL query performance of a single-user executing a series of queries (the 54 TPC-DS queries, as described blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2. However, price/performance was nearly comparable when running against Ceph EC4:2, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  7. This workload featured enrichment (merge/join) of TPC-DS web sales data with synthetic clickstream logs, and then writing the updated web sales data. As depicted in Figure 4, this workload was 37% slower on Ceph EC4:2 compared to HDFS. However, price/performance was favorable for Ceph, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.

    Figure 4: Data set enrichment (merge/join/update) job results
  8. This workload compared the SparkSQL query performance of 10-users each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was roughly comparable to that of HDFS, despite requiring only 50% the storage capacity costs. Price/performance for this workload thus favors Ceph by 2x. For more insight into this workload performance, see Figure 5. In this box-and-whisker plot, each dot reflects a single SparkSQL query execution time. As each of the 10-users concurrently executes 54 queries, there are 540 dots per series. The three series shown are Ceph EC4:2 (green), Ceph 3x (red), and HDFS 3x (blue). The Ceph EC4:2 box shows comparable median execution times to HDFS 3x, and shows more consistent query times in the middle 2 quartiles.
Figure 5: Multi-user Spark query set results

Bonus results section: 24-hour ingest

One of our prospective Spark-on-Ceph customers recently asked us to illustrate Ceph cluster sustained ingest rate over a 24-hour time period. For these tests, we used variations of the lab as described in blog 2 of 3. As noted in Figure 6, we measured a raw ingest rate of approximately 1.3 PiB per day into a Ceph EC4:2 cluster configured with 700 HDD data drives (Ceph OSDs).

Figure 6: Daily data ingest rate into Ceph clusters of various sizes

Concluding observations

In conclusion, below is our formative cost/benefit analysis of the above results summarizing this blog series.

  • Benefits, Spark-on-Ceph vs. Spark on traditional HDFS:
    1. Reduce CapEx by reducing duplication: Reduce PBs of redundant storage capacity purchased to store duplicate data sets in HDFS silos, when multiple analytics clusters need access to the same data sets.
    2. Reduce OpEx/risk: Eliminate costs of scripting/scheduling data set copies between HDFS silos, and reduce risk-of-human-error when attempting to maintain consistency between these duplicate data sets on HDFS silos, when multiple analytics clusters need access to the same data sets.
    3. Accelerate insight from new data science clusters: Reduce time-to-insight when spinning-up new data science clusters by analyzing data in-situ within a shared data repository, as opposed to hydrating (copying data into) a new cluster before beginning analysis.
    4. Satisfy different tool/version needs of different data teams: While sharing data sets between teams, enable users within each cluster to choose the Spark/Hadoop tool sets and versions appropriate to their jobs, without disrupting users from other teams requiring different tools/versions.
    5. Right-size CapEx infrastructure costs: Reduce over-provisioning of either compute or storage common with provisioning traditional HDFS clusters, which grow by adding generic nodes (regardless if only more CPU cores or storage capacity is needed), by right-sizing compute needs (vCPU/RAM) independently from storage capacity needs (throughput/TB).
    6. Reduce CapEx by improving data durability efficiency: Reduce CapEx of storage capacity purchased by up to 50% due to Ceph erasure coding efficiency vs. HDFS default 3x replication.
  • Costs, Spark-on-Ceph vs. Spark on traditional HDFS:

    1. Query performance: Performance of Spark/Impala query jobs ranged from 0%-131% longer execution times (single-user and multi-user concurrency tests).
    2. Write-job performance: Performance of write-oriented jobs (loading, transformation, enrichment) ranged from 37%-200%+ longer execution times. [Note: Significant improvements in write-job performance are expected when downstream distributions adopt the following upstream enhancements to the Hadoop S3A client HADOOP-13600, HADOOP-13786, HADOOP-12891].
    3. Mixed-workload Performance: Performance of multiple query and enrichment jobs concurrently executed resulted in 90% longer execution times.

For more details (and a hands-on chance to kick the tires of this solution yourself), stay tuned for the architect-level blog series in this same Red Hat Storage blog location. Thanks for reading.

Why Spark on Ceph? (Part 2 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph (Part 3 of 3)“)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store.  In other words, target data locations for their analytics jobs are something like “s3://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”.  Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible via the Hadoop S3A client.

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs which were representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web. Its schema has 10s of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for out tests. With partners in industry, we also supplemented the TPC-DS data set with simulated click-stream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain text data into Parquet or ORC columnar, compressed formats)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes. 1000s of analytics jobs described above completed successfully. SparkSQL, Hive, MapReduce, and Impala jobs all using the S3A client to read and write data directly to a shared Ceph object store. The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line – what was the performance compared to traditional HDFS? For the answer, continue to Part 3 of this series….

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series. In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on-time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with their own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But, the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is cost prohibitive to many companies (both CapEx and OpEx).

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems


Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client. In AWS, you can spin-up many analytics clusters on EC2 instances, and share data sets between them on Amazon S3 (e.g. see Cloudera CDH support for Amazon S3). No more lengthy delays hydrating HDFS storage after spinning-up new clusters, or de-staging HDFS data upon cluster termination. With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.  

Bottom-line … more-and-more data scientists and analysts are accustomed to spinning-up analytic clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-stage cyles, and expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage. It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

To learn more, continue to the next post in this series, “Why Spark on Ceph? (Part 2 of 3)”Will mainstream analytics jobs run directly against a Ceph object store?

Introducing Red Hat Storage One

More than a year ago, our Storage Architecture team set out to answer the question of how we can overcome the last barriers to software-defined storage (SDS) adoption. We know from our thousands of test cycles and hundreds of hours of data analysis that a properly deployed Gluster or Ceph system can easily compete with—and often surpass—the feature and performance capabilities of any proprietary storage appliance, usually at a fraction of the cost based on our experience and rigorous study. We have many customer success stories to back up these claims with real-word deployments for enterprise workloads. However, one piece of feedback is consistent and clear: Not every customer is willing or prepared to commit the resources necessary to architect a system to the best standards for their use case. The barrier to entry is simply higher than for a comparative proprietary appliance that is sold in units of storage with specific workload-based performance metrics. The often-lauded flexibility of SDS is, in these cases, its Achilles’ heel.

Continue reading “Introducing Red Hat Storage One”

Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform

By  Annette Clewett, Humble Chirammal, Daniel Messer, and Sudhir Prasad

With today’s release of Red Hat OpenShift Container Platform 3.9, you will now have the convenience of deploying Container-Native Storage (CNS) 3.9 built on Red Hat Gluster Storage as part of the normal OpenShift deployment process in a single step. At the same time, major improvements in ease of operation have been introduced to give you the ability to monitor provisioned storage consumption, expand persistent volume (PV) capacity without downtime, and use a more intuitive naming convention for persistent volume names.

Continue reading “Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform”

The third one’s a charm

By Federico Lucifredi, Red Hat Storage



Red Hat Ceph Storage 3 is our annual major release of Red Hat Ceph Storage, and it brings great new features to customers in the areas of containers, usability, and raw technology horsepower. It includes support for CephFS, giving us a comprehensive, all-in-one storage solution in Ceph spanning block, object, and file alike. It introduces iSCSI support to provide storage to platforms like VMWare ESX and Windows Server that currently lack native Ceph drivers. And we are introducing support for client-side caching with dm-cache.

On the usability front, we’re introducing new automation to manage the cluster with less user intervention (dynamic bucket sharding), a troubleshooting tool to analyze and flag invalid cluster configurations (Ceph Medic), and a new monitoring dashboard (Ceph Metrics) that brings enhanced insight into the state of the cluster.

Last, but definitely not least, containerized storage daemons (CSDs) drive a significant improvement in TCO through better hardware utilization.

Containers, containers, never enough containers!

We graduated to fully supporting our Ceph distribution running containerized in Docker application containers earlier in June 2017 with the 2.3 release, after more than a year of open testing of tech preview images.

Red Hat Ceph Storage 3 raises the bar by introducing colocated CSDs as a supported configuration. CSDs drive a significant TCO improvement through better hardware utilization; the baseline object store cluster we recommend to new users spans 10 OSD storage nodes, 3 MON controllers, and 3 RGW S3 gateways. By allowing colocation, the smaller MON and RGW nodes can now run colocated on the OSDs’ hardware, allowing users to avoid not only the capital expense of those servers but the ongoing operational cost of managing those servers. Pricing that configuration using a popular hardware vendor, we estimate that users could experience a 24% hardware cost reduction or, in alternative, add 30% more raw storage for the same initial hardware invoice.

“All nodes are storage nodes now!”

We are accomplishing this improvement by colocating any of the Ceph scale-out daemons on the storage servers, one per host. Containers reserve RAM and CPU allocations that protect both the OSD and the co-located daemon from resource starvation during rebalancing or recovery load spikes. We can currently colocate all the scale-out daemons except the new iSCSI gateway, but we expect that in the short term MON, MGR, RGW, and the newly supported MDS should take the lion’s share of these configurations.

As my marketing manager is found of saying, all nodes are storage nodes now! Just as important, we can field a containerized deployment using the very same ceph-ansible playbooks our customers are familiar with and have come to love. Users can conveniently learn how to operate with containerized storage while still relying on the same tools—and we continue to support RPM-based deployments. So if you would rather see others cross the chasm first, that is totally okay as well—You can continue operating with RPMs and Ansible as you are accustomed to.

CephFS: now fully supportawesome

The Ceph filesystem, CephFS for friends, is the Ceph interface providing the abstraction of a POSIX-compliant filesystem backed by the storage of a RADOS object storage cluster. CephFS achieved reliability and stability already last year, but with this new version, the MDS directory metadata service is fully scale-out, eliminating our last remaining concern to its production use. In Sage Weil’s own words, it is now fully awesome!

“CephFS is now fully awesome!” —Sage Weil

With this new version, CephFS is now fully supported by Red Hat. For details about CephFS, see the Ceph File System Guide for Red Hat Ceph Storage 3. While I am on the subject, I’d like to give a shout-out to the unsung heroes in our awesome storage documentation team: They have steadily introduced high-quality guides with every release, and our customers are surely taking notice.

iSCSI and NFS: compatibility trifecta

Earlier this year, we introduced the first version of our NFS gateway, allowing a user to mount an S3 bucket as if it was an NFS folder, for quick bulk import and export of data from the cluster, as literally every device out there speaks NFS natively. In this release, we’re enhancing the NFS gateway with support for NFS v.3 alongside the existing NFS v.4 support.

The remaining leg of our legacy compatibility plan is iSCSI. While iSCSI is not ideally suited to a scale-out system like Ceph, the use of multipathing for failover makes the fit smoother than one would expect, as no explicit HA is needed to manage failover.

With Red Hat Ceph Storage 3, we’re bringing to GA the iSCSI gateway service that we have been previewing during the past year. While we continue to favor the LibRBD interface as it is more featureful and delivers better performance, iSCSI makes perfect sense as a fall-back to connect VMWare and Windows servers to Ceph storage, and generally anywhere a native Ceph block driver is not yet available. With this initial release, we are supporting VMWare ESX 6.5, Windows Server 2016, and RHV 4.x over an iSCSI interface, and you can expect to see us adding more platforms to the list of supported clients next year as we plan to increase the reach of our automated testing infrastructure.

¡Arriba, arriba! ¡Ándale, ándale!

Red Hat’s famous Performance and Scale team has revisited client-side caching tuning with the new codebase and blessed an optimized configuration for dm-cache that can now be easily configured with Ceph-volume, the new up-and-coming tool that is slated by the Community to eventually give the aging ceph-disk a well-deserved retirement.

Making things faster is important, but equally important is insight into performance metrics. The new dashboard is well deserving of a blog on its own right, but let’s just say that it plainly makes available a significant leap in performance monitoring to Ceph environments, starting with the cluster as a whole and drilling into individual metrics or individual nodes as needed to track down performance issues. Select users have been patiently testing our early builds with Luminous this summer, and their consistently positive feedback makes us confident you will love the results.

Linux monitoring has many flavors, and while we supply tools as part of the product, customers often want to integrate their existing infrastructure, whether it is Nagios alerting in very binary tones that something seems to be wrong, or another tool. For this reason, we joined forces with our partners at Datadog to introduce a joint configuration for SAAS monitoring of Ceph using Datadog’s impressive tools.

Get the stats

More than 30 features are landing in this release alongside our rebasing of the enterprise product to the Luminous codebase. These map to almost 500 bugs in our downstream tracking system and hundreds more upstream in the Luminous 12.2.1 release we started from. I’d like to briefly call attention to about 20 of them that our very dedicated global support team prioritized for us as the most impactful way to further smooth out the experience of new users and move forward on our march toward making Ceph evermore enterprise-ready and easy to use. This is our biggest release yet, and its timely delivery 3 months after the upstream freeze is an impressive achievement for our Development and Quality Assurance engineering teams.

As always, those of you with an insatiable thirst for detail should read the release notes next—and feel free to ping me on Twitter if you have any questions!

The third one’s a charm

Better economics through improved hardware utilization, great value-add for customers in the form of new access modes in file, iSCSI, and NFS compatibility, joined by improved usability and across-the-board technological advancement are the themes we tried to hit with this release. I think we delivered… but we aren’t done yet. We plan to send more stuff your way this year! Stay tuned.

But if you can’t wait to hear more about the new object storage features in Red Hat Ceph Storage 3, read this blog post by Uday Boppana.

Gluster linear scaling: How to choose wisely

We talk a lot about the linear scalability of Red Hat Gluster Storage, and we can generally back that up with empirical data. Indeed, homogeneously scaling out the storage nodes and network infrastructure can result in both capacity and throughput capabilities that are directly proportional. But it’s important to note that this is potential scalability, and how you use the volumes plays a vital role in the experience you have.

We architect optimal solution recommendations based on a few expectations:

  1. Most of the workload falls into a particular category—high throughput, small file, or latency sensitive, for example.
  2. When your capacity needs grow, so do your concurrent client demands.
  3. You’re using the glusterfs native client.

Let’s take a look at these points and how they affect your real scalability.

Architecting for workload

We know through thousands of test cycle results that there is a generally optimal server configuration that will apply broadly to a majority of workloads. This compiled knowledge is a huge benefit to you, the user, and it can greatly reduce your own time commitment in designing and testing fundamental system architectures. However, just up the stack from the server and network components are low-level configuration choices that you will make for every deployment. These choices are the big knobs—Particular to your workload there is likely one best choice for peak performance. And it’s important to note that these aren’t choices you can easily change later. Changes at these layers likely require moving data, potentially more than once, and data has inertia.

When you understand your majority workload, and preferably you isolate dislike workloads entirely, you will be positioned to make choices about server density (12, 24, or higher drive capacity), block-level configurations (e.g., HDD vs. SSD, RAID vs. JBOD, caching vs. not, block and stripe sizes), and Gluster volume geometry (e.g., replicated vs. dispersed, failure resiliency, arbiter bricks, tiering). Locked into these choices and the related workload, you’ll find it reasonably simple to integrate new nodes and bricks into the volume for predictable capacity and performance expansion.

Client concurrency

So you’ve built to your workload and everything is great. That is, unless your expectations aren’t aligned with a scale-out solution. Any single connection to the storage pool is bound by physics. One client communicates over one network link to one server to one file system and block stack. Sure, some design options allow for single-client concurrency to multiple stacks, but those come at a trade off, and each connection is still bound by physics and bottlenecked somewhere along the line. So if your need is to provide expanded throughput capabilities to a single or a small number of clients, you will likely find that horizontal scale-out won’t give you much performance benefit. There are some tricks we can use to architect for such a need, but it will never be an efficient solution.

To that end, an optimal design assumes you are operating at an appropriate client:server concurrency ratio. The best ratio will vary with your workload and the architecture decisions you make per the preceding discussion, but you can expect for most cases a ratio range of 12:1 to 48:1 to be appropriate for peak or plateau storage throughput capabilities. So if you build out a 12-node storage pool based on your capacity needs and then expect 4 client systems to use that storage concurrently, you’ll bottleneck on the server node I/O stack long before you saturate the aggregate system capabilities. But with an appropriate concurrent client count of say 150+ for your 12 server nodes, you may be operating at the peak capabilities of the system.

Client choice

Great! So you’re heeding all the advice here, and you’re going to deploy 12 Red Hat Gluster Storage nodes in an optimal architecture for 150 NFS clients. Well, hold on there a minute, buckaroo. We’re more than happy to support the NFS client, but you should know what you’re getting into.

When using the Gluster native client, data placement calculations are made on the client side. This means that each client is fully aware of the volume geometry and all server nodes participating, allowing it to determine how the data protection scheme is applied and which nodes and backend filesystems (bricks) each file will be written to. All client-to-server connections are then made efficiently based on this client-side intelligence. And because data placement among the distributed system is done pseudo-randomly, there is a statistically even distribution of work between the clients and servers and therefore predictable performance scalability.

When choosing NFS (or SMB), a client will make its connections to a single Gluster server. That server then has to apply the client-side intelligence for data resilience, conversion, and placement, and it will then make secondary network calls out to each participating server node for the file transaction. This inefficiency leads to a concurrency bottleneck far below the capabilities of the native client—You’ll still hit peak throughput at about the same client:server ratio, but that throughput will be well below what can be achieved on the same systems with the native client.

The one surprise that can come up with the NFS client is that if you do indeed require a lower client:server ratio, NFS can in some conditions outperform the native client at that concurrency level. YMMV on this, and you’ll still be far below the peak capabilities of the system, but it’s worth testing out if you’re absolutely determined to connect your 4 clients to your 12 Gluster nodes (but don’t say I didn’t warn you not to).

Oh yeah? Prove it.

Lucky for you, I did that already. Take a look at our published reference architectures and, in particular, our most recent Gluster Performance and Sizing Guide. And keep an eye out here for future publications as we continue to expand and refine our data.

Container-native storage for the OpenShift masses

By Daniel Messer, Red Hat Storage


Red Hat Container-Native Storage 3.6, released today, reaches a new level of storage capabilities on the OpenShift Container Platform. Container-native storage can now be used for all the key infrastructure pieces of OpenShift: the registry, logging, and metrics services. The latter two services come courtesy of the new block storage implementation. Object storage is now also available directly to developers in the form of the well-known S3 API. Administrators will enjoy a more robust cns-deploy utility, support for online volume expansion, and more choice in deployment topologies in the OpenShift Advanced Installer. Last, but just as important, it now supports more concurrent workloads serving over 1,000 persistent volumes with just 3 nodes.


You know you must be doing something right when some of your users are looking to use your technology in different ways than expected. Initially, the idea of running GlusterFS alongside Kubernetes and OpenShift promised the ability to use a distributed storage system with a framework for distributed applications. They goes nicely together because both approaches are entirely based on scale-out software, hence independent of the underlying platform, and they are driven by a declarative API-driven design. On the GlusterFS side, that API is available in the form of an additional software daemon, called heketi. Things soon took a new direction when the first experiments of running the GlusterFS/heketi combination as an OpenShift workload were conducted.

A lot of engineering cycles later, the idea of hacking GlusterFS onto OpenShift has emerged to a fully supported product offering: container-native storage. Today, we are happy to announce container-native storage 3.6.

For the impatient: In essence, we have taken container-native storage from being an optional supplement in OpenShift to being a storage solution that now serves file, block, and object storage to applications on top of OpenShift and to the entire OpenShift internal infrastructure, as well.

For the curious reader, let’s go see how we did that….

Increase density

The first thing we had to do was ensure that container-native storage was a robust, scalable, long-term solution for the different possible OpenShift cluster sizes. When we launched container-native storage with OpenShift 3.2 last summer, the container images were based on Red Hat Gluster Storage 3.1.3 and, on average, each brick process on a GlusterFS host/pod consumed about 300 MB of RAM.

That may not sound like much, but you have to be aware that every PersistentVolume served by container-native storage results in a GlusterFS volume being created. Bricks are local directories on GlusterFS pods that make up volumes. The consistency of volumes across all its bricks (by default, 3 in container-native storage) is handled by the glusterfsd process, which is what consumes the memory.

In older releases of Red Hat Gluster Storage, there was one such process per brick on each host. It’s easy to see that with potentially hundreds of application pods in OpenShift requiring their own PersistentVolumes, the resulting number of brick processes in each GlusterFS pod will easily consume gigabytes of RAM and would create a significant effort to coordinate in each pod.

That many processes in a pod are an anti-pattern for Kubernetes and, even if we would have broken out those in separate containers, the memory overhead would still be huge.

Fortunately, Red Hat Gluster Storage 3.3 came to the rescue. Released just a little over 2 weeks ago, it introduced a new feature called brick-multiplexing. It’s easier to depict how this feature changes the structure of a GlusterFS pod in a diagram than a lengthy explanation:

With brick-multiplexing, only one glusterfsd process is governing the bricks such that the amount of memory consumption of GlusterFS pods is drastically reduced and the scalability is significantly improved.

By introducing brick-multiplexing in version 3.6, we are able to support over 1,000 PersistentVolumes in a single container-native storage cluster. The amount of memory consumed increases linearly, so that 32GB of RAM are only needed at the high end of that. The rule of thumb is roughly 30-35 MB RAM per volume on each of the participating GlusterFS pods.

Container-native storage can probably support an even greater number of volumes, and we hope to confirm that soon. Until then, you always have the option to either run more GlusterFS pods in your OpenShift cluster or deploy a second container-native storage cluster, governed by the same Heketi API service.

Optimized storage for logging/metrics

File storage is what containers on OpenShift (and in general) deal with today. It’s a ubiquitous, well-understood concept. There are also proposals for native access to block devices in pods, but they are still in design or planning phases.

That is—at least for now—storage (including block) in Kubernetes and OpenShift always ends up being a mounted file system on the host running the pod, which is then bind-mounted to the target container’s file system namespace. Block storage provisioners in OpenShift eventually format the device with XFS too, before handing it over to the container.

GlusterFS is a distributed, networked file system which, in contrast to local filesystems like XFS, allows shared access from multiple hosts and stores the data in the backend distributed across multiple nodes. This big advantage does not come without cost, however: Some type of operations that are fast and cheap on a local file system are quite expensive in a distributed file system.

For some workloads (e.g., OpenShift Logging and Metrics), this can be a show-stopper. To properly support those, we designed something that might seem counter-intuitive at first: gluster-block. Take a look at the implementation scheme below:

Yes, you see that right: We are using TCM (the Linux kernel’s iSCSI stack, also called LIO) managed by targetcli to create iSCSI LUNs from files on a GlusterFS volume and present those as block devices to pods. The TCM stack allows local storage of a Linux system to be made available on the network via the iSCSI protocol. In our specific case, the local storage is a large raw file on a GlusterFS volume. On the client side, the iSCSI block device will be formatted with XFS and then bind-mounted to the target container’s file system namespace.

But why go through all the trouble? In distributed file systems—and here GlusterFS is no exception—metadata-intensive operations like file create, file open, or extended attribute updates are particularly expensive and slow compared to a local file system. In particular, indexing solutions likes ElasticSearch (part of OpenShift Logging) and scale-out NoSQL databases like Cassandra (part of OpenShift Metrics) generate such workloads.  But also other database software might make heavy use of locking and byte-range locking, which are costly compared to simple read and writes.

In order to qualify OpenShift Metrics and Logging Services to run well on a container-native storage backend, a significant speed up was needed for a lot of special file system operations like these.

You can probably guess what we were thinking: In software, many problems can be solved by adding an additional layer of indirection.

The indirection in accessing data on GlusterFS via iSCSI instead of a normal GlusterFS mount converts otherwise expensive file system operations to a single stream of continuous reads and writes to a single raw file on GlusterFS. The TCM stack delivers this IO stream over the network via iSCSI. On the receiving end, the file in GlusterFS backing the iSCSI LUN is accessed via libgfapi, a userspace library to access files in GlusterFS without the need to mount a volume.

The clients, in our case containers in pods on OpenShift, still write to an XFS file system the iSCSI LUN is formatted with. As a result, simple client-level read and write requests remain virtually as fast as accessing the file directly on GlusterFS, but also all the other file system operations are converted into much faster reads and writes to the file backing the block volume because they are not distributed. From the perspective of GlusterFS, it’s a constant stream of basic read and write requests, which GlusterFS is efficient at. Of course, this comes with a trade-off: gluster-block is not shared storage.

Container-Native Storage version 3.6 now provides backend storage for OpenShift Logging and OpenShift Metrics with gluster-block. For the moment, the use of gluster-block in production is only supported for OpenShift Logging and Metrics services, but use of gluster-block beyond that is under qualification, and support is expected to be extended soon.

The Logging and Metrics services have strict performance and latency requirements and are important for any OpenShift cluster in production. They provide vital information and debugging capabilities for administrators. By design, they are scale-out services, because their storage backend (ElasticSearch for Logging, Cassandra for Metrics) supports a shared-nothing approach. However, in production you do not want additional shards of ElasticSearch and Cassandra run side-by-side with your application pods. That’s why there is a concept of infrastructure nodes in OpenShift that do not run business applications but are dedicated to OpenShift infrastructure components like these. Typically, these kind of servers only have storage locally available, which is limited in capacity and performance. Thus, it might quickly become insufficient to store the logs and metrics of hundred of pods. With container-native storage, you now have a scalable, robust, and long-term storage solution for logging and metrics that utilizes the entire cluster’s storage capacity.

Support a scale-out registry

There is one additional component in OpenShift that’s crucial for operations: the container image registry. This is where all the resulting images from source-to-image builds will be pushed to and where developers can upload their custom images. If it’s unavailable, those operations will fail, and users will be unable to launch new or update existing applications.

The default configuration for the OpenShift registry is to use `emptyDir` storage, that is, a local file system on the container host that depends on the registry pod’s lifetime. In this setup, the registry, of course, cannot be scaled out, updated, or restarted on another host.

Fortunately, as of version 3.5, container-native storage allows for a scale-out registry using shared storage on a PersistentVolume served by GlusterFS. This has several advantages:

  1. No external storage is required, like NFS, which can cause problems with metadata consistency with a busy registry.
  2. There is no dependency on provider storage (e.g., AWS S3 being unavailable in a VMware environment) for shared data access.
  3. The registry can now be scaled out, ideally across all infra nodes.
  4. The registry storage backend can grow dynamically with the platform.

The beauty of this is that it can be installed like this right away. Like we’ve already covered during the announcement of OpenShift Container Platform 3.6 earlier this year, the OpenShift Advanced Installer now supports deploying container-native storage and the registry on container-native storage out of the box. See this video here for details.

All you have to do since OpenShift Container Platform 3.6 is add a few lines to your Ansible inventory file.

To deploy an OpenShift registry backed by container-native storage, first add the following variable definition in the [OSEv3:vars] section:


And then add a new host group defining the container-native storage nodes to the inventory, for example:

infra-1.lab glusterfs_devices='[ "/dev/sdd" ]'
infra-2.lab glusterfs_devices='[ "/dev/sdd" ]'
infra-3.lab glusterfs_devices='[ "/dev/sdd" ]'

This is enough to tell the OpenShift Advanced Installer that it should create a basic 3-node container-native storage cluster, in this case on the infrastructure nodes, using the supplied devices to create bricks. From this cluster a PersistentVolume will be created and supplied to the registry DeploymentConfig.

That way the registry will be launched with shared storage, provided by container-native storage, and scaled to 3 instances across the infrastructure nodes. You get a highly available and robust registry out of the box with no additional configuration needed.

S3 object storage for applications

In addition to providing block and file storage services, Container-Native Storage 3.6 now provides an S3 object storage interface as a TechPreview. Application developers have a ready-to-use REST API at hand to provide object storage to workloads on OpenShift, just a HTTP PUT or GET request away.

Object storage in Red Hat Container-Native Storage 3.6 provides a simple yet scalable storage layer for distributed applications that were previously tied to specific cloud provider S3 object storage. These application now run with little or no modification on OpenShift.

In this implementation, a gluster-s3 service is deployed as a pod in your OpenShift cluster, and an OpenShift Route is generated for it. The Route’s URL is provided to applications as their S3 endpoint. The service receives the S3 requests and translates those to file system operations on GlusterFS volumes. The S3 buckets and objects are stored as directories and files on that volume, respectively.

For now, this service can be deployed with the cns-deploy utility. There are some new command switches available for this purpose:

cns-deploy topology.json --namespace gluster-storage --log-file=cns-deploy.log --object-account dmesser --object-user dmesser --object-password redhat

The new parameters allow you to specify a name for the S3 account (object-account, an aggregate of multiple S3 buckets, one per CNS cluster), a named user (object-user), and the authentication password for that user in that account (object-password). Once all of these 3 switches are presented, cns-deploy will create the glusterfs-s3 infrastructure.

Support for doing this with the OpenShift Advanced Installer is expected to follow soon. The design foresees exactly one S3 domain/account per CNS cluster, although multiple CNS clusters can be deployed easily.

Improvements for deployment and operations

Besides a whole bunch of new features, we’ve also introduced improvements in usability to make the container-native storage experience better.

In Container-Native Storage 3.6, the cns-deploy tool has been improved in a number of ways. It is now more idempotent, allowing the administrator to run the installer multiple times without having to start from scratch. There will still be error scenarios that may require manual intervention, but it should be much easier to recover from such errors. It will also deploy the required resources to use gluster-block and gluster-s3. Combined with the idempotency improvements, administrators will be able to run cns-deploy to deploy those features into an environment that’s already running container-native storage.

Container Native Storage 3.6 also provides improved integration with container-ready storage. All of our new features will work just as well on container-ready storage as container-native storage. In addition, we have introduced support for a configuration we’re calling Container-Ready Storage without Heketi. heketi is the volume management API service for GlusterFS. In this configuration, container-ready storage runs with the usual Red Hat Gluster Storage nodes outside the OpenShift cluster, but heketi resides as a pod within OpenShift. This has the advantage of making the heketi service highly available rather than residing on a single machine. For new deployments, the cns-deploy can be used to initialize a container-ready storage cluster in this configuration.

Another common scenario that is likely to occur over time, even with the short-lived nature of some workloads, is PersistentVolumes filling to capacity. This can happen when a user under-estimates the required capacity for a workload or the pod simply runs way longer than expected. In any case, heketi now allows for online volume expansion.

To take advantage of this, simply use the heketi-client on the CLI to expand the size of any given volume:

heketi-cli volume expand --volume=0e8a8adc936cd40c2df3698b2f06bba9 --expand-size=2

In the background, heketi changes the GlusterFS volume layout from a 3-way replicated to distributed-replicated. See below for a comparison from GlusterFS perspective.

Before volume expansion:

sh-4.2# gluster vol info vol_0e8a8adc936cd40c2df3698b2f06bba9

Volume Name: vol_0e8a8adc936cd40c2df3698b2f06bba9
Type: Replicate
Volume ID: 841bd097-659b-4b5d-b3ec-56bb8cc51c2f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on

After volume expansion:

sh-4.2# gluster vol info vol_0e8a8adc936cd40c2df3698b2f06bba9

Volume Name: vol_0e8a8adc936cd40c2df3698b2f06bba9 
Type: Distributed-Replicate 
Volume ID: 841bd097-659b-4b5d-b3ec-56bb8cc51c2f 
Status: Started 
Snapshot Count: 0 
Number of Bricks: 2 x 3 = 6 
Transport-type: tcp 
Options Reconfigured: transport.address-family: inet nfs.disable: on cluster.brick-multiplex: on 

Finally, with Container-Native Storage 3.6, we have expanded the amount of technical documentation available. We provide more examples of things both new and pre-existing that you can do with container-native storage, as well as detailed upgrade procedures from a variety of configurations to make sure you can get the latest set of features.


The storage play for containers is an exciting space at the moment. There are many options available for customers, and Red Hat container-native storage is unique in the way it runs natively on OpenShift and provides scalable shared file, block, and object storage to business applications and container platform infrastructure.

Red Hat Ceph Storage: Object storage performance and sizing guide

Red Hat Ceph Storage is a proven, petabyte-scale, object storage solution designed to meet the scalability, cost, performance, and reliability challenges of large-scale, media-serving, savvy organizations. Designed for web-scale object storage and cloud infrastructures, Red Hat Ceph Storage delivers the scalable performance necessary for rich media and content-distribution workloads.

While most of us are familiar with deploying block or file storage, object storage expertise is less common. Object storage is an effective way to provision flexible and massively scalable data storage without the arbitrary limitations of traditional proprietary or scale-up storage solutions. Before building object storage infrastructure at scale, organizations need to understand how to best configure and deploy software, hardware, and network components to serve a range of diverse workloads. They also need to understand the performance and scalability they can expect from given hardware, software, and network configurations.

This reference architecture/performance and sizing guide describes Red Hat Ceph Storage coupled with QCT (Quanta Cloud Technology) storage servers and networking as object storage infrastructure. Testing, tuning, and performance are described for both large-object and small-object workloads. This guide also presents the results of the tests conducted to evaluate the ability of configurations to scale to host hundreds of millions of objects.

After hundreds of hours of [Test ⇒ Tune ⇒ Repeat] exercises, this reference architecture provides empirical answers to a range of performance questions surrounding Ceph object storage, such as (but not limited to):

  • What are the architectural considerations before designing object storage?
  • What networking is most performant for Ceph object storage?
  • What does performance look like with dedicated vs. co-located Ceph RGWs?
  • How many Ceph RGW nodes do I need?
  • How do I tune object storage performance?
  • What are the recommendations for small/large object workloads?
  • What should I do? I’ve got millions of objects to store.

And the list of questions goes on. You can unlock the performance secrets of Ceph object storage for your organization with the help of the Red Hat Ceph Storage/QCT performance and sizing guide.

Storage for RHV and OCP: Two Glusters on one platform

Architecture is an interesting discipline. There are whitepapers and best practices and reference architectures to offer pristine views of what your perfect deployment should look like. And then there are budgets and timelines and business requirements to derail all of that. It’s what makes this job so interesting and challenging—hacking together the best pieces of disparate and often seemingly unrelated systems to meet goals driven by six leaders whose bonuses are met by completely different metrics.

A recent project has involved combining OpenShift Container Platform (OCP), Red Hat Virtualization (RHV), and Red Hat Gluster Storage (Gluster) into a unified system with common lifecycle operations, minimized management points, and the lowest overall footprint in terms of both capital cost and TCO. The primary storage challenge here is in creating a Gluster environment to support both RHV and its VMs as well as OCP container persistent volume requirements.

Our architectural goals include:

  • Purchase a single flexible hardware platform to serve all the storage needs
  • Segregate Gluster for RHV and Gluster for OCP into separate pools for resource allocation and to avoid possible administration snafus (such as we experienced in early testing)
  • Maintain a single-point and single-method of management—one Heketi server to rule them all
  • Containerize as much as possible to keep lifecycle maintenance atomic

Our early version of the architecture had Gluster running as container-native storage (CNS) for OCP on top of RHV while also serving storage to RHV, but this proved to introduce a chicken-and-egg problem where a single failure (such as an etcd crash) could cause a cascading outage. So our redesign involved splitting Gluster off from OCP as a stand-alone system while still being a unified storage provider and leveraging container atomicity.

The approach we wanted involved containerized Gluster running on bare-metal container hosts. Fundamentally, this is actually pretty straightforward today with pre-build Gluster containers available from the Red Hat registry. What complicated this was our desire to run two separate containerized Gluster pools on the same hardware nodes.


There’s a pretty good chance that this architecture is not explicitly supported by Red Hat. While all the components we use here are definitely supported, this particular combination is untested by our engineering, QE, and performance teams. Don’t consider anything here a recommendation for how you should run your environment, only an academic study of a possible approach to solving an interesting challenge. If you have any questions, please reach out to your Red Hat sales and support teams.

The platform

We initially wanted to build this on top of Red Hat Enterprise Linux Atomic Host, but our lab environment wasn’t setup to provision this build on our systems, so we had to go forward with RHEL plus the docker packages. For a production build, we would return to using Atomic.


Gluster containers are usually configured with host networking because they need to communicate freely with each other and need to serve storage out to other systems and containers. However, with host networking, the Gluster ports are bound to all interfaces, so it is not possible to run two Gluster containers in this mode due to port conflicts. To solve this, the networks for each Gluster pool had to be segregated.

First, a VLAN sub-interface was created on each Gluster node for the storage network interface and using VLAN ID 199. There are ifcfg files to make these persistent. So each node includes a IP on the primary interface and a IP on a VLAN sub-interface. The Switch ports for the storage network interfaces have been configured for the tagged VLAN ID 199. The 802.1q kernel module (for VLANs) was set to load at boot time on each node with a /etc/modules-load.d/8021q.conf file.

Containerized Gluster


Each Gluster container needs to exist on its own interface and subnet. So leveraging the system-level network stuff done above, the two interfaces were each attached to a docker macvlan network on each node.

docker network create -d macvlan --subnet= \

-o parent=eth1 gluster-rhv-net
docker network create -d macvlan --subnet= \

-o parent=eth1.199 gluster-ocp-net


The containers were pulled down from the Red Hat registry.

docker pull registry.access.redhat.com/rhgs3/rhgs-server-rhel7
docker pull registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7

The Gluster containers need to be privileged in order to access the /dev/sdX block devices. They also need a number of local persistent volume stores in order to ensure they start up properly each time.

The container fstab file needs a persistent mount. So first we should touch these files, otherwise the gluster-startup command in the container will fail.

touch /var/lib/heketi-{rhv,ocp}/fstab

Then we can run the containers.

docker run -d --privileged=true --net=gluster-rhv-net \

--ip=  --name=gluster-rhv-1 -v /run \

-v /home/gluster-rhv-1-root:/root:z \

-v /etc/glusterfs-rhv:/etc/glusterfs:z \

-v /var/lib/glusterd-rhv:/var/lib/glusterd:z \

-v /var/log/glusterfs-rhv:/var/log/glusterfs:z \

-v /var/lib/heketi-rhv:/var/lib/heketi:z \

-v /sys/fs/cgroup:/sys/fs/cgroup:ro \

-v /dev:/dev rhgs3/rhgs-server-rhel7
docker run -d --privileged=true --net=gluster-ocp-net \

--ip= --name=gluster-ocp-1 -v /run \

-v /home/gluster-ocp-1-root:/root:z \

-v /etc/glusterfs-ocp:/etc/glusterfs:z \

-v /var/lib/glusterd-ocp:/var/lib/glusterd:z \

-v /var/log/glusterfs-ocp:/var/log/glusterfs:z \

-v /var/lib/heketi-ocp:/var/lib/heketi:z \

-v /sys/fs/cgroup:/sys/fs/cgroup:ro \

-v /dev:/dev rhgs3/rhgs-server-rhel7

Block device assignments

Running the containers in privileged mode allows them to access all system block devices. For our particular architectural needs, we intend to use from each node only one SSD for the gluster-rhv pool and the remaining five SSDs for the gluster-ocp pool.

 Gluster Pool  Block Devices
 gluster-rhv  sdb
 gluster-ocp  sdc, sdd, sde, sdf, sdg



The persistent Heketi config is being stored in the /etc/heketi directory on one of the nodes (we’ll call it node1). First, an ssh keypair is created and placed there.

ssh-keygen -f /etc/heketi/heketi_key -t rsa -N ''

Next, the heketi.json file is created. Right now, no auth is being used — obviously don’t do this in production. Note the ssh port is 2222, which is what the Gluster containers are configured to listen on.

  "_port_comment": "Heketi Server Port Number",
  "port": "8080",

  "_use_auth": "Enable JWT authorization. Please enable for deployment",
  "use_auth": false,

  "_jwt": "Private keys for access",
  "jwt": {
    "_admin": "Admin has access to all APIs",
    "admin": {
      "key": "My Secret"
    "_user": "User only has access to /volumes endpoint",
    "user": {
      "key": "My Secret"

  "_glusterfs_comment": "GlusterFS Configuration",
  "glusterfs": {
    "_executor_comment": [
      "Execute plugin. Possible choices: mock, ssh",
      "mock: This setting is used for testing and development.",
      "      It will not send commands to any node.",
      "ssh:  This setting will notify Heketi to ssh to the nodes.",
      "      It will need the values in sshexec to be configured.",
      "kubernetes: Communicate with GlusterFS containers over",
      "            Kubernetes exec api."
    "executor": "ssh",

    "_sshexec_comment": "SSH username and private key file information",
    "sshexec": {
      "keyfile": "/etc/heketi/heketi_key",
      "user": "root",
      "port": "2222"

    "_db_comment": "Database file name",
    "db": "/var/lib/heketi/heketi.db",

    "_loglevel_comment": [
      "Set log level. Choices are:",
      "  none, critical, error, warning, info, debug",
      "Default is warning"
    "loglevel" : "debug"

SSH access

The Heketi server needs passwordless SSH access to all Gluster containers on port 2222. The public key generated above needs to be added to the authorized_keys for all of the Gluster containers. Note that we have a local persistent volume (PV) for each Gluster container’s /root directory, so this authorized_key entry was simply added to each one of those.

cat /etc/heketi/heketi_key.pub >> \


NOTE: This needs to be done for each of the root home directories for each Gluster container


The single Heketi container will run on node1. It needs access to both of the subnets, so the best thing to do is run the container in host networking mode. It also needs a few persistent volumes.

docker run -d --net=host --name=gluster-heketi \

-v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z \



Since we are running heketi-cli on the same node that we are running the Heketi container, there is a security issue we have to work through. By default, the container host cannot directly access the local container via the IP assigned to its macvlan network interface. So on the container host node1 we need to create local macvlan interfaces for each of the subnets. Use this at runtime and the /etc/rc.d/rc.local file:

/usr/sbin/ip link add macvlan0 link eth1 type macvlan mode bridge
/usr/sbin/ip addr add dev macvlan0
/usr/sbin/ifconfig macvlan0 up

/usr/sbin/ip link add macvlan1 link eth1.199 type macvlan mode bridge
/usr/sbin/ip addr add dev macvlan1
/usr/sbin/ifconfig macvlan1 up

The rc.local file in RHEL is for legacy support, so it has to be made executable and its systemd service has to be enabled.

chmod 755 /etc/rc.d/rc.local
systemctl enable rc-local.service

Heketi CLI

The heketi-cli needs to run $somewhere. For simplicity, the RPM is installed on node1. With the container running with networking in host mode, heketi is listening on localhost port 8080. Export the environment variable in order to be able to run heketi-cli commands.

export HEKETI_CLI_SERVER=http://localhost:8080

Setting up the Heketi clusters

A JSON file is populated at /root/heketi-rhv-plus-ocp-topology.json on node1. This file defines two separate Heketi clusters with their respective Gluster nodes (containers) and block devices.

    "clusters": [
            "nodes": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 1
                    "devices": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 2
                    "devices": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 3
                    "devices": [

            "nodes": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 1
                    "devices": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 2
                    "devices": [
                    "node": {
                        "hostnames": {
                            "manage": [
                            "storage": [
                        "zone": 3
                    "devices": [

This file is passed (once) to Heketi to setup the two clusters.

heketi-cli topology load --json=heketi-rhv-plus-ocp-topology.json

It’s important to note the two different clusters. It’s not (AFAIK) possible to “name” the clusters, so we have to reference them by their UUIDs. The Gluster volumes for RHV will be created on one cluster, and those orchestrated for OCP PVs will be created on a different cluster.

RHV Gluster volumes

For the purposes of RHV, two volumes were requested—one for the Hosted Engine and one for the VM storage. These were created via heketi-cli. Note the cluster ID passed to the commands.

heketi-cli volume create --size 100 --name rhv-hosted-engine \

--clusters ae2a309d02781816adfed567693221a9
heketi-cli volume create --size 1024 --name rhv-virtual-machines \

--clusters ae2a309d02781816adfed567693221a9

These can be mounted to the RHV nodes via the subnet using the Gluster native client. Example fstab entries:      /100g   glusterfs       backupvolfile-server= 0 0      /1t   glusterfs       backupvolfile-server= 0 0

OCP PV Gluster volumes

Our OCP pods are attached to the subnet to communicate with the storage. First on node1 the Heketi API port (8080) needs to be opened in the firewall.

firewall-cmd --add-port 8080/tcp
firewall-cmd --add-port 8080/tcp --permanent

Then the storage class for OCP is defined with the below YAML. Note that we aren’t currently doing any authentication (but obviously we should). You see here that we explicitly define the Heketi cluster ID for this class in order to ensure that all volumes for PVCs are created only on the Gluster pool we have identified for OCP use.

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
 name: gluster-dyn
provisioner: kubernetes.io/glusterfs
 resturl: ""
 restauthenabled: "false"
 clusterid: "74edade536c80f14486edfbabd204151"

Then the storage class is added to OCP on the master.

oc create -f glusterfs-storageclass.yaml

From this point, PVCs (persistent volume claims) made against this storage class will interface with Heketi to dynamically provision Gluster volumes to match the claim.


Auto-start containers

Docker container systemd init scripts are tricky. I’ve found that every example on the internet is either wrong, outdated, or uses an approach I don’t like.

Below is an example systemd service file for the Heketi container, which is simple and works the way we expect it to with the docker run command in the ExecStart (/etc/systemd/system/docker-container-gluster-heketi.service). NOTE: Do not daemonize (-d) the docker run command in the init script. Also, the SuccessExitStatus is important here.

Description=Gluster Heketi Container

SuccessExitStatus=0 137
ExecStartPre=-/usr/bin/docker stop gluster-heketi
ExecStartPre=-/usr/bin/docker rm gluster-heketi
ExecStart=/usr/bin/docker run --net=host --name=gluster-heketi -v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z rhgs3/rhgs-volmanager-rhel7
ExecStop=/usr/bin/docker stop gluster-heketi


Reload the systemd daemon:

systemctl daemon-reload

Enable and start the service

systemctl enable docker-container-gluster-heketi

systemctl start docker-container-gluster-heketi

Known issues and TODOs

  • Security needs to be taken into account. We’ll set up appropriate key-based authentication and JWT for Heketi. We’d also like to use role-based auth. Hopefully we’ll cover this in a future blog post.
  • Likely $other_things I haven’t realized yet, or better ways of approaching this. I’d love to hear your comments.

Library of Ceph and Gluster reference architectures – Simplicity on the other side of complexity

The Storage Solution Architectures team at Red Hat develops reference architectures, performance and sizing guides, and test drives for Gluster- and Ceph-based solutions. We’re a group of architects who perform lab validation, tuning, and interoperability development for composable storage services with target workloads on optimized server and network configurations. We seek simplicity on the other side of complexity.

At the end of this blog entry is a full library of our current publications and test drives.

In our modern era, a top company asset is pivotability. Pivotability based on external market changes. Pivotability after unknowns become known. Pivotability after golden ideas become dark alleys. For most enterprises, pivotability requires a composable technology infrastructure for shifting resources to meet changing needs. Composable storage services, such as those provided by Ceph and Gluster, are part of many companies’ composable infrastructures.

Composable technology infrastructures are most frequently described by the following attributes:

  • Open source v. closed development.
  • On-demand architectures v. fixed architectures.
  • Commodity hardware v. proprietary appliances.
  • Cross-industry collaboration v. isolated single-vendor silos.

As noted in the following figure, a few companies with large staffs of in-house experts can create composable infrastructures from raw technologies. Their large investments in in-house expertise allows them to convert raw technologies into solutions with limited pre-integration by technology suppliers. AWS, Google, and Azure are all examples of DIY businesses. A larger number of other companies, also needing composable infrastructures, rely on technology suppliers and the community for solution pre-integration and guidance to reduce their in-house expertise costs. We’ll label them “Assisted DIY.” Finally, the majority of global enterprises lack the in-house expertise for deploying these composable infrastructures. They rely on public cloud providers and pre-packaged solutions for their infrastructure needs. We’ll call them “Pre-packaged.”


The reference architectures, performance and sizing guides, and test drives produced by our team are primarily focused on the “Assisted DIY” segment of companies. Additionally, we strive to make Gluster and Ceph composable storage services available to the “Pre-packaged” segment of companies by using what we learn to produce pre-packaged combinations of Red Hat software with partner hardware targeting specific workload use cases.

We enjoy our roles at Red Hat because of the many of you with whom we collaborate to produce value.  We hope you find these guides useful.

Team-produced with partner collaboration:

Partner-produced with team collaboration:

Pre-packaged solutions:

Hands-on test drives: