What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and asks a question along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising: we’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and weigh its advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with the speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of a data transfer rate of 50 MB/s. To determine the available bisection bandwidth per host, we’ll divide the cluster-wide bisection bandwidth by the number of hosts.

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a balance that’s maintained in contemporary Hadoop nodes with 12 SATA disks and a 10GbE network interface. Once we leave the host and arrive at the network bisection, the challenge facing Google’s engineers is immediately obvious: a network oversubscription of 18 to 1. This constraint alone led to the development of the MapReduce locality optimization.
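A back-of-the-envelope check of the numbers from the slide makes the bottleneck explicit:

```python
# Back-of-the-envelope check of the 2004 MapReduce test cluster numbers.
hosts = 1800
disks_per_host = 2
disk_mb_per_s = 50            # generous assumption for a 160GB IDE disk
nic_mbps = 1000               # gigabit Ethernet per machine
bisection_mbps = 100 * 1000   # 100 Gb/s cluster-wide bisection bandwidth

# Aggregate local disk throughput per host: 2 x 50 MB/s = 100 MB/s = 800 Mb/s,
# roughly what a gigabit NIC can carry.
disk_aggregate_mbps = disks_per_host * disk_mb_per_s * 8

# Each host's fair share of the bisection bandwidth.
per_host_bisection_mbps = bisection_mbps / hosts   # ~55.6 Mb/s

# Oversubscription: what a host's NIC can push vs. its bisection share.
oversubscription = nic_mbps / per_host_bisection_mbps

print(f"{disk_aggregate_mbps} Mb/s of disk vs. {nic_mbps} Mb/s NIC")
print(f"oversubscription: {oversubscription:.0f}:1")
```

Reading locally at 800 Mb/s versus pulling data across a ~56 Mb/s bisection share is more than an order of magnitude difference, which is why shipping the computation to the data was the right call in 2004.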

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. It then consisted of a distributed filesystem modeled after GFS, called the Hadoop Distributed File System (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.

Advantages

When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.
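The trade-off described above can be sketched with a toy model. All numbers below are illustrative, not measurements:

```python
def faster_read(data_gb, local_gbps, net_gbps, sched_delay_s):
    """Toy model: is it faster to wait for a data-local slot, or to read
    the data over the network right away? Returns ("local"|"remote", seconds)."""
    local_s = sched_delay_s + data_gb * 8 / local_gbps    # wait, then read fast
    remote_s = data_gb * 8 / net_gbps                      # start now, read slower
    return ("local", local_s) if local_s < remote_s else ("remote", remote_s)

# A big scan amortizes the wait: locality wins.
faster_read(100, local_gbps=10, net_gbps=1, sched_delay_s=5)   # ("local", 85.0)

# A small interactive query doesn't: the scheduling delay dominates.
faster_read(1, local_gbps=10, net_gbps=1, sched_delay_s=10)    # ("remote", 8.0)
```

The crossover point depends entirely on data size, the local-versus-network bandwidth gap, and how long the scheduler makes a task wait for a local slot.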

Disadvantages

Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

A key example is large, multi-tenant clusters with resources shared across multiple teams. Yes, YARN can segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that, even so, it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components, for which an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest versions of a machine learning library, or to improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload-dedicated clusters with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.
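The Fair Scheduler addresses this tension with delay scheduling: it skips a bounded number of scheduling opportunities in the hope that a data-local slot frees up, then relaxes the constraint. A simplified sketch of the idea (not the actual YARN code):

```python
def delay_schedule(preferred_hosts, free_hosts, missed, max_misses):
    """Simplified delay scheduling: try for a data-local assignment; after
    max_misses skipped opportunities, accept any free host.
    Returns (assigned_host_or_None, updated_miss_count)."""
    local_choices = [h for h in free_hosts if h in preferred_hosts]
    if local_choices:
        return local_choices[0], 0          # locality constraint satisfied
    if missed >= max_misses and free_hosts:
        return free_hosts[0], 0             # give up on locality, run anywhere
    return None, missed + 1                 # skip this opportunity and wait

# No local slot free yet: the task waits, and job completion time grows.
delay_schedule({"node3"}, ["node1", "node2"], missed=0, max_misses=3)  # (None, 1)
```

Each `None` returned here is exactly the scheduling delay discussed above: time the task spends waiting for a host that satisfies the locality constraint.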


Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat-tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In AWS, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these network designs has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.

Why Spark on Ceph? (Part 3 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we in the Red Hat Storage solutions architecture team wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (this blog post)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series, providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Findings summary

We did Ceph vs. HDFS testing with a variety of workloads (see blog Part 2 of 3 for general workload descriptions). As expected, the price/performance comparison varied based on a number of factors, summarized below.

Clearly, many factors contribute to overall solution price. As storage capacity is frequently a major component of big data solution price, we chose it as a simple proxy for price in our price/performance comparison.

The primary factor affecting storage capacity price in our comparison was the data durability scheme used. With 3x replication, a customer needs to buy 3PB of raw storage capacity to get 1PB of usable capacity. With 4:2 erasure coding, a customer only needs to buy 1.5PB of raw storage capacity to get 1PB of usable capacity. The primary data durability scheme used by HDFS is 3x replication (support for HDFS erasure coding is emerging, but is still experimental in several distributions). Ceph has supported both erasure coding and 3x replication for years. All the Spark-on-Ceph early adopters we worked with use erasure coding for cost efficiency reasons. As such, most of our tests were run on Ceph erasure-coded clusters (we chose EC 4:2). We also ran some tests on Ceph 3x replicated clusters to provide an apples-to-apples comparison for those tests.
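The capacity arithmetic behind that choice is simple, and the same formula covers erasure code profiles other than 4:2:

```python
def raw_pb_needed(usable_pb, scheme):
    """Raw capacity required for a given usable capacity.
    scheme is ("replica", copies) or ("ec", data_chunks, coding_chunks)."""
    if scheme[0] == "replica":
        return usable_pb * scheme[1]
    _, k, m = scheme
    return usable_pb * (k + m) / k   # each k data chunks carry m coding chunks

raw_pb_needed(1, ("replica", 3))  # 3.0 PB raw per usable PB (HDFS default)
raw_pb_needed(1, ("ec", 4, 2))    # 1.5 PB raw per usable PB (Ceph EC 4:2)
```

That 3.0 versus 1.5 ratio is the source of the "50% less" storage capacity price used throughout the comparisons below.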

Using the proxy for relative price noted above, Figure 1 provides an HDFS v. Ceph price/performance summary for the workloads indicated:

Figure 1: Relative price/performance comparison, based on results from eight different workloads

Figure 1 depicts price/performance comparisons based on eight different workloads. Each of the eight individual workloads was run with both HDFS and Ceph storage back-ends. The storage capacity price of the Ceph solution relative to the HDFS solution is either the same or 50% less. When the workload was run with Ceph 3x replicated clusters, the storage capacity price is shown as the same as HDFS. When the workload was run with Ceph erasure coded 4:2 clusters, the Ceph storage capacity price is shown as 50% less than HDFS. (See the previous discussion on how data durability schemes affect solution price.)

For example, workload 8 had similar performance with either Ceph or HDFS storage, but the Ceph storage capacity price was 50% of the HDFS storage capacity price, as Ceph was running an erasure coded 4:2 cluster. In other examples, workloads 1 and 2 had similar performance with either Ceph or HDFS storage and also had the same storage capacity price (workloads 1 and 2 were run with a Ceph 3x replicated cluster).
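Using capacity price as the proxy, these relative price/performance comparisons reduce to a simple product. This is an illustrative model mirroring how the comparisons in this post are framed, not a formula from the test reports:

```python
def ceph_vs_hdfs_price_perf(runtime_ratio, price_ratio):
    """Ceph's price/performance relative to HDFS, treating job runtime as
    inverse performance and storage capacity price as the price proxy.
    1.0 = parity; lower favors Ceph.
    runtime_ratio: Ceph runtime / HDFS runtime (1.57 means 57% slower).
    price_ratio:   Ceph capacity price / HDFS capacity price (0.5 for EC 4:2)."""
    return runtime_ratio * price_ratio

ceph_vs_hdfs_price_perf(1.0, 0.5)    # workload 8: same speed, half the price
ceph_vs_hdfs_price_perf(1.57, 0.5)   # workload 3: 57% slower, half the price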

Findings details

A few details are provided here for the workloads tested with both Ceph and HDFS storage, as depicted in Figure 1.

  1. This workload was a simple test to compare aggregate read throughput via TestDFSIO. As shown in Figure 2, this workload performed comparably between HDFS and Ceph, when Ceph also used 3x replication. When Ceph used erasure coding 4:2, the workload performed better than either HDFS or Ceph 3x for lower numbers of concurrent clients (<300). With more client concurrency, however, the workload performance on Ceph 4:2 dropped due to spindle contention (a single read with erasure coded 4:2 storage requires 4 disk accesses, vs. a single disk access with 3x replicated storage.)

    Figure 2: TestDFSIO read results
  2. This workload compared the SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2.

    Figure 3: Single-user Spark query set results
  3. This workload compared Impala query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was 57% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  4. This mixed workload featured concurrent execution of a single user running SparkSQL queries (54), 10 users each running Impala queries (54 each), and a data set merge/join job enriching TPC-DS web sales data with synthetic clickstream logs. As illustrated in Figure 1, the aggregate execution time of this mixed workload on Ceph EC4:2 was 48% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  5. This workload was a simple test to compare aggregate write throughput via TestDFSIO. As depicted in Figure 1, this workload performed, on average, 50% slower on Ceph EC4:2 compared to HDFS, across a range of concurrent clients/writers. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  6. This workload compared SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2. However, price/performance was nearly comparable when running against Ceph EC4:2, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  7. This workload featured enrichment (merge/join) of TPC-DS web sales data with synthetic clickstream logs, and then writing the updated web sales data. As depicted in Figure 4, this workload was 37% slower on Ceph EC4:2 compared to HDFS. However, price/performance was favorable for Ceph, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.

    Figure 4: Data set enrichment (merge/join/update) job results
  8. This workload compared the SparkSQL query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was roughly comparable to that of HDFS, despite requiring only 50% of the storage capacity costs. Price/performance for this workload thus favors Ceph by 2x. For more insight into this workload’s performance, see Figure 5. In this box-and-whisker plot, each dot reflects a single SparkSQL query execution time. As each of the 10 users concurrently executes 54 queries, there are 540 dots per series. The three series shown are Ceph EC4:2 (green), Ceph 3x (red), and HDFS 3x (blue). The Ceph EC4:2 box shows median execution times comparable to HDFS 3x, with more consistent query times in the middle two quartiles.
Figure 5: Multi-user Spark query set results

Bonus results section: 24-hour ingest

One of our prospective Spark-on-Ceph customers recently asked us to illustrate Ceph cluster sustained ingest rate over a 24-hour time period. For these tests, we used variations of the lab as described in blog 2 of 3. As noted in Figure 6, we measured a raw ingest rate of approximately 1.3 PiB per day into a Ceph EC4:2 cluster configured with 700 HDD data drives (Ceph OSDs).
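A quick sanity check on those numbers, assuming the 700 OSDs share the load evenly and that EC 4:2 writes 1.5 raw bytes per usable byte:

```python
PIB = 2**50                   # bytes per pebibyte
ingest_pib_per_day = 1.3      # usable ingest rate measured over 24 hours
ec_overhead = (4 + 2) / 4     # EC 4:2: 1.5 raw bytes per usable byte
osds = 700
secs_per_day = 24 * 3600

raw_bytes_per_day = ingest_pib_per_day * PIB * ec_overhead
cluster_mb_s = raw_bytes_per_day / secs_per_day / 1e6   # ~25,400 MB/s raw
per_drive_mb_s = cluster_mb_s / osds                     # ~36 MB/s per OSD

print(f"cluster raw write: {cluster_mb_s / 1000:.1f} GB/s")
print(f"per-OSD write:     {per_drive_mb_s:.1f} MB/s")
```

Roughly 36 MB/s of sustained sequential writes per HDD is well within what a single spindle can deliver, which makes the daily figure plausible.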

Figure 6: Daily data ingest rate into Ceph clusters of various sizes

Concluding observations

In conclusion, below is our formative cost/benefit analysis of the results above, summarizing this blog series.

  • Benefits, Spark-on-Ceph vs. Spark on traditional HDFS:
    1. Reduce CapEx by reducing duplication: Reduce PBs of redundant storage capacity purchased to store duplicate data sets in HDFS silos, when multiple analytics clusters need access to the same data sets.
    2. Reduce OpEx/risk: Eliminate costs of scripting/scheduling data set copies between HDFS silos, and reduce risk-of-human-error when attempting to maintain consistency between these duplicate data sets on HDFS silos, when multiple analytics clusters need access to the same data sets.
    3. Accelerate insight from new data science clusters: Reduce time-to-insight when spinning-up new data science clusters by analyzing data in-situ within a shared data repository, as opposed to hydrating (copying data into) a new cluster before beginning analysis.
    4. Satisfy different tool/version needs of different data teams: While sharing data sets between teams, enable users within each cluster to choose the Spark/Hadoop tool sets and versions appropriate to their jobs, without disrupting users from other teams requiring different tools/versions.
    5. Right-size CapEx infrastructure costs: Reduce over-provisioning of either compute or storage common with provisioning traditional HDFS clusters, which grow by adding generic nodes (regardless if only more CPU cores or storage capacity is needed), by right-sizing compute needs (vCPU/RAM) independently from storage capacity needs (throughput/TB).
    6. Reduce CapEx by improving data durability efficiency: Reduce CapEx of storage capacity purchased by up to 50% due to Ceph erasure coding efficiency vs. HDFS default 3x replication.
  • Costs, Spark-on-Ceph vs. Spark on traditional HDFS:

    1. Query performance: Performance of Spark/Impala query jobs ranged from 0% to 131% longer execution times (single-user and multi-user concurrency tests).
    2. Write-job performance: Performance of write-oriented jobs (loading, transformation, enrichment) ranged from 37% to more than 200% longer execution times. [Note: Significant improvements in write-job performance are expected when downstream distributions adopt the following upstream enhancements to the Hadoop S3A client: HADOOP-13600, HADOOP-13786, HADOOP-12891.]
    3. Mixed-workload performance: Running multiple query and enrichment jobs concurrently resulted in 90% longer execution times.

For more details (and a hands-on chance to kick the tires of this solution yourself), stay tuned for the architect-level blog series in this same Red Hat Storage blog location. Thanks for reading.

Why Spark on Ceph? (Part 2 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph (Part 3 of 3)“)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store. In other words, the target data locations for their analytics jobs are something like “s3a://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”. Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible by the Hadoop S3A client.
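For illustration, pointing a Spark job at a Ceph object store takes little more than the standard S3A client settings. This is a configuration sketch: the endpoint, bucket, and credentials below are hypothetical placeholders, while the `fs.s3a.*` option names are the standard Hadoop S3A settings, which apply equally to a Ceph RADOS Gateway endpoint.

```python
from pyspark.sql import SparkSession

# Hypothetical RGW endpoint and credentials; substitute your own.
spark = (SparkSession.builder
         .appName("ceph-s3a-example")
         .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.com:8080")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         # Ceph RGW is typically addressed with path-style bucket URLs.
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Read and query data in place; no copy into HDFS required.
df = spark.read.parquet("s3a://bucket-name/path-to-file-in-bucket")
df.createOrReplaceTempView("web_sales")
spark.sql("SELECT COUNT(*) FROM web_sales").show()
```

The same `s3a://` URIs work unchanged from Hive, Impala, and MapReduce jobs, which is what lets multiple clusters share one data store.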

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web. Its schema has tens of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for our tests. With partners in industry, we also supplemented the TPC-DS data set with simulated clickstream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain text data into Parquet or ORC columnar, compressed formats)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes. The 1000s of analytics jobs described above completed successfully, with SparkSQL, Hive, MapReduce, and Impala jobs all using the S3A client to read and write data directly to a shared Ceph object store. The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line – what was the performance compared to traditional HDFS? For the answer, continue to Part 3 of this series….

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series. In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on-time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with their own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But, the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is cost prohibitive to many companies (both CapEx and OpEx).

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems


Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client. In AWS, you can spin-up many analytics clusters on EC2 instances, and share data sets between them on Amazon S3 (e.g. see Cloudera CDH support for Amazon S3). No more lengthy delays hydrating HDFS storage after spinning-up new clusters, or de-staging HDFS data upon cluster termination. With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.  

Bottom line: more and more data scientists and analysts are accustomed to spinning up analytics clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-stage cycles, and they expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage. It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

To learn more, continue to the next post in this series, “Why Spark on Ceph? (Part 2 of 3)”: Will mainstream analytics jobs run directly against a Ceph object store?

Leverage your existing storage investments with container-native storage

By Sayandeb Saha, Director, Product Management

The Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen wide customer adoption in the past year or so. Customers are deploying it in a wide variety of environments that include bare metal, virtualized, and private and public clouds. It mimics the diverse spread of environments in which OpenShift itself gets deployed—which is also CNS’s key strength (i.e., being able to back OpenShift wherever it runs—see the following graphic).

During the past year of customer adoption of CNS, we’ve observed some key trends that are unique to OpenShift/Kubernetes storage and that we’ll highlight in a series of blogs. This blog series will also include business and technical solutions that have worked for our customers.

In this blog post, we examine a trend where customers have adopted CNS as a storage management fabric that sits in between the OpenShift Container Platform and their classic storage gear. This particular adoption pattern continues to have a really high uptake, and there are sound business and technical reasons for doing this, which we’ll explore here.

First the Solution (The What): We’ve seen a lot of customers deploying CNS to serve out storage from their existing storage arrays/SANs and other traditional storage, as illustrated in the following graphic. In this scenario, block devices from existing storage arrays are served out with our CNS software running in VMs or containers/pods to OpenShift. In this case, the storage for the VMs that runs OpenShift is still served by the arrays.

Now the Why: Initially, it seemed backward as to why customers would be doing this; after all, software-defined storage solutions like CNS are meant to run on x86 bare metal (on premise) or in the public cloud, but further investigation revealed some interesting discoveries.

While OpenShift users and ops teams consume infrastructure, they typically do not manage infrastructure. In on-premise environments, OpenShift ops teams are highly dependent on other infrastructure teams for virtualization, storage, and operating systems for the infrastructure on which they run OpenShift. Similarly, in public clouds they consume the native compute and storage infrastructure available in these clouds.

As a consequence, they are highly dependent on storage infrastructure that is already in place. It’s typically very difficult to justify a storage server purchase for a new use case (OpenShift storage, in this case) when storage was already procured a year or more ago from a traditional storage vendor. The issue is that this traditional storage was neither designed for nor intended to be used with containers, and the storage budget has mostly been spent. This has driven OpenShift operations teams to adopt CNS as a storage management fabric that sits between their OpenShift Container Platform deployment and their existing storage array. The inherent flexibility of Red Hat Gluster Storage, the form of CNS being leveraged here, enables it to aggregate and pool block devices attached to a VM and serve them out to OpenShift workloads. OpenShift operations teams get the best of both worlds: they can repurpose the storage array that is already on premises while consuming CNS as a management fabric that offers the latest features, functionality, and manageability, with deep integration into the OpenShift platform.

In addition to business reasons, there are also various technical reasons that these OpenShift operations teams are adopting CNS. These include, but are not limited to:

  • Lack of deep integration of their existing storage arrays with OpenShift Container Platform
  • Even if their traditional storage array has rudimentary integration with OpenShift, very likely it has limited feature support, which renders it unusable with many OpenShift workloads (like lack of dynamic provisioning)
  • The roadmap of their storage arrays vendor may not match their current (or future) OpenShift/Kubernetes storage feature support needs, like lack of availability of a Persistent Volume (PV) resize feature
  • Needing a fully featured OpenShift storage solution for OpenShift workloads as well as the OpenShift infrastructure itself. Many existing storage platforms can support one or the other, but not both. For instance, a storage array serving out Fibre Channel LUNs (plain block storage) can’t back the OpenShift registry, which needs shared-access storage usually provided by a file or object storage back end.
  • They seek a consistent storage consumption and management experience across hybrid and multiple clouds. Once they learn to implement and manage CNS from Red Hat in one environment, it’s repeatable in all other environments. They can’t use their storage array in the public cloud.

Using CNS from Red Hat is a win for OpenShift ops teams. They can get started with a state-of-the-art storage back end for OpenShift apps and infrastructure without needing to acquire new infrastructure for OpenShift Storage right away. They have the option to move to x86-based storage servers during the following budget cycle as they grow their OpenShift footprint and onboard more apps and customers to it. The experience with CNS serves them well if they choose to implement OpenShift and CNS in other environments like AWS, Azure, and Google Cloud.

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.

Red Hat Summit 2018—It’s a wrap!

By Will McGrath, Product Marketing, Red Hat Storage

Wowzer! Red Hat Summit 2018 was a blur of activity. The quality and quantity of conversations with customers, partners, industry analysts, community members, and Red Hatters was unbelievable. This event has grown steadily the past few years to over 7,000 registrants this year. From a Storage perspective, this was the largest presence ever in terms of content and customer interaction.

Key announcements

For Storage, we made two key announcements during Red Hat Summit. The first was around Red Hat Storage One, a pre-configured offering engineered with our server partners, announced last week. If you didn't catch Dustin Black's blog post that goes into the details of the solution, check it out.

The second announcement, which occurred this week, highlighted the momentum in building a storage offering that provides a seamless developer experience and unified orchestration for containers. There are now more than 150 customers worldwide that have adopted Red Hat’s container-native storage solutions to enable their transition to the hybrid cloud, including Vorwerk and innovation award winner Lufthansa Technik.  

We featured a number of customer success stories, including Massachusetts Open Cloud, which worked with Boston Children’s Hospital to redefine medical-image processing using Red Hat Ceph Storage.

If you'd like to keep up on the containers news, check out our blog post from Tuesday and this week's news around CoreOS integration into Red Hat OpenShift. You might also like to check out the news around customers deploying OpenShift on Red Hat infrastructure, including OpenStack, through container-based application development and tightly integrated cloud technologies.

Storage expertise on display

On the morning of the first day of Summit, Burr Sutter and team demoed a number of technologies, including Red Hat Storage, to showcase application portability across the open hybrid cloud. This morning, Erin Boyd and team ran some way cool live demos that showed the power of microservices and functions with OpenShift, Storage, OpenWhisk, Tensorflow, and a host of technologies across the hybrid cloud.

For those who had the opportunity to attend any of the 20+ Red Hat Summit storage sessions, you were able to learn how our Red Hat Gluster Storage and Red Hat Ceph Storage products appeal to both traditional and modern users. The roadmap presentations by both Neil Levine (Ceph) and Sayan Saha (Gluster and container-native storage) were very popular. Sage Weil, the creator of Ceph, gave a standing-room only talk on the future of storage. Some of these storage sessions will be available on the Red Hat Summit YouTube channel in the coming weeks.

We also had several partners demoing their combined solutions with Red Hat Storage, including Intel, Mellanox, Penguin Computing, QCT, and Supermicro. Commvault had a guest appearance during Sean Murphy’s Red Hat Hyperconverged Infrastructure talk, explaining what led them to decide to include it in their HyperScale Appliances and Software offerings.

This year, we conducted an advanced Ceph users' group meeting the day before the conference, with marquee customers participating in deep-dive discussions with product and community leaders. During the conference, the storage lockers were a hit. We had a great presence on the show floor, including the community booths. Our breakfast, with over a hundred people registered, was well attended and featured a panel of customers and partners.

Continue the conversation

During his appearance on theCUBE by SiliconANGLE, Red Hat Storage VP/GM Ranga Rangachari talked about his point of view on "UnStorage." This idea, triggered by his original blog post on the subject, made quite a few waves at the event. Customers and analysts are responding positively to the idea of a new approach to storage in the age of hybrid cloud, hyperconvergence, and containers. Today is the last day to win prizes by tweeting @RedHatStorage with the hashtag #UnStorage.

If you missed us in San Francisco, we’ll be at OpenStack Summit in Vancouver from May 21-24. Red Hat is a headline sponsor at Booth A19. If you’re attending, come check out our OpenStack and Ceph demo, and check back on our blog page for news from the event. We’ll also be hosting the “Craft Your Cloud” event on Tuesday, May 22, from 6-9 pm at Steamworks in Vancouver. For more information and to register, click here. For more fun and networking opportunities, join the Ceph and RDO communities for a happy hour on May 23 from 6-8 pm at The Portside Pub in Vancouver. For more information and to register for that event, click here.

On to Red Hat Summit 2019

You can check out the videos and keynotes from Red Hat Summit 2018 on demand. Next year, Red Hat Summit is being held in Boston again (it's been rotating between San Francisco and Boston), so if you couldn't attend San Francisco this year, we urge you to plan to visit us in Boston next year. We hope you enjoyed our coverage of Red Hat Summit 2018, and hope to see you in 2019.

More accolades for Red Hat Ceph Storage

By Daniel Gilfix, Product Marketing, Red Hat Storage

Once again, an independent analyst news source has confirmed what many of you already know: Red Hat Ceph Storage stands alone in its commitment to technical excellence for the customers it serves. In the latest IT Brand Pulse survey covering Networking & Storage products, IT professionals from around the world selected Red Hat Ceph Storage as the "Scale-out Object Storage Software" leader in all categories, including price, performance, reliability, service and support, and innovation. The honors follow a pattern of recognition from IT Brand Pulse, which bestowed the leadership tag on Red Hat Ceph Storage in 2017, 2015, and 2014 and named Red Hat the "Service and Support" leader in 2016.

The report documented the results of the independent annual survey, conducted in March 2018, which polled IT professionals on their perception of vendor excellence in eleven different categories. Red Hat Ceph Storage earned ratings visibly head and shoulders above the competition, including more than a 2X differential over Scality and VMware.

Source: IT Brand Pulse, https://itbrandpulse.com/it-pros-vote-2018-networking-storage-brand-leaders/

It feels like just yesterday!

This latest third party validation comes on the heels of Red Hat Ceph Storage being named as a finalist in Storage Magazine and SearchStorage’s 2017 Products of the Year competition in late January 2018. Here, the evaluation was based on Red Hat Ceph Storage v2.3, one that made great strides in the areas of connectivity and containerization, including an NFS gateway to an S3-compatible object interface and compatibility with the Hadoop S3A plugin.

Red Hat Ceph Storage 3 carries the baton

IT professionals voting in this year's IT Brand Pulse survey were able to consider newer features in the important Red Hat Ceph Storage 3 release, which addressed a series of major customer challenges in object storage and beyond. We delivered full support for file-based access via CephFS, expanded ties to legacy storage environments through iSCSI, boosted our containerization options with containerized storage daemons (CSDs) for 25% hardware deployment savings, and introduced an easier monitoring interface and additional layers of automation for more self-maintaining deployments.

See you at Red Hat Summit!

Ceph booth at Red Hat Summit 2018

As usual, the real testament to our success is the continued satisfaction of our customer base, the ones who are increasingly choosing Red Hat Ceph Storage for modern use cases like AI and ML, rich media, data lakes, hybrid cloud infrastructure based on OpenStack, and traditional backup and restore.

Ceph user group at Red Hat Summit 2018

We look forward to discussing deployment options and whether Red Hat Ceph Storage might be right for you this week at Red Hat Summit—there's still so much more to go! Catch us at one of the following sessions in Moscone West:

Today (Wednesday, May 9)

Tomorrow (Thursday, May 10)

Container-native storage from Red Hat is on a roll at Red Hat Summit 2018!

By Steve Bohac, Product Marketing, Red Hat Storage

It’s Red Hat Summit week, with this year’s edition taking place in San Francisco! As always, Red Hat has a plethora of announcements this week.

If you haven't already heard the news, yesterday we announced substantial customer adoption momentum with container-native storage from Red Hat. Customers such as Lufthansa Technik, Aragonesa de Servicios Telemáticos (AST), Generali Switzerland, IHK-GfI, and Vorwerk (amongst many more) are using Red Hat OpenShift Container Platform for cloud-native applications and are representative of how organizations are seeking out scalable, fully integrated, developer-friendly storage for containers.

Based on Red Hat Gluster Storage, container-native storage from Red Hat offers these organizations scalable, persistent storage for containers across hybrid clouds with increased application portability. Tightly integrated with Red Hat OpenShift Container Platform, container-native storage from Red Hat can be used to persist not only application data but data for logging, metrics, and the container registry. The deep integration with Red Hat OpenShift Container Platform helps developers easily provision and manage elastic storage for applications and offers a single point of support. Customers use container-native storage to persist data for a variety of applications, including SQL and NoSQL databases, CI/CD tools, web serving, and messaging applications.

Organizations using container-native storage from Red Hat can benefit from simplified management, rapid deployment, and a single point of support. The versatility of container-native storage from Red Hat can enable customers to run cloud-native applications in containers, on bare metal, in virtualized environments, or in the public cloud.

For those of you attending Red Hat Summit this week, as always we know you love breakout sessions to learn more about Red Hat solutions—and we have a bunch covering container-native storage from Red Hat! Don't forget to get your raffle tickets at each of the storage sessions you attend. Here's what the lineup for container-native storage from Red Hat sessions looks like:

(All in Moscone West unless otherwise noted)

Tuesday, May 8

Thursday, May 10

Want to learn more?

For hands-on experience combining OpenShift and container-native storage, check out our test drive, a free, in-browser lab experience that walks you through using both.

Happy Red Hat Summit! Hope to see you this week!

Five ways to experience UnStorage at Red Hat Summit

Welcome to Red Hat Summit 2018 in San Francisco! The Storage team has been hard at work to make this the best possible showcase of technology and customers—and have fun while doing it. This year our presence is built around the theme: UnStorage for the modern enterprise.

What is UnStorage?

Today’s users need their data so accessible, so scalable, so versatile that the old concept of “storing” it seems counterintuitive. Perhaps a better way of describing the needs of the modern enterprise is UnStorage, as outlined in this blog post by Red Hat Storage VP and GM, Ranga Rangachari.

Five ways to experience UnStorage at Red Hat Summit

  1. Content is king: We have 24 sessions packed with storage knowledge, best practices, and success stories. Over 21 Red Hat Storage customers will be featured at the event, including on a panel at our breakfast (open to all attendees) on Wednesday at 7 am at the Marriott Marquis. Learn how some of the most innovative enterprises leverage the power of UnStorage to solve their scale and agility challenges.
  2. Without hardware partners, it’s like clapping with one hand: By definition, the success of software-defined storage hinges on the strength of the hardware ecosystem. Since the storage controller software is only half the solution, it’s important to have deep engineering investment with hardware and component vendors to build rock-solid solutions for customers. With partners like Supermicro, Mellanox, Penguin Computing, Intel, Commvault, and QCT, all featured at the conference, Red Hat Storage enables greater customer choice and openness—a key tenet of UnStorage.
  3. Explore your storage curiosity: UnStorage is all about breaking the rules to make things better. You’ll find a lot of creative ideas that are off the beaten track. Just as UnStorage is ubiquitous—it stretches across private, public, and hybrid cloud boundaries—it’s hard to miss Storage at the conference. You can find storage lockers near the expo entrance where you can drop off backpacks and charge phones while you attend sessions. Or enter to win one of two Star Wars collector edition drones by attending sessions or visiting the booth. Stop by the Storage Launch Pad to play online games, take surveys, and pick up a ton of giveaways, including two golden tickets handed out every day, which will afford you a special set of prizes.
  4. Test drive storage: Kick the tires on UnStorage with one of three test drives for Ceph, Gluster, and OpenShift Ops. As the name suggests, software-defined storage is completely decoupled from hardware, making it easy to test and deploy in the cloud. On the other side of the deployment spectrum, you can also try out the sizing tool for Red Hat Storage One, our single SKU pre-configured system announced last week. Stop by one of four Storage pods on the expo floor for demos and conversations with Storage experts.
  5. The proof of the pudding: Stop by Thursday’s keynote with CTO Chris Wright and live demos by Burr Sutter and team featuring container-native storage baked into Red Hat platforms such as OpenShift. UnStorage is as invisible as it is pervasive. Modern enterprises demand that storage be fully integrated into compute platforms for easier management and scale. With container-native storage surpassing 150 customers in the last year alone, learn how customers such as Schiphol, FICO, and Macquarie Bank are building next-generation hybrid clouds with Red Hat technologies.

We’re not all-work-all-the-time at Red Hat Storage, though. Join us at the community happy hour or the hybrid cloud infrastructure party on Tuesday to blow off some steam during a long week. Our social media strategist, Colleen Corrice, is running a way cool Twitter contest: All you have to do is post a picture at a Storage session or booth @RedHatStorage with the hashtag #UnStorage to receive a T-shirt and be included in a drawing for a personal planetarium.

Finally, check out this infographic on all things UnStorage @ Red Hat Summit. Please check back for a daily blog through this week. We hope to see you at Red Hat Summit 2018.

Introducing Red Hat Storage One

More than a year ago, our Storage Architecture team set out to answer the question of how we can overcome the last barriers to software-defined storage (SDS) adoption. We know from our thousands of test cycles and hundreds of hours of data analysis that a properly deployed Gluster or Ceph system can easily compete with—and often surpass—the feature and performance capabilities of any proprietary storage appliance, usually at a fraction of the cost. We have many customer success stories to back up these claims with real-world deployments for enterprise workloads. However, one piece of feedback is consistent and clear: Not every customer is willing or prepared to commit the resources necessary to architect a system to the best standards for their use case. The barrier to entry is simply higher than for a comparable proprietary appliance that is sold in units of storage with specific workload-based performance metrics. The often-lauded flexibility of SDS is, in these cases, its Achilles' heel.

Continue reading “Introducing Red Hat Storage One”

Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform

By  Annette Clewett, Humble Chirammal, Daniel Messer, and Sudhir Prasad

With today’s release of Red Hat OpenShift Container Platform 3.9, you will now have the convenience of deploying Container-Native Storage (CNS) 3.9 built on Red Hat Gluster Storage as part of the normal OpenShift deployment process in a single step. At the same time, major improvements in ease of operation have been introduced to give you the ability to monitor provisioned storage consumption, expand persistent volume (PV) capacity without downtime, and use a more intuitive naming convention for persistent volume names.
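The downtime-free PV expansion mentioned above is worth a concrete sketch. Kubernetes exposes this through the `allowVolumeExpansion` flag on a StorageClass, after which raising a claim's storage request grows the volume in place. The names, endpoint, and sizes below are hypothetical placeholders, not values from a real cluster:

```yaml
# Expansion must be enabled on the StorageClass (a one-time change);
# the resturl here is a placeholder for a real heketi endpoint.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
provisioner: kubernetes.io/glusterfs
allowVolumeExpansion: true
parameters:
  resturl: "http://heketi-storage.example.com:8080"
---
# Then grow an existing claim by raising its request; the underlying
# Gluster volume is expanded without detaching it from the workload.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: glusterfs-storage
  resources:
    requests:
      storage: 10Gi   # raised from an original 5Gi
```

In practice the same effect can be achieved by editing the live claim (for example with `oc edit pvc app-data`) rather than re-applying a manifest.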

Continue reading “Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform”

Product of the Year finalist

By Daniel Gilfix, Red Hat Storage

For the second year in a row, Red Hat Ceph Storage has been named as a finalist in Storage Magazine and SearchStorage's 2017 Products of the Year competition. Whereas in 2016, the honor was bestowed upon what was arguably the most important product release since Ceph came aboard the Red Hat ship, this year's candidate was Red Hat Ceph Storage 2.3, a point release. This means a lot to us, but as a reader—perhaps a current or prospective customer—why should you care?

Excellent question, I must say, since normally we don’t like to boast. Our focus here at Red Hat is on the needs, experiences, and ultimate satisfaction of those who use our solutions. And given the evolution of Red Hat Ceph Storage from its acquisition from Inktank, the storage vendor, to Red Hat, the IT vendor, one would hope that we’re making progress.

Source: Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Conflict_Resolution_in_Human_Evolution.jpg

The fact that Red Hat Ceph Storage 2.3 was recognized as among those reflecting the latest trends in flash, cloud, and container technologies is a good sign that this is true. More important validation, however, comes from customers like Produban and UK Cloud, who are incorporating the product into broad Red Hat solutions. It also comes from those like Monash University and CLIMB, who can appreciate improvements to versatility, connectivity, and flexibility, like the NFS gateway to an S3-compatible object interface, compatibility testing with the Hadoop S3A plugin, and a containerized version of the product.

Even more uplifting from a user perspective today is the fact that v2.3 has already been superseded by Red Hat Ceph Storage 3, a more substantive advance into the realm of object storage that addresses a few key customer requirements while making adoption less challenging. For example, the product rounded out its value as a cost- and resource-saving unified storage platform with full support for file-based access (CephFS) and links to legacy storage environments through iSCSI. Containerization was advanced to include containerized storage daemons (CSDs), enabling nearly 25% hardware deployment savings and more predictable performance through the co-location of storage daemons. And we added a snazzy new monitoring interface and additional layers of automation to make deployments more self-maintaining. According to Olivier Delachapelle, Head of Data Center Category Management EMEIA at Fujitsu, "Red Hat Ceph Storage 3 is probably the most advanced software-defined storage solution combining extreme scalability, inherent disaster resilience, and significant price-capacity value."

Snapshot of Red Hat Ceph Storage management console, top-level interface

In the end, we feel good about the public recognition, but we feel even better when our customers and partners are happy and have what they need to succeed. I encourage you to share your thoughts about where we’re on target and/or perhaps missing the boat. Ultimately, being part of an IT company means our storage solution can serve a role that was perhaps unimaginable before, and it supports our commitment to real-world deployment of the future of storage.