Why Spark on Ceph? (Part 2 of 3)

Introduction

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph (Part 3 of 3)“)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store.  In other words, target data locations for their analytics jobs are something like “s3://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”.  Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible via the Hadoop S3A client.

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs which were representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web. Its schema has 10s of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for out tests. With partners in industry, we also supplemented the TPC-DS data set with simulated click-stream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain text data into Parquet or ORC columnar, compressed formats)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes. 1000s of analytics jobs described above completed successfully. SparkSQL, Hive, MapReduce, and Impala jobs all using the S3A client to read and write data directly to a shared Ceph object store. The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line – what was the performance compared to traditional HDFS? For the answer, continue to Part 3 of this series….

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series. In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on-time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with their own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But, the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is cost prohibitive to many companies (both CapEx and OpEx).

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems

 

Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client. In AWS, you can spin-up many analytics clusters on EC2 instances, and share data sets between them on Amazon S3 (e.g. see Cloudera CDH support for Amazon S3). No more lengthy delays hydrating HDFS storage after spinning-up new clusters, or de-staging HDFS data upon cluster termination. With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.  

Bottom-line … more-and-more data scientists and analysts are accustomed to spinning-up analytic clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-stage cyles, and expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage. It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

To learn more, continue to the next post in this series, “Why Spark on Ceph? (Part 2 of 3)”Will mainstream analytics jobs run directly against a Ceph object store?

Leverage your existing storage investments with container-native storage

By Sayandeb Saha, Director, Product Management

The Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen wide customer adoption in the past year or so. Customers are deploying it in a wide variety of environments that include bare metal, virtualized, and private and public clouds. It mimics the diverse spread of environments in which OpenShift itself gets deployed—which is also CNS’s key strength (i.e., being able to back OpenShift wherever it runs—see the following graphic).

During the past of year of customer adoption of CNS, we’ve observed some key trends that are unique for OpenShift/Kubernetes storage and that we’ll highlight in a series of blogs. This blog series will also include business and technical solutions that have worked for our customers.

In this blog post, we examine a trend where customers have adopted CNS as a storage management fabric that sits in between the OpenShift Container Platform and their classic storage gear. This particular adoption pattern continues to have a really high uptake, and there are sound business and technical reasons for doing this, which we’ll explore here.

First the Solution (The What): We’ve seen a lot of customers deploying CNS to serve out storage from their existing storage arrays/SANs and other traditional storage, as illustrated in the following graphic. In this scenario, block devices from existing storage arrays are served out with our CNS software running in VMs or containers/pods to OpenShift. In this case, the storage for the VMs that runs OpenShift is still served by the arrays.

Now the Why: Initially, it seemed backward as to why customers would be doing this; after all, software-defined storage solutions like CNS are meant to run on x86 bare metal (on premise) or in the public cloud, but further investigation revealed some interesting discoveries.

While OpenShift users and ops teams consume infrastructure, they typically do not manage infrastructure. In on-premise environments, OpenShift ops teams are highly dependent on other infrastructure teams for virtualization, storage, and operating systems for the infrastructure on which they run OpenShift. Similarly, in public clouds they consume the native compute and storage infrastructure available in these clouds.

As a consequence, they are highly dependent on storage infrastructure that is already in place. Typically, it’s very difficult to justify a storage server purchase when storage has been already procured a year or more ago from a traditional storage vendor for a new use case (OpenShift storage in this case). The issue is that this traditional storage was not designed for nor intended to be used with containers and the budget for storage has mostly been spent. This has driven the OpenShift operations teams to adopt CNS effectively as a storage management fabric that sits between their OpenShift Container Platform deployment and their existing storage array. The inherent flexibility of Red Hat Gluster Storage in this case is the form of CNS being leveraged, which enables it to aggregate and pool block devices that are attached to a VM and serve that out to OpenShift workloads. OpenShift operations teams can now have the best of both worlds. They can repurpose their existing storage array that is already in place/on premise but actually consume CNS which operates as a management fabric offering the latest and greatest in terms of feature, functionality, and manageability with a deep integration with the OpenShift platform.

In addition to business reasons, there are also various technical reasons that these OpenShift operations teams are adopting CNS. These include, but are not limited to:

  • Lack of deep integration of their existing storage arrays with OpenShift Container Platform
  • Even if their traditional storage array has rudimentary integration with OpenShift, very likely it has limited feature support, which renders it unusable with many OpenShift workloads (like lack of dynamic provisioning)
  • The roadmap of their storage arrays vendor may not match their current (or future) OpenShift/Kubernetes storage feature support needs, like lack of availability of a Persistent Volume (PV) resize feature
  • Needing a fully featured OpenShift Storage solution for OpenShift workloads as well as the OpenShift infrastructure itself. Many existing storage platforms can support one or the other, but not both. For instance, a storage array serving out Fiber Channels LUNs (plain block storage) can’t back an OpenShift registry as one needs shared storage access for it usually provided by a file or object storage back end.
  • They seek a consistent storage consumption and management experience across hybrid and multiple clouds. Once they learn to implement and manage CNS from Red Hat in one environment, it’s repeatable in all other environments. They can’t use their storage array in the public cloud.

Using CNS from Red Hat is a win for OpenShift ops teams. They can get started with a state-of-the-art storage back end for OpenShift apps and infrastructure without needing to acquire new infrastructure for OpenShift Storage right away. They have the option to move to x86-based storage servers during the following budget cycle as they grow their OpenShift footprint and onboard more apps and customers to it. The experience with CNS serves them well if they choose to implement OpenShift and CNS in other environments like AWS, Azure, and Google Cloud.

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.