Red Hat OpenShift Container Platform 3.10 with Container-Native Storage 3.9

This post documents how to install Container-Native Storage 3.9 (CNS 3.9) with OpenShift Container Platform 3.10 (OCP 3.10). CNS provides persistent storage for OCP’s general-application consumption and for the registry.

CNS 3.9 installation with OCP 3.10 advanced installer

The deployment of CNS 3.9 can be accomplished using openshift-ansible playbooks and specific inventory file options. The first group of hosts in glusterfs specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes for the general-purpose application cluster, glusterfs. The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for use exclusively by a hosted registry that can scale. This cluster will not offer a StorageClass for file-based PersistentVolumes with the options and values as they are currently configured.

Following is an example of a partial inventory file with selected options concerning deployment of CNS 3.9 for applications and registry. When using options with specific size values (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors, adjust them for your particular deployment needs.


# registry

# Container image to use for glusterfs pods
# Container image to use for gluster-block-provisioner pod
# Container image to use for heketi pods
# CNS storage cluster for applications

# CNS storage cluster for OpenShift infrastructure

[nodes]
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"
<hostname> openshift_node_group_name="node-config-compute"

[glusterfs]
<hostname> glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
<hostname> glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
<hostname> glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
<hostname> glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
<hostname> glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
<hostname> glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
<hostname> glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
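Before running the installer, it can help to confirm that each storage group in the inventory lists enough hosts. The following is a minimal sketch, not part of openshift-ansible: the hostnames are invented, the parser only understands simple group headers and one host per line, and the minimum of three hosts for glusterfs_registry is an assumption taken from the example above (four for glusterfs comes from the text).

```python
from collections import defaultdict

# Hypothetical inventory excerpt; the hostnames are invented for illustration.
SAMPLE_INVENTORY = """\
[glusterfs]
app1.example.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
app2.example.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
app3.example.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
app4.example.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
infra1.example.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
infra2.example.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
infra3.example.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
"""

# Assumed minimums: four for glusterfs (per the text), three for the registry
# cluster (matching the example inventory).
MINIMUM_HOSTS = {"glusterfs": 4, "glusterfs_registry": 3}


def hosts_per_group(inventory_text):
    """Map each [group] header to the first token of every line beneath it."""
    groups = defaultdict(list)
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current is not None:
            groups[current].append(line.split()[0])
    return dict(groups)


def undersized_groups(inventory_text, minimums=MINIMUM_HOSTS):
    """Return {group: host_count} for every group below its assumed minimum."""
    groups = hosts_per_group(inventory_text)
    result = {}
    for name, minimum in minimums.items():
        count = len(groups.get(name, []))
        if count < minimum:
            result[name] = count
    return result


if __name__ == "__main__":
    print(undersized_groups(SAMPLE_INVENTORY))
```

An empty result means both groups meet the assumed minimums; any group that falls short is reported with the number of hosts found.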

CNS 3.9 uninstall

With this release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing a CNS installation that is currently being used by any applications, you should remove those applications before removing CNS, because they will lose access to storage. This includes infrastructure applications like registry.

If you have the registry using a glusterfs PersistentVolume, remove it with the following commands:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If running the uninstall.yml because a deployment failed, run the uninstall.yml playbook with the following variables to wipe the storage devices for both glusterfs and glusterfs_registry clusters before trying the CNS installation again:

ansible-playbook -i <path_to_inventory file> -e
"openshift_storage_glusterfs_wipe=True" -e 

CNS 3.9 post installation for applications and registry

You can add CNS clusters and resources to an existing OCP install using the following command. This same process can be used if CNS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file> 

After the new cluster(s) is created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file> 

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also watch this short video explaining why to use CNS with OpenShift.

Introducing OpenShift Container Storage: Meet the new boss, same as the old boss!

By Steve Bohac, Product Marketing

Today, we’re introducing Red Hat OpenShift Container Storage 3.10.

Is this product new to you? It surely is, because with today’s announcement of Red Hat OpenShift Container Platform 3.10, we’ve rebranded our container-native storage (CNS) offering as Red Hat OpenShift Container Storage. This is still the same product with the strong customer momentum we announced a few months ago during Red Hat Summit week.

Why the new name? “Red Hat OpenShift Container Storage” better reflects the product offering and its strong affinity with Red Hat OpenShift Container Platform. Not only does it install with OpenShift (via Red Hat Ansible), it’s developed, qualified, tested, and versioned coincident with OpenShift Container Platform releases. This product name best reflects that strong integration. Again, the product itself didn’t change in any way; all that’s changed is the product name.

Red Hat OpenShift Container Storage enables application portability and a consistent user experience across the hybrid cloud.

This new release, Red Hat OpenShift Container Storage 3.10, is the follow-on to Container-Native Storage 3.9 and introduces three important features for container-based storage with OpenShift: (1) arbiter volume support enabling high availability with efficient storage utilization and better performance, (2) enhanced storage monitoring and configuration visibility using the OpenShift Prometheus framework, and (3) block-backed persistent volumes (PVs) now supported for general application workloads in addition to supporting OCP infrastructure workloads.

If you haven’t already bookmarked our Red Hat Storage blog, now would be a great time! Over the coming weeks, we will be publishing deeper discussions on OpenShift Container Storage. In the meantime, though, for a more thorough understanding of OpenShift Container Storage, check out these recent technical blogs describing in depth the value of our approach to storage for containers:

Want to learn more?

For more information on OpenShift Container Storage, click here. Also, you can find the new Red Hat OpenShift Container Storage datasheet here.

For hands-on experience combining OpenShift and OpenShift Container Storage, check out our test drive, a free, in-browser lab experience that walks you through using both.

For more general information around storage for containers, check out our Container Storage for Dummies book.

Storing tables in Ceph object storage


In one of our previous posts, Anatomy of the S3A filesystem client, we showed how Spark can interact with data stored in a Ceph object storage in the same fashion it would interact with Amazon S3. This is all well and good if you plan on exclusively writing applications in PySpark or Scala, but wouldn’t it be great to allow anyone who is familiar with SQL to interact with data stored in Ceph?

That’s what SparkSQL is for, and while Spark has the ability to infer schema, it’s a lot easier if the data is already described in a metadata service like the Hive Metastore. The Hive Metastore stores table schema information and statistics on tables and partitions, and generally aids the query planners of various SQL engines in constructing efficient query plans. So, regardless of whether you’re using good ol’ Hive, SparkSQL, Presto, or Impala, you’ll still be storing and retrieving metadata from a centralized store. Even if your organization has standardized on a single query engine, it still makes sense to have a centralized metadata service, because you’ll likely have distinct workload clusters that will want to share at least some data sets.


The Hive Metastore can be housed in a local Apache Derby database for development and experimentation, but a more production-worthy approach would be to use a relational database like MySQL, MariaDB, or Postgres. In the public cloud, a best practice is to store the database tables on a distinct volume to get features like snapshots, and the ability to detach and reattach it to a different instance. In the private cloud, where OpenStack reigns supreme, most folks have turned to Ceph to provide block storage. To learn more about how to leverage Ceph block storage for database workloads, I suggest taking a look at the MySQL reference architecture we authored in conjunction with the open source database experts over at Percona.

While you can configure Hive, Spark, or Presto to interact directly with the MySQL database containing the Metastore, interacting with the Hive Server 2 Thrift service provides better concurrency and an improved security posture. Overall, the general idea is depicted in the following diagram:

Storing tabular data as objects

In a greenfield environment where all data will be stored in the object store, you could simply set hive.metastore.warehouse.dir to an S3A location a la s3a://hive/warehouse. If you haven’t already had a chance to read our Anatomy of the S3A filesystem client post, you should take a look if you’re interested in learning how to configure S3A to interact with a local Ceph cluster instead of Amazon S3. When an S3A location is used as the Metastore warehouse directory, all tables that are created will default to being stored in that particular bucket, under the warehouse pseudo directory. A better approach is to utilize external locations to map databases, tables, or simply partitions to different buckets, perhaps so they can be secured with distinct access controls or other bucket policy features. An example of including an external location specification during table creation might be:

create external table inventory (
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
row format delimited fields terminated by '|'
location 's3a://tpc/inventory';

That’s it. When you interact with this inventory table, data will be read directly from the object store by way of the S3A filesystem client. One of the cool aspects of this approach is that the location is abstracted away: you can write queries that scan tables with different locations, or even scan a single table with multiple locations. In this fashion, you might have recent data partitions with a MySQL external location, and data older than the current week in partitions with external locations that point to object storage. Cool stuff!

Serialization, partitions, and statistics

We all want to be able to analyze data sets quickly, and there are a number of tools available to help realize this goal. The first is using different serialization formats. In my discussions with customers, the two most common serialization formats are the columnar formats ORC and Parquet. The gist of these formats is that instead of requiring complete scans of entire files, columns of data are separated into stripes, and metadata describing each column’s stripe offsets is stored in a file header or footer. When a query is planned, requests can read in only the stripes that are relevant to that particular query. For more on different serialization formats and their relative performance, I highly suggest this analysis by our friends over at Silicon Valley Data Science.

We have seen great performance with both Parquet and ORC when used in conjunction with a Ceph object store. Parquet tends to be slightly faster, while ORC tends to use slightly less disk space. This small delta might simply be the result of these formats using different compression algorithms by default (Snappy vs. ZLIB). Speaking of compression, it’s really easy to think you’re using it when you are in fact not. Make sure to verify that your tables are actually being compressed. I suggest including the compression specification in table creation statements instead of hoping the engine you are using has the defaults configured the way you want.
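One way to avoid relying on engine defaults is to generate table definitions that always carry an explicit compression setting. The sketch below is a hypothetical helper, not part of Hive or Spark; it assumes Hive-style DDL and the Hive table properties parquet.compression and orc.compress, and other engines may expect different settings.

```python
# Hive table property that selects the codec for each columnar format
# (assumption: Hive-style DDL; other engines may use different settings).
COMPRESSION_PROPERTY = {
    "PARQUET": "parquet.compression",
    "ORC": "orc.compress",
}


def create_table_ddl(table, columns, location, file_format, codec):
    """Render a CREATE EXTERNAL TABLE statement with an explicit codec."""
    fmt = file_format.upper()
    if fmt not in COMPRESSION_PROPERTY:
        raise ValueError(f"unsupported format: {file_format}")
    prop = COMPRESSION_PROPERTY[fmt]
    cols = ",\n".join(f"   {name} {ctype}" for name, ctype in columns)
    return (
        f"create external table {table} (\n{cols}\n)\n"
        f"stored as {fmt}\n"
        f"location '{location}'\n"
        f"tblproperties ('{prop}'='{codec.upper()}');"
    )


if __name__ == "__main__":
    columns = [
        ("inv_date_sk", "bigint"),
        ("inv_item_sk", "bigint"),
        ("inv_warehouse_sk", "bigint"),
        ("inv_quantity_on_hand", "int"),
    ]
    print(create_table_ddl("inventory", columns, "s3a://tpc/inventory",
                           "parquet", "snappy"))
```

Keeping the codec in the statement makes the intent visible in version control, and it is still worth inspecting the written files to confirm they are compressed.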

In addition to serialization formats, it’s important to consider how your tables are partitioned, and how many files you have per partition. All S3 API calls are RESTful, which means they are heavier weight than HDFS RPC calls. Having fewer, larger partitions, with fewer files per partition, will definitely translate into higher throughput and reduced query latency. If you already have tables with loads of partitions, and many files per partition, it might be worthwhile to consolidate them into larger partitions with fewer files each as you move them into object storage.

With data serialized and partitioned intelligently, queries can be much more efficient, but there is a third way you can help the query planner of your execution engine do its job better: table and column statistics. Table statistics can be collected with ANALYZE TABLE table COMPUTE STATISTICS statements, which count the number of rows for a particular table and its partitions. The row counts are stored in the Metastore, and can be used by other engines that interrogate the Metastore during query planning.
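As a small illustration, the statements for table-level and column-level statistics could be generated per table like this. It is a sketch that assumes Hive syntax; the helper name is hypothetical, and some engines (Spark SQL, for example) expect an explicit column list after FOR COLUMNS.

```python
def analyze_statements(table, partition_columns=(), include_columns=True):
    """Build Hive-style ANALYZE statements for one table.

    partition_columns: partition keys to include in a PARTITION(...) clause,
    which asks Hive to gather statistics for every partition.
    """
    part = ""
    if partition_columns:
        part = f" partition({', '.join(partition_columns)})"
    statements = [f"analyze table {table}{part} compute statistics;"]
    if include_columns:
        statements.append(
            f"analyze table {table}{part} compute statistics for columns;"
        )
    return statements


if __name__ == "__main__":
    for statement in analyze_statements("inventory"):
        print(statement)
```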

To the cloud!

Getting cloudy

Many modern enterprises have initiatives underway to modernize their IT infrastructures, and today that means moving workloads to cloud environments, whether they be public or private. On the surface, moving data platforms to a cloud environment shouldn’t be a difficult undertaking: Leverage cloud APIs to provision instances, and use those instances like their bare-metal brethren. For popular analytics workloads, this means running storage services in those instances that are specific to analytics and continuing a siloed approach to storage. This is the equivalent of lift and shift for data-intensive apps, a shortcut approach undertaken by some organizations when migrating an enterprise app to a cloud when they don’t have the luxury of adopting a more contemporary application architecture.

The following data platform principles pertain to moving legacy data platforms to either a public cloud or a private cloud. The private cloud storage platform discussed is Ceph, of course, a popular open-source storage platform used in building private clouds for a variety of data-intensive workloads, including MySQL DBaaS and Spark/Hadoop analytics-as-a-service.


Elasticity is one of the key benefits of cloud infrastructure, and running storage services inside your instances definitely cramps your ability to take advantage of it. For example, let’s say you have an analytics cluster consisting of 100 instances and the resident HDFS cluster has a utilization of 80 percent. Even though you could terminate 10 of those instances and still have sufficient storage space, you would need to rebalance the data, which is often undesirable. You will also be out of luck if months later you realize that you’re only using half the compute resources of that cluster. If the infrastructure teams make a new instance flavor available, say with fancy GPUs for your hungry machine-learning applications, it’ll be much harder to start consuming them if it entails the migration of storage services.
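To put numbers on the example above, here is a minimal sketch of the arithmetic. It assumes every instance contributes the same storage capacity and that data is spread evenly after rebalancing; the function name is hypothetical.

```python
def utilization_after_resize(instances, utilization, removed):
    """Storage utilization once `removed` instances are terminated.

    Assumes equal capacity per instance, so stored data can be expressed
    in units of "one instance's worth" of capacity.
    """
    remaining = instances - removed
    if remaining <= 0:
        raise ValueError("at least one instance must remain")
    used = instances * utilization
    return used / remaining


if __name__ == "__main__":
    # 100 instances at 80 percent, shrinking by 10: roughly 0.889.
    print(round(utilization_after_resize(100, 0.80, 10), 3))
```

The data still fits on 90 instances, but all of it that lived on the terminated instances has to be moved first, which is the rebalancing cost the text refers to.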

This is why companies like Netflix decided to use object storage as the source of truth for the analytics applications, as detailed in my previous post What about locality? It enables them to expand, and contract, workload-specific clusters as dictated by their resource requirements. Need a quick cluster with lots of nodes to chew through a one-time ETL? No problem. Need transient data labs for data scientists by day only to relinquish those resources for use for reporting after hours? Easy peasy.

Data infrastructure

Before departing on the journey to cloudify an organization’s data platform architecture, an important first step is assessing the capabilities of the cloud infrastructure, public or private, you intend to consume to make sure it provides the features that are most important to data-intensive applications. World-class data infrastructure provides its tenants with a number of fundamental building blocks that lend power and flexibility to the applications that will sit atop them.

Persistent block storage

Not all data is big, and it’s important to provide persistent block storage for data sets that are well served by database workhorses like MySQL and Postgres. An obvious example is the database used by the Hive metastore. With all these workload clusters being provisioned and deprovisioned, it’s often desirable to have them interact with a common metadata service. For more details about how persistent block storage fits into the dizzying array of architectural decisions facing database administrators, I suggest a read of our MySQL reference architecture.

I also suggest infrastructure teams learn how to collapse persistent block storage performance and spatial capacity into a single dimension, all while providing deterministic performance. For this, I recommend watching the session I gave with several of my colleagues at Red Hat Summit last year.

Local SSD

Sometimes we need to access data fast, really fast, and the best way to realize that is with locally attached SSDs. Most clouds make this possible with special instance flavors, modeled after the i3 instances provided by Amazon EC2. In OpenStack, the equivalent would be instances where the hypervisor uses PCIe passthrough for NVMe devices. These devices are best leveraged by applications that handle their own replication and fault tolerance, good examples being Cassandra and clustered Elasticsearch. Fast local devices are also useful as scratch space for intermediate shuffle data that doesn’t fit in memory, or even S3A buffer files.


Machine learning frameworks like TensorFlow, Torch, and Caffe can all benefit from GPU acceleration. With the burgeoning popularity of these frameworks, it’s important that infrastructure cater to them by providing instance flavors infused with GPU goodness. In OpenStack, this can be accomplished by passing through entire GPU devices in a fashion similar to that detailed in the Local SSD section, or by using GPU virtualization technologies like Intel GVT-g or NVIDIA GRID vGPU. OpenStack developers have been diligently integrating these technologies, and I’d recommend operations folk understand how to deploy them once these features mature.

Object storage

In both public and private clouds, deploying multiple analytics clusters backed by object storage is becoming increasingly popular. In the private cloud, a number of things are important to prepare a Ceph object store for data-intensive applications.

Bucket sharding

Bucket sharding was enabled by default with the advent of Red Hat Ceph Storage 3.0. This feature spreads a bucket’s indexes across multiple shards with corresponding RADOS objects. This is great for increasing the write throughput of a particular bucket, but comes at the expense of LIST operations. This is because index entries are interleaved across shards and must be gathered before replying to a LIST request. Today, the S3A filesystem client performs many LIST requests, and as such it is advantageous to disable bucket sharding by setting rgw_override_bucket_index_max_shards to 1.
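For illustration, a ceph.conf fragment carrying that setting could be rendered as follows. This is only a sketch: placing the option in a [global] section is an assumption (the text does not say which section to use), and the helper is hypothetical rather than part of any Ceph tooling.

```python
import configparser
import io


def ceph_conf_fragment(section, options):
    """Render a small INI-style fragment suitable for pasting into ceph.conf."""
    parser = configparser.ConfigParser()
    parser[section] = {name: str(value) for name, value in options.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Assumed section name; adjust to wherever your gateway options live.
    print(ceph_conf_fragment(
        "global", {"rgw_override_bucket_index_max_shards": 1}
    ))
```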

Bucket indexes on SSD

The Ceph object gateway uses distinct pools for objects and indexes, and as such those pools can be mapped to different device classes. Due to the S3A filesystem client’s heavy usage of LIST operations, it’s highly recommended that index pools be mapped to OSDs sitting on SSDs for fast access. In many cases, this can be achieved even on existing hardware by using the remaining space on devices that house OSD journals.

Erasure coding

Due to the immense storage requirements of data platforms, erasure coding for the data section of objects is a no-brainer. Compared to the 3x replication that’s common with HDFS, erasure coding reduces the required storage by 50%. When tens of petabytes are involved, that amounts to big savings! Most folks will probably end up using either 4+2 or 8+3 and spreading chunks across hosts using ruleset-failure-domain=host.
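The savings figure follows from comparing raw-to-usable ratios. A minimal sketch of that arithmetic, assuming a k+m profile stores k data chunks plus m coding chunks for every k chunks of user data (function names are hypothetical):

```python
def erasure_overhead(k, m):
    """Raw bytes stored per byte of user data for a k+m erasure-coded pool."""
    return (k + m) / k


def savings_vs_replication(k, m, copies=3):
    """Fraction of raw capacity saved relative to `copies`-way replication."""
    return 1 - erasure_overhead(k, m) / copies


if __name__ == "__main__":
    for k, m in ((4, 2), (8, 3)):
        print(f"{k}+{m}: overhead {erasure_overhead(k, m):.3f}x, "
              f"savings {savings_vs_replication(k, m):.1%}")
```

With 4+2 the overhead is 1.5x, exactly half of 3x replication; 8+3 comes in at 1.375x, a little over 54 percent less raw capacity than 3x replication.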

Anatomy of the S3A filesystem client

Amazon introduced their Simple Storage Service (S3) in March 2006, which proved to be a watershed moment that ushered in the era of cloud computing services. It wasn’t long before folks started trying to use Amazon S3 in conjunction with Apache Hadoop; in fact, the first attempt was the S3 block filesystem, which was completed before the end of the year. This early integration stored data in a way that facilitated fast renames and deletes, but came at the expense of not being able to access data that had been written to S3 directly. Accessing data written by other applications was highly desirable, and by 2008 the S3N, or S3 Native Filesystem, was merged into Apache Hadoop. The following year, Amazon introduced Elastic MapReduce, which included Amazon’s own closed-source S3 filesystem client.

Netflix was an early adopter of Elastic MapReduce, which was used to analyze data to improve streaming quality. This workload was one of three detailed in Amazon’s press release announcing Netflix’s intent to migrate a variety of applications to Amazon. Fast forward to 2013 and “any data set worth retaining” was stored in S3. To the best of my knowledge, this was the first public reference of a multi-cluster data platform that used S3 for shared storage. With all the chips on the table, Netflix and other cloud heavyweights started seriously thinking about how to better leverage the S3 API and improve S3 client performance. This led to the development of the successor to S3N, named S3A. The development of the S3A filesystem client has manifested as a series of phases and is still seeing loads of active development from the likes of Netflix, Cloudera, Hortonworks, and a whole host of others. Each phase of development is tracked in the Hadoop JIRA:

Downstream Hadoop distributions have done a terrific job of ensuring that juicy features are expeditiously backported, so even if your vendor’s distribution is based on an older Hadoop version, it’s likely that they are ready to consume data stored in S3.

Working with Ceph

Back in the early days of Ceph, we quickly came to the realization that it would be useful to allow folks to use the S3 API to interact with their Ceph storage infrastructure. By doing so, we’d be able to leverage the might of the Amazon ecosystem, including all the SDKs and tools written to interact with Amazon S3. The Ceph object gateway was conceived for this application and is an essential ingredient in providing private infrastructure operators a means to extend cloud object storage modalities into their data center. The S3 API has an ever-expanding set of calls, and keeping up requires a lot of hard work. To keep us honest, we actively develop a functional test suite affectionately named s3-tests. The test suite is so useful that it’s even been adopted by other storage vendors to ensure their products can have the same high level of fidelity with the S3 API. How flattering!

So Ceph and S3 are like two high school buddies, and that’s great. But what’s required to ensure maximum fidelity with Amazon S3?


The Amazon S3 API provides a high-level container for objects which, in S3 parlance, is called a bucket. All objects are stored in exactly one bucket, and each bucket is mapped to a single Amazon S3 region. If you send a GET request for an object that lives in a bucket in another region, you’ll get a 302 redirect to the proper region. PUT requests sent to the wrong region will fail and be provided with the endpoint of the region the bucket calls home. If you have multiple Ceph clusters and want to emulate this behavior, you can do so by having each cluster be a distinct zone, in a distinct zonegroup, but with a common realm. For more details, refer to Ceph multi-site documentation.

If data engineers have a level of understanding about different endpoints available to them, then they can set fs.s3a.endpoint in a properties file such that it directs requests to the correct Ceph cluster or Amazon S3 region.

If you want to be helpful and know that an analytics cluster will only be talking to a single Ceph cluster, and not Amazon S3, then you might set the fs.s3a.endpoint in the core-site.xml. If you plan on having multiple Ceph clusters, or communicating with a Ceph cluster and the public cloud, then you have a few options. One is to set a default fs.s3a.endpoint in the analytics clusters’ core-site.xml to either an Amazon S3 regional endpoint or a Ceph endpoint. Application owners can still override this default with a properties file.

With S3A from Apache Hadoop 2.8.1, users can define different S3A properties for different buckets. This is handy for applications that might operate on data sets in one or more Ceph clusters or Amazon S3 regions.
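To make the two levels of configuration concrete, the sketch below renders a core-site.xml fragment with a default fs.s3a.endpoint plus per-bucket overrides. It assumes the fs.s3a.bucket.<bucket>.endpoint naming that S3A uses for per-bucket settings; the endpoint URLs and bucket name are invented, and the helper itself is hypothetical.

```python
import xml.etree.ElementTree as ET


def core_site_xml(default_endpoint, bucket_endpoints):
    """Render a <configuration> element with S3A endpoint properties.

    bucket_endpoints maps a bucket name to the endpoint that should be
    used for that bucket instead of the default.
    """
    properties = {"fs.s3a.endpoint": default_endpoint}
    for bucket, endpoint in bucket_endpoints.items():
        properties[f"fs.s3a.bucket.{bucket}.endpoint"] = endpoint

    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical endpoints: one Ceph cluster by default, another for "tpc".
    print(core_site_xml(
        "http://rgw-a.example.com:8080",
        {"tpc": "http://rgw-b.example.com:8080"},
    ))
```

An application reading s3a://tpc/... would then be directed to the second gateway, while every other bucket falls back to the default endpoint.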

Access control

Now that there is a level of understanding around endpoints, we can move on to authentication and access control. Amazon S3 has evolved over the years to provide more flexible control over who has access to what. Coarsely, the evolution can be broken down into multiple approaches.

  1. S3 ACLs
  2. Bucket policy with S3 users
  3. STS issued temporary credentials

The first approach is table stakes for any storage system that wants to allow applications to use the S3 API to interact with it. Each user has an access key and a secret key, which they use to sign requests. Buckets always belong to a single user, and buckets can be specified as either public or private. Public buckets are the only means of sharing data between users, which is pretty inflexible. Ceph has supported basic S3 ACLs since the Ceph object gateway was introduced, and it’s not uncommon to see this be the only means of providing access control with other systems that tout an S3-compatible API. These access and secret keys can be mapped to the fs.s3a.access.key and fs.s3a.secret.key parameters in core-site.xml, in a properties file submitted with an application, or, the most secure option, stored in encrypted files using the Hadoop Credentials API.

Ceph doesn’t stop there. With Red Hat Ceph Storage 3.0, based on upstream Luminous, we added support for bucket policy, which allows cross-user bucket sharing. This means one user’s private bucket can be shared with another user through bucket policy.

This is all good and well, but sometimes you want to manage access control for groups of users, instead of having to update a bunch of bucket policies whenever a user needs to be removed from a group. Amazon S3 doesn’t yet have its own notion of groups. Instead, Amazon has IAM, which is a means of enforcing role-based access control. IAM supports users, groups, and roles. This means you can create policies for a group, instead of having policy entries for each individual user. Unfortunately, IAM groups cannot be principals in bucket policies. IAM roles are similar to a user, but they do not have credentials. IAM roles delegate to some other authentication provider, and that authentication provider decides if the request is allowed to assume a particular role.

This is where Amazon STS comes in. STS issues temporary authentication tokens, which are mapped to IAM roles, and that mapping is attested by an external identity provider. As such, the external credentials provider can attest to whether a particular user can receive a token that allows that user to assume an IAM role. The S3A filesystem client supports the use of these tokens by changing the S3A Credentials Provider and setting the fs.s3a.session.token parameter in addition to fs.s3a.access.key and fs.s3a.secret.key. The upstream community is currently working on STS and IAM support in the Ceph object gateway, which will bring all this goodness to folks interacting with Ceph object stores. If this is important to your organization, we’re interested in getting feedback that helps us prioritize STS actions like AssumeRoleWithSAML and AssumeRoleWithWebIdentity.

Bucket prefix vs. path

The AWS Java SDK for S3 defaults to bucket prefix notation when sending requests and, by extension, so does the S3A filesystem client. With this notation, the bucket name is prepended to the hostname of your Ceph object gateway endpoint as a subdomain, and requests are sent to that bucket-specific hostname. To ensure your Ceph storage infrastructure behaves the same way as Amazon S3, you’ll need to configure a few things on the infrastructure side. The first is including the rgw_dns_name parameter in the [rgw] or [global] block of your ceph.conf configuration file. The value of this parameter in this scenario would be the hostname of the gateway endpoint. Now, in order for the client to resolve the bucket subdomain, you’ll also need a wildcard DNS record for that hostname that resolves to your gateway virtual IP address. To support SSL with bucket prefix notation, you’ll need to use a certificate with a wildcard subject alternative name (SAN) wherever it is being used to terminate TLS.

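If you’re not on the infrastructure side of things, and you want to consume Ceph object infrastructure where this machinery hasn’t been configured, then you can opt for path style access by setting the S3A path style access parameter to true. In this configuration, requests will be sent to the gateway hostname with the bucket name as the first element of the path, instead of to a bucket subdomain.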

There is a third trick, which might be helpful in unusual scenarios, and that’s to create a bucket with uppercase characters. Because DNS isn’t case sensitive, the SDK automatically switches over to path style access.
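The difference between the two addressing styles, and the uppercase fallback, can be modeled in a few lines. This is a simplified sketch rather than the SDK’s actual logic: the hostname and bucket names are invented, and only the lowercase check is modeled, whereas the real SDK applies fuller DNS-compatibility rules. On the client side, I believe the S3A property that forces path style is fs.s3a.path.style.access.

```python
def request_url(endpoint_host, bucket, key, path_style=False, scheme="https"):
    """Build the URL a client would request for an object.

    Bucket prefix notation puts the bucket in the hostname; path style puts
    it in the path. Buckets with uppercase characters are treated as not
    DNS-compatible, so they always fall back to path style (simplified).
    """
    dns_compatible = bucket == bucket.lower()
    if path_style or not dns_compatible:
        return f"{scheme}://{endpoint_host}/{bucket}/{key}"
    return f"{scheme}://{bucket}.{endpoint_host}/{key}"


if __name__ == "__main__":
    host = "rgw.example.com"  # hypothetical gateway hostname
    print(request_url(host, "tpc", "inventory/part-0"))
    print(request_url(host, "tpc", "inventory/part-0", path_style=True))
    print(request_url(host, "TPC", "inventory/part-0"))
```

The first URL only resolves if the wildcard DNS record and rgw_dns_name are in place; the other two need nothing beyond the gateway hostname itself.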


We talked a bit about SSL/TLS in the previous section, but only insofar as how to configure things on the infrastructure side. The S3A filesystem client will default to using SSL/TLS. Changing this behavior is done by switching the fs.s3a.connection.ssl.enabled parameter to false. This is useful in scenarios like testing, or when SSL isn’t yet configured for your Ceph storage infrastructure. In a production environment, the adage “dance like nobody’s watching, and encrypt like everyone is” still applies. Check out this blog post for a thorough walk-through on configuring SSL/TLS for your Ceph object storage system.

In addition to transport encryption, objects can be encrypted. Amazon S3 provides two categories of object encryption: (1) encryption at the client and (2) server-side encryption. The S3A filesystem client does not support using client-side encryption, but it does support all three varieties of server-side encryption. The three server-side encryption options are:

  • SSE-S3: Keys managed internally by Amazon S3
  • SSE-C: Keys managed by the client and passed to Amazon S3 to encrypt/decrypt requests
  • SSE-KMS: Keys managed through Amazon’s Key Management Service (KMS)

With the advent of Red Hat Ceph Storage 3.0, we support the SSE-C flavor of server-side encryption. Configuring the S3A filesystem client to use it is a simple affair, involving only two parameters: (1) fs.s3a.server-side-encryption-algorithm should be set to SSE-C and (2) the value of fs.s3a.server-side-encryption.key should be set to your secret key. I suspect most folks will want to do this by way of a properties file.
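A properties file for SSE-C might be produced along these lines. The sketch makes several assumptions: that the key is a 256-bit AES key supplied in Base64 form, that the application is a Spark job (hence the spark.hadoop. prefix for passing Hadoop settings), and that the helper names are hypothetical. Treat the generated key like any other secret.

```python
import base64
import os


def sse_c_properties(key_bytes):
    """Return the two S3A settings for SSE-C as Spark-style properties."""
    if len(key_bytes) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    encoded = base64.b64encode(key_bytes).decode("ascii")
    return {
        "spark.hadoop.fs.s3a.server-side-encryption-algorithm": "SSE-C",
        "spark.hadoop.fs.s3a.server-side-encryption.key": encoded,
    }


def render_properties(properties):
    """Format properties as 'name value' lines, one per setting."""
    return "\n".join(f"{name} {value}" for name, value in properties.items())


if __name__ == "__main__":
    key = os.urandom(32)  # demo key only; store real keys securely
    print(render_properties(sse_c_properties(key)))
```

Because the client supplies the key with each request, losing it means losing access to the objects encrypted with it.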

Upstream Ceph also has support for SSE-KMS, which is intended to be integrated with OpenStack Barbican for key management. I imagine it’s only a matter of time before this is also supported in Red Hat Ceph Storage.

Fast upload

There are two main ways of performing writes with the S3A filesystem client. The default mode buffers uploads to disk before sending a request to Amazon S3, so it’s important to make sure that the directory you are buffering to is large and fast. I prefer to specify a directory supported by a local SSD. Even with fast media, this can be expensive from an IO perspective, so an alternative is to use in-memory buffers. The S3A filesystem client offers two in-memory buffer options: (1) one using arrays and (2) one using byte buffers. You have to be careful when using memory buffers, because it’s easy to run out if you don’t properly size both your JVM and YARN containers. If the available memory permits you to use in-memory buffers, they can be much faster than their disk-based brethren.
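The three buffering choices map onto a handful of S3A settings. The sketch below assumes the property names fs.s3a.fast.upload, fs.s3a.fast.upload.buffer (with values disk, array, and bytebuffer), and fs.s3a.buffer.dir as found in Hadoop 2.8-era S3A; the helper and the example directory are hypothetical, and newer Hadoop releases may not need the fs.s3a.fast.upload switch at all.

```python
# Buffer modes described in the text: disk, on-heap arrays, byte buffers.
BUFFER_MODES = ("disk", "array", "bytebuffer")


def fast_upload_properties(mode, buffer_dir=None):
    """Return S3A upload-buffer settings for the chosen buffering mode."""
    if mode not in BUFFER_MODES:
        raise ValueError(f"mode must be one of {BUFFER_MODES}")
    properties = {
        "fs.s3a.fast.upload": "true",
        "fs.s3a.fast.upload.buffer": mode,
    }
    # A buffer directory only matters when blocks are staged on disk.
    if mode == "disk" and buffer_dir:
        properties["fs.s3a.buffer.dir"] = buffer_dir
    return properties


if __name__ == "__main__":
    # Hypothetical mount point backed by a local SSD.
    print(fast_upload_properties("disk", "/mnt/ssd/s3a"))
    print(fast_upload_properties("bytebuffer"))
```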

Giving it a try

There’s an easy way to learn how to use the S3A filesystem client with Ceph, and this section will walk you through setting up a development environment using Minishift. If you’ve never used Minishift before, it’s relatively painless to set up. The guides for installation on Mac OSX, Windows, and Linux can be found here.

Once you’ve installed Minishift on your local system, drop into a terminal and fire it up.

minishift start

Now that we have a Minishift environment running, we’re going to get Ceph Nano running so we can use it as an S3A endpoint. To get started, you’ll need to download a couple of YAML files: ceph-nano.yml and ceph-rgw-keys.yml. Once those files are downloaded to your working directory, we can use the oc command-line utility to deploy Ceph Nano:

oc --as system:admin adm policy add-scc-to-user anyuid \
oc create -f ceph-rgw-keys.yml
oc create -f ceph-nano.yml
oc expose pod ceph-nano-0 --type=NodePort

You should have a Ceph Nano service running at this point and may be wondering how you’re going to interact with it. Jupyter Notebooks have become wildly popular in the analytics and data-science communities because of their ability to create a single artifact, the notebook, which contains both code and documentation. As if that wasn’t enough, you can interact with them from the comfort of your web browser. Here at Red Hat, we’ve been fostering an awesome community called They’re hard at work empowering intelligent application development on OpenShift. They’ve provided a base notebook application that I used as a starting point for playing with S3A from Jupyter, because its container image is neatly bundled with Spark and PySpark libraries. You can get Jupyter up and running in your Minishift environment with just a few commands:

oc new-app \
  -e RGW_API_ENDPOINT=$(minishift openshift service ceph-nano-0 --url) \

oc env --from=secret/ceph-rgw-keys dc/ceph-notebook
oc expose svc/ceph-notebook
oc status

The “oc status” command will provide a URL of the form http://ceph-notebook-myproject.$(IP). Load it into your browser and follow along!

HTTPS-ization of Ceph object storage public endpoint

Hypertext Transfer Protocol Secure (HTTPS) is the secure version of HTTP; it uses encrypted communication between the user and the server. HTTPS guards against man-in-the-middle attacks by relying on the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols to establish an encrypted connection that shuttles data securely between a client and a server.

This blog post takes you step by step through the process of adding SSL/TLS security to Ceph object storage endpoints. Ceph is a scalable, open-source object storage solution that provides Amazon S3 and SWIFT compatible APIs you can use to build your own public or private cloud object storage solution. Learn more about Ceph at Red Hat Ceph Storage.

Setting up domain record sets

Let’s begin by setting up a domain name and configuring its record sets. If you already have a domain name you’d like to use for the Ceph endpoint, great. If not, you can purchase one from any domain registrar. In this example, we’ll use the domain name

First, we need to add record sets to the domain such that the domain name can resolve into the IP address of the Ceph RADOS Gateway or the load balancer (LB) that is fronting your Ceph RADOS Gateways. To do this, you need to log in to your domain registrar dashboard and create two record sets:

  1. An A-type record set with the domain name and the IPv4 address of your Internet-accessible LB or Ceph RGW host. (If you’re using IPv6, choose the AAAA type.)
  2. A CNAME-type record set with the wildcard subdomain name and the IPv4 address of your Internet-accessible LB or Ceph RGW host.

Note: The choice of a wildcard subdomain name is important. We want all subdomains to resolve to the same IP address, because S3 treats the subdomain prefix as the bucket name.
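To see why every subdomain has to land on the same endpoint, here is a small sketch of how a virtual-hosted-style request maps a hostname to a bucket. It illustrates the idea only and is not how Ceph RGW implements it. The function name and the objects.example.com domain are hypothetical.

```python
from typing import Optional


def bucket_from_host(host: str, dns_name: str) -> Optional[str]:
    """Return the bucket encoded in the Host header, or None for path-style requests."""
    # Drop an optional port and normalize case and trailing dots.
    host = host.split(":", 1)[0].lower().rstrip(".")
    dns_name = dns_name.lower().rstrip(".")
    if host == dns_name:
        return None  # path-style: the bucket is the first element of the URL path
    suffix = "." + dns_name
    if host.endswith(suffix):
        return host[: -len(suffix)]  # everything before the served domain
    raise ValueError(f"{host!r} is not served under {dns_name!r}")


if __name__ == "__main__":
    print(bucket_from_host("photos.objects.example.com", "objects.example.com"))
    print(bucket_from_host("objects.example.com:443", "objects.example.com"))
```

Because the bucket name is chosen by users at run time, no fixed list of DNS records could cover every possible prefix, which is why a wildcard record is used.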

Domain record set additions are highlighted in the following screenshot:

DNS changes usually take a few tens of minutes to propagate. Once the changes are synced, we should be able to ping the domain name as well as any subdomain from anywhere on the Internet, provided ICMP ingress traffic is allowed on the host.

karasing-OSX:~$ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=40 time=1095.231 ms
64 bytes from icmp_seq=1 ttl=40 time=1099.570 ms
64 bytes from icmp_seq=2 ttl=40 time=1199.266 ms
64 bytes from icmp_seq=2 ttl=40 time=1199.266 ms
--- ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1095.231/1131.356/1199.266/48.053 ms
karasing-OSX:~$ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=40 time=1491.105 ms
64 bytes from icmp_seq=1 ttl=40 time=1262.021 ms
64 bytes from icmp_seq=2 ttl=40 time=1205.943 ms
64 bytes from icmp_seq=2 ttl=40 time=1205.943 ms
--- ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1205.943/1319.690/1491.105/123.352 ms

SSL certificate: Installation and setup

An SSL certificate is a set of encrypted files that binds an organization’s identity, domain name, IP address, and cryptographic keys. Once these SSL certificates are installed on the host server, a secure connection is allowed between the host server and the client machine. In a client’s Internet browser, a green padlock will appear next to the URL as a visual cue to users that traffic is protected.

Some organizations might desire a more advanced certificate that requires additional validation. These SSL certificates must be purchased from a trusted Certificate Authority (CA). For the sake of demonstration in this example, we’ll use Let’s Encrypt, which is a certificate authority that provides free X.509 certificates for Transport Layer Security (TLS) encryption. (Credit: Wikipedia. Thanks, Let’s Encrypt, for your free service.)

Next, we’ll install epel-release and certbot, a CLI tool for requesting SSL certificates, from Let’s Encrypt CA.

yum install -y
yum install -y certbot

Request an SSL certificate using the certbot CLI client:

certbot certonly --manual -d -d * --agree-tos
--manual-public-ip-logging-ok --preferred-challenges dns-01 --server

Note: The first -d option in the certbot CLI represents the base domain; the subsequent -d options represent subdomains.

Important: Make sure you are using a wildcard (*) for the subdomain, because we are requesting a wildcard subdomain SSL certificate from Let’s Encrypt. If a certificate is issued only for the base domain, it will not be compatible with subdomain prefix notation.

The following snippet shows the output of certbot CLI command. The DNS challenge method will generate two DNS TXT records, which must be added as a TXT record set for your domain (from the domain registrar dashboard):

[root@ceph-admin ~]# certbot certonly --manual -d -d * --agree-tos
--manual-public-ip-logging-ok --preferred-challenges dns-01 --server
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1):
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for
dns-01 challenge for

Please deploy a DNS TXT record under the name with the following value:

BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4   <== This Value

Before continuing, verify the record is deployed.

Press Enter to continue

Please deploy a DNS TXT record under the name with the following value:

O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ    <== This Value

Before continuing, verify the record is deployed.


Keep the certbot CLI command running, and open your domain registrar dashboard. Deploy a new DNS TXT record under the name, and enter both these values (marked with <== in the preceding output) in two separate lines inside double quotes. The following screenshot shows how:

Keep the certbot CLI command running and, once you’ve successfully added DNS TXT record in your domain record set, open another terminal. Now, we’ll verify if DNS TXT records are applied on your domain by running the command host -t txt You should be able to see the same DNS TXT as mentioned in certbot CLI. If you do not, wait for a few minutes for DNS synchronization to occur.

karasing-OSX:~$ host -t txt descriptive text
"BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4" descriptive text

Once you’ve verified DNS TXT records are applied to your domain, return to certbot CLI and press Enter to continue. You will be notified that the system is waiting for verification and cleaning up challenges. You should then see the following message:

Congratulations! Your certificate and chain have been saved at:
Your key file has been saved at:
Your cert will expire on 2018-10-07. To obtain a new or tweaked
version of this certificate in the future, simply run certbot
again. To non-interactively renew *all* of your certificates, run
"certbot renew"

If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt:
Donating to EFF:          

[root@ceph-admin ~]#

At this point, SSL certificates have been issued to your domain/host, so now we must verify them.

[root@ceph-admin ~]# ls -l /etc/letsencrypt/live/
total 4
lrwxrwxrwx 1 root root  35 Jul 9 19:06 cert.pem -> ../../archive/
lrwxrwxrwx 1 root root  36 Jul 9 19:06 chain.pem -> ../../archive/
lrwxrwxrwx 1 root root  40 Jul 9 19:06 fullchain.pem -> ../../archive/
lrwxrwxrwx 1 root root  38 Jul 9 19:06 privkey.pem -> ../../archive/
-rw-r--r-- 1 root root 682 Jul  9 19:06 README
[root@ceph-admin ~]#

Installing LB for object storage service

Many like HAProxy, because it’s easy and does its job well. In this example, we will use HAProxy to perform SSL termination for our domain name (Ceph object storage endpoint).

Note: Starting with Red Hat Ceph Storage 2, Ceph RADOS Gateway natively supports TLS by relying on the OpenSSL library. You can get more information on native SSL/TLS configuration here. In this example, we specifically choose to terminate SSL at the HAProxy level. The advantage is that when we have multiple Ceph RGW instances, we do not need a separate domain name or SSL certificate for each of them. One domain name with SSL termination at the LB does the job.

Next, we’ll install HAproxy on the same host whose public IP has been bound with your domain name, in this case,

yum install -y haproxy

Next, we’ll create a certs directory.

mkdir -p /etc/haproxy/certs

Next, combine the certificate files fullchain.pem and privkey.pem into a single file for our domain.

DOMAIN='' sudo -E bash -c 'cat /etc/letsencrypt/live/$DOMAIN/fullchain.pem \
/etc/letsencrypt/live/$DOMAIN/privkey.pem > /etc/haproxy/certs/$DOMAIN.pem'

The next step is to change the permission of the certs directory.

chmod -R go-rwx /etc/haproxy/certs

Optionally, you can move the haproxy.cfg original file and create a new config file with the following configuration settings:

mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.orig
vim /etc/haproxy/haproxy.cfg
global
    log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/
    maxconn     4000
    user        haproxy
    group       haproxy
    tune.ssl.default-dh-param 2048     
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                    global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except
    option                  redispatch
    option httpchk HEAD /
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
frontend www-http
   reqidel                      ^X-Forwarded-For:.*
   reqadd X-Forwarded-Proto:\ http
   default_backend www-backend
   option  forwardfor
frontend www-https
   bind ssl crt /etc/haproxy/certs/
   reqadd X-Forwarded-Proto:\ https
   acl letsencrypt-acl path_beg /.well-known/acme-challenge/
   use_backend letsencrypt-backend if letsencrypt-acl
   default_backend www-backend
backend www-backend
   redirect scheme https if !{ ssl_fc }
   server ceph-admin check inter 2000 rise 2 fall 5
backend letsencrypt-backend
   server letsencrypt

As you may have noted, we’re using a couple of non-default parameters in the haproxy config file, such as:

  • tune.ssl.default-dh-param is required to provide OpenSSL the necessary parameters for the SSL/TLS handshake.
  • frontend www-http binds haproxy to port 80 of the local machine and redirects traffic to the default backend www-backend. If a user uses HTTP protocol in the request, it should redirect to HTTPS.
  • frontend www-https binds haproxy to port 443 of the local machine. It also redirects traffic to the default backend www-backend and uses the SSL certificate path to encrypt/terminate the traffic.
  • frontend www-https also uses letsencrypt-backend if you want to auto-renew the SSL certificate from Let’s Encrypt CA.
  • backend www-backend simply redirects all the SSL-terminated traffic to the Ceph RGW node. For HA and performance, you must have multiple Ceph RGW instances whose IPs should be added in the same backend section so that HAProxy can load balance among Ceph RGW instances.
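As a hedged illustration of that last point, the sketch below renders a www-backend section with one server line per Ceph RGW instance. The addresses (from the 192.0.2.0/24 documentation range), the rgw-N server names, and the balance roundrobin line are assumptions for the example. The health-check options and port 8081 are copied from the configuration in this post.

```python
from typing import Iterable


def www_backend(rgw_addresses: Iterable[str], port: int = 8081) -> str:
    """Render an HAProxy backend section listing every Ceph RGW instance."""
    lines = [
        "backend www-backend",
        "   balance roundrobin",  # assumption: spread requests evenly
        "   redirect scheme https if !{ ssl_fc }",
    ]
    for index, address in enumerate(rgw_addresses, start=1):
        # Hypothetical server names rgw-1, rgw-2, ...; checks as in the post.
        lines.append(
            f"   server rgw-{index} {address}:{port} check inter 2000 rise 2 fall 5"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(www_backend(["192.0.2.11", "192.0.2.12", "192.0.2.13"]))
```

With the check options in place, HAProxy stops sending traffic to an instance that fails its health checks and resumes once it recovers.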

Finally, we restart HAProxy and verify it’s listening on ports 80 and 443:

systemctl start haproxy
systemctl status haproxy ; netstat -plunt | grep -i haproxy

Configuring Ceph RGW

Up to this point, we’ve configured the domain name, set the required record sets, generated an SSL certificate, configured HAProxy to encrypt/terminate SSL, and redirected traffic to the Ceph RGW instance. Now we need to configure Ceph RGW to listen on port 8081 as configured in HAProxy. To do so, on the Ceph RGW node, edit /etc/ceph/ceph.conf and update the client.rgw section as shown in the following:

host = ceph-admin
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-admin/keyring
log file = /var/log/ceph/ceph-rgw-ceph-admin.log
rgw frontends = civetweb port= num_threads=512
rgw resolve cname = true
rgw dns name =

Note: rgw resolve cname = true forces rgw to use the DNS CNAME record of the request hostname field (if the hostname is not equal to rgw dns name).

Note: rgw dns name = is the DNS name of the served domain.

Now, we’ll restart the Ceph RGW instance and verify it’s listening on port 8081.

systemctl restart ceph-radosgw@rgw.ceph-admin.service
netstat -plunt | grep -i rados

Accessing Ceph object storage secure endpoint

To test the HTTPS-enabled Ceph object storage URL, execute the following curl command or type in any web browser:


It should yield output like the following:

[student@ceph-admin ~]$ curl
[student@ceph-admin ~]$
[student@ceph-admin ~]$

Let’s try accessing Ceph object storage using S3cmd:

yum install -y s3cmd

Configure the s3cmd CLI by providing the access and secret keys, and by setting the secure Ceph S3 endpoint in the host and host-bucket parameters.

s3cmd --access_key=S3user1 --secret_key=S3user1key
--host-bucket="%(bucket)" --dump-config > /home/student/.s3cfg

Note: By default, s3cmd uses an HTTPS connection, so there is no need to explicitly specify that.

Next, we’ll interact with Ceph object storage using the s3cmd ls and s3cmd mb commands:

[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
[student@ceph-admin ~]$
[student@ceph-admin ~]$ s3cmd mb s3://secure_bucket
Bucket 's3://secure_bucket/' created
[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
2018-07-09 19:55  s3://secure_bucket
[student@ceph-admin ~]$

Congratulations! You’ve successfully secured your Ceph object storage endpoint using the domain name of your choice and SSL certificates.


As you can see, acquiring and setting up SSL certificates involves some careful configuration, and how easy and fast it is to acquire a certificate depends on your chosen CA. With initiatives like “HTTPS Everywhere,” it’s no longer just web sites hosting deliverable content that must have SSL; API and service endpoints should also offer encrypted transport.

Note: HTTPS is designed to prevent eavesdropping and man-in-the-middle attacks. Always practice defense in depth. Multiple layers of security are needed to more fully secure your web site/service endpoints.

Why are customers choosing Red Hat’s Container-Native Storage in the public cloud with OpenShift?

By Sayandeb Saha, Director, Product Management, Storage Business Unit

In our last blog post in this series, we talked about how the Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen increased customer adoption in on-premises environments by offering a peaceful coexistence approach with classic storage arrays that are not deeply integrated with OpenShift. In this post, we’ll explore why many customers are deploying our CNS offering in the three big public clouds (AWS, Microsoft Azure, and Google Cloud Platform) on top of the clouds’ native storage offerings, despite good integration of Kubernetes with those native offerings. Let’s examine the problems and constraints customers run into with native cloud storage in a bit more detail and describe how CNS addresses them.

Slow attach/detach, poor availability

The first issue stems from the fact that the native block storage offerings (EBS in AWS, Data Disk in Azure, Persistent Disk in Google Cloud) in the public cloud were designed and engineered to support virtual machine (VM) workloads. In such workloads, attaching and consequently detaching a block device to a machine image/instance is an infrequent occurrence at best, as these workloads are less dynamic compared to Platform-as-a-Service (PaaS) and DevOps workloads, which frequently run on OpenShift powering dynamic build and deploy CI/CD pipelines and other similar workloads and workflows.

Some of our customers found that attach and detach times for these block devices, when directly accessed from OpenShift workloads using the native Kubernetes storage provisioners, were unacceptable because they led to poor startup times for pods (slow attach) and limited or no high availability on a failover, which usually triggers a sequence that includes a detach operation, an attach operation, and a subsequent mount operation.

Each of these operations usually triggers a variety of API calls specific to the public cloud provider. Any or all of these intermediate steps can fail, causing users to lose access to persistent volumes (PVs) for their compute pods for an extended period. Overlaying Red Hat’s CNS offering as a storage management fabric to aggregate, pool, and serve out PVs expediently, without worrying about the status of individual cloud-native block storage devices (e.g., EBS or Azure Data Disk), can provide major relief. It effectively isolates the lifecycle of cloud-native block storage devices from that of the application pods, which allocate and deallocate PVs dynamically as application teams work on OpenShift.

Block device limits per compute instance

The second issue some of our customers run into is the fact that there is a limit to the number of block devices that one can attach to the machine images or instances in various public cloud environments.

OpenShift supports a maximum of 250 containers per host. The maximum number of block devices that can be attached to a machine instance is far lower (for example, a maximum of 40 EBS devices per EC2 instance). Even though it is unusual to have a 1:1 mapping between containers and storage devices, this low maximum can lead to a lot of unintended behavior, not to mention a higher total cost of ownership (more hosts are needed than would otherwise be necessary).

For example, in a failover scenario during the detach, attach, and mount sequence, the API call to attach might fail, because the maximum number of devices is already attached to the EC2 instance where this attempt is being made, which can cause a glitch/outage. Overlaying Red Hat’s CNS offering as a storage management fabric on cloud-based block devices mitigates the impact of hitting the maximum number of devices that can be attached to a machine image or instance, because storage is served out from a pool that is unencumbered by the per-instance device limit. Storage can continue to be served out until the entire pool is exhausted, at which time the pool can be expanded by adding new hosts and devices.

Cross-AZ storage availability

The third issue arises from the fact that cloud block storage devices are usually accessible within a specific Availability Zone (AZ) in AWS or Availability Sets in Azure. AZs are like failure domains in public clouds.

Most customers who deploy OpenShift in the public cloud do so to span more than one AZ for high availability. This is done so that when one AZ dies or goes offline, the OpenShift cluster remains operational. Using block devices constrained to an AZ for providing storage services to OpenShift workloads can defeat the purpose, because then containers must be scheduled within hosts that belong to the same AZ, and customers cannot leverage the full power of Kubernetes orchestration. This configuration could also lead to an outage when an AZ goes offline.

Our customers use CNS to mitigate this problem so that even when there is an AZ failure, a three-way replicated cross-AZ storage service (CNS) is available for containerized applications to avoid downtimes. This also enables Kubernetes to schedule pods across AZs (instead of within an AZ), thereby preserving the spirit of the original fault-tolerant OpenShift deployment architecture that spans multiple AZs.

Cost-effective storage consolidation

Storage provided by CNS is efficiently allocated and offers performance with the first gigabyte provisioned, thereby enabling storage consolidation. For example, consider six MySQL database instances, each in need of 25 GiB of storage capacity and up to 1500 IOPS at peak load. With EBS in AWS, one would create six EBS volumes, each with at least 500 GiB capacity out of the gp2 (General Purpose SSD) EBS tier, in order to get 1500 IOPS. The level of performance is tied to provisioned capacity with EBS.

With CNS, one can achieve the same level using only three EBS volumes at 500 GiB capacity from the gp2 tier and run these with GlusterFS. One would create six 25 GiB volumes and provide storage to many databases with high IOPS performance, provided they don’t all peak at the same time. Doing that, one would halve EBS cost and still have capacity to spare for other services. Read IOPS performance is likely even higher, because with three-way replication in CNS, reads are distributed across three 1500-IOPS gp2 EBS volumes.
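The arithmetic behind this consolidation example can be worked through in a few lines. The sketch assumes the gp2 baseline of 3 IOPS per provisioned GiB with a floor of 100 IOPS, which is general AWS knowledge and not stated in this post, and it ignores the tier’s upper IOPS cap. The function names are illustrative.

```python
GP2_IOPS_PER_GIB = 3  # assumed gp2 baseline: 3 IOPS per provisioned GiB
GP2_MIN_IOPS = 100    # assumed gp2 floor for small volumes


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume (the tier's upper cap is ignored here)."""
    return max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib)


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume, in GiB, whose baseline reaches target_iops."""
    return -(-target_iops // GP2_IOPS_PER_GIB)  # ceiling division


def compare(databases: int, db_gib: int, peak_iops: int, replicas: int = 3) -> dict:
    """Provisioned EBS capacity with one volume per database versus CNS."""
    volume_gib = max(db_gib, gp2_size_for_iops(peak_iops))
    return {
        "volume_gib": volume_gib,
        "direct_gib": databases * volume_gib,   # one EBS volume per database
        "cns_gib": replicas * volume_gib,       # one EBS volume per replica
        "cns_usable_gib": volume_gib,           # three-way replication keeps one copy usable
        "cns_requested_gib": databases * db_gib,
        "read_iops_ceiling": replicas * gp2_baseline_iops(volume_gib),
    }


if __name__ == "__main__":
    for name, value in compare(databases=6, db_gib=25, peak_iops=1500).items():
        print(f"{name}: {value}")
```

For six 25 GiB databases peaking at 1500 IOPS, this gives 3000 GiB provisioned directly versus 1500 GiB under CNS, with only 150 GiB of the 500 GiB usable pool requested. That is the halved cost and spare capacity described above.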

Check us out for more

As you can see, there’s a good case to be made for using CNS in various public clouds for a multitude of technical reasons our customers care about, besides the fact that Red Hat CNS provides a consistent storage consumption and management experience across hybrid and multi clouds (see the following figure).


Red Hat CNS runs anywhere and everywhere Red Hat OpenShift Container Platform runs.

In addition to the application portability that OpenShift already provides across hybrid and multi clouds, we’re working on multi cloud replication features that would enable CNS to effectively become the data fabric that enables data portability—another good reason to select and stay with CNS. Stay tuned for more information on that!

For hands-on experience now combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.

What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising; we’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze the advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of data transfer rate of 50 MB/s. To determine the available bisectional bandwidth per host, we’ll divide the cluster wide bisectional bandwidth by the number of hosts.

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained in contemporary Hadoop nodes with 12 SATA disks and a 10GbE network interface. After we leave the host and arrive at the network bisection, the challenge facing Google engineers is immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone led to the development of the MapReduce locality optimization.
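The speeds-and-feeds arithmetic can be checked with a short sketch. The cluster figures come from the slide material quoted above, and the 50 MB/s per disk is the generous assumption already made in the text. The function names are illustrative only.

```python
def disk_throughput_gbps(disks: int, mb_per_s: float) -> float:
    """Aggregate disk throughput per host, converted from MB/s to Gb/s."""
    return disks * mb_per_s * 8 / 1000


def bisection_per_host_mbps(bisection_gbps: float, machines: int) -> float:
    """Each host's share of the cluster-wide bisection bandwidth, in Mb/s."""
    return bisection_gbps * 1000 / machines


def oversubscription(machines: int, nic_gbps: float, bisection_gbps: float) -> float:
    """Ratio of total host NIC bandwidth to bisection bandwidth."""
    return machines * nic_gbps / bisection_gbps


if __name__ == "__main__":
    print(f"disks per host:     {disk_throughput_gbps(2, 50):.1f} Gb/s")
    print(f"NIC per host:       1.0 Gb/s")
    print(f"bisection per host: {bisection_per_host_mbps(100, 1800):.1f} Mb/s")
    print(f"oversubscription:   {oversubscription(1800, 1, 100):.0f} to 1")
```

Two 50 MB/s disks deliver about 0.8 Gb/s, close to the 1 Gb/s host interface, while each host’s share of the bisection is only about 56 Mb/s. That gap is the 18-to-1 oversubscription.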

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.

Advantages

When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.

Disadvantages

Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

One key example is large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that even with these abilities it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components, for which an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest and greatest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload-dedicated clusters, with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.


Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these modalities has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.

Why Spark on Ceph? (Part 3 of 3)


A couple of years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we in the Red Hat Storage solutions architecture team wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (this blog post)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series in mid-July, providing detailed descriptions, test data, and configuration scenarios.

Findings summary

We did Ceph vs. HDFS testing with a variety of workloads (see blog Part 2 of 3 for general workload descriptions). As expected, the price/performance comparison varied based on a number of factors, summarized below.

Clearly, many factors contribute to overall solution price. As storage capacity is frequently a major component of big data solution price, we chose it as a simple proxy for price in our price/performance comparison.

The primary factor affecting storage capacity price in our comparison was the data durability scheme used. With 3x replication data durability, a customer needs to buy 3PB of raw storage capacity to get 1PB of usable capacity. With erasure coding 4:2 data durability, a customer only needs to buy 1.5PB of raw storage capacity to get 1PB of usable capacity. The primary data durability scheme used by HDFS is 3x replication (support for HDFS erasure coding is emerging, but is still experimental in several distributions).  Ceph has supported either erasure coding or 3x replication data durability schemes for years. All Spark-on-Ceph early adopters we worked with are using erasure coding for cost efficiency reasons. As such, most of our tests were run with Ceph erasure coded clusters (we chose EC 4:2). We also ran some tests with Ceph 3x replicated clusters to provide apples-to-apples comparison for those tests.
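The raw-versus-usable arithmetic above can be sketched in a few lines of Python. This is only an illustrative calculation: the function names are hypothetical, and it ignores real-world overheads such as metadata and free-space headroom.

```python
def raw_pb_replicated(usable_pb, copies=3):
    """Raw capacity needed when every object is stored `copies` times."""
    return usable_pb * copies


def raw_pb_erasure_coded(usable_pb, data_chunks=4, coding_chunks=2):
    """Raw capacity needed for k data chunks plus m coding chunks (EC k:m)."""
    return usable_pb * (data_chunks + coding_chunks) / data_chunks


usable = 1.0  # PB of usable capacity, as in the example above
replicated = raw_pb_replicated(usable)
erasure_coded = raw_pb_erasure_coded(usable)
savings = 1 - erasure_coded / replicated

print(f"3x replication: {replicated:.1f} PB raw")
print(f"EC 4:2:         {erasure_coded:.1f} PB raw")
print(f"Raw capacity saved with EC 4:2: {savings:.0%}")
```

Changing `data_chunks` and `coding_chunks` shows how other erasure coding profiles would shift the ratio; only EC 4:2 was used in the tests described here.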

Using the proxy for relative price noted above, Figure 1 provides an HDFS vs. Ceph price/performance summary for the workloads indicated:

Figure 1: Relative price/performance comparison, based on results from eight different workloads

Figure 1 depicts price/performance comparisons based on eight different workloads. Each of the eight individual workloads was run with both HDFS and Ceph storage back-ends. The storage capacity price of the Ceph solution relative to the HDFS solution is either the same or 50% less. When the workload was run with Ceph 3x replicated clusters, the storage capacity price is shown as the same as HDFS. When the workload was run with Ceph erasure coded 4:2 clusters, the Ceph storage capacity price is shown as 50% less than HDFS. (See the previous discussion on how data durability schemes affect solution price.)

For example, workload 8 had similar performance with either Ceph or HDFS storage, but the Ceph storage capacity price was 50% of the HDFS storage capacity price, as Ceph was running an erasure coded 4:2 cluster. In other examples, workloads 1 and 2 had similar performance with either Ceph or HDFS storage and also had the same storage capacity price (workloads 1 and 2 were run with a Ceph 3x replicated cluster).

Findings details

A few details are provided here for the workloads tested with both Ceph and HDFS storage, as depicted in Figure 1.

  1. This workload was a simple test to compare aggregate read throughput via TestDFSIO. As shown in Figure 2, this workload performed comparably between HDFS and Ceph, when Ceph also used 3x replication. When Ceph used erasure coding 4:2, the workload performed better than either HDFS or Ceph 3x for lower numbers of concurrent clients (<300). With more client concurrency, however, the workload performance on Ceph 4:2 dropped due to spindle contention (a single read with erasure coded 4:2 storage requires 4 disk accesses, vs. a single disk access with 3x replicated storage).

    Figure 2: TestDFSIO read results
  2. This workload compared the SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2.

    Figure 3: Single-user Spark query set results
  3. This workload compared Impala query performance of 10 users each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was 57% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  4. This mixed workload featured concurrent execution of a single user running SparkSQL queries (54), 10 users each running Impala queries (54 each), and a data set merge/join job enriching TPC-DS web sales data with synthetic clickstream logs. As illustrated in Figure 1, the aggregate execution time of this mixed workload on Ceph EC4:2 was 48% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  5. This workload was a simple test to compare aggregate write throughput via TestDFSIO. As depicted in Figure 1, this workload performed, on average, 50% slower on Ceph EC4:2 compared to HDFS, across a range of concurrent clients/writers. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  6. This workload compared SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2. However, price/performance was nearly comparable when running against Ceph EC4:2, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  7. This workload featured enrichment (merge/join) of TPC-DS web sales data with synthetic clickstream logs, and then writing the updated web sales data. As depicted in Figure 4, this workload was 37% slower on Ceph EC4:2 compared to HDFS. However, price/performance was favorable for Ceph, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.

    Figure 4: Data set enrichment (merge/join/update) job results
  8. This workload compared the SparkSQL query performance of 10 users each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was roughly comparable to that of HDFS, despite requiring only 50% of the storage capacity costs. Price/performance for this workload thus favors Ceph by 2x. For more insight into this workload performance, see Figure 5. In this box-and-whisker plot, each dot reflects a single SparkSQL query execution time. As each of the 10 users concurrently executes 54 queries, there are 540 dots per series. The three series shown are Ceph EC4:2 (green), Ceph 3x (red), and HDFS 3x (blue). The Ceph EC4:2 box shows comparable median execution times to HDFS 3x, and shows more consistent query times in the middle 2 quartiles.

    Figure 5: Multi-user Spark query set results
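
The price/performance statements in the list above can be reproduced with a small calculation. The posts do not define the metric precisely, so this sketch assumes price/performance is the product of relative storage capacity price and relative runtime; the function and variable names are hypothetical, and the slowdown figures are the ones quoted for Ceph EC4:2 in the list.

```python
# Slowdowns on Ceph EC4:2 relative to HDFS, as quoted in the list above.
# 0.0 means "roughly comparable"; 1.0 means the runtime doubled.
SLOWDOWN_ON_CEPH_EC = {
    "3 (multi-user Impala queries)": 0.57,
    "4 (mixed workload)": 0.48,
    "5 (TestDFSIO write)": 0.50,
    "6 (single-user SparkSQL queries)": 1.00,
    "7 (data set enrichment)": 0.37,
    "8 (multi-user SparkSQL queries)": 0.00,
}

# EC 4:2 needs half the raw capacity of 3x replication.
EC_RELATIVE_PRICE = 0.5


def relative_cost_per_job(slowdown, relative_price=EC_RELATIVE_PRICE):
    """Ceph (price x runtime) over HDFS (price x runtime); below 1 favors Ceph."""
    return relative_price * (1.0 + slowdown)


for name, slowdown in SLOWDOWN_ON_CEPH_EC.items():
    print(f"workload {name}: {relative_cost_per_job(slowdown):.2f}")
```

Under this assumed metric, a value of 1.00 means the capacity savings exactly offset the longer runtime, and 0.50 corresponds to the 2x advantage described for workload 8.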

Bonus results section: 24-hour ingest

One of our prospective Spark-on-Ceph customers recently asked us to illustrate Ceph cluster sustained ingest rate over a 24-hour time period. For these tests, we used variations of the lab as described in blog 2 of 3. As noted in Figure 6, we measured a raw ingest rate of approximately 1.3 PiB per day into a Ceph EC4:2 cluster configured with 700 HDD data drives (Ceph OSDs).

Figure 6: Daily data ingest rate into Ceph clusters of various sizes
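
To put the daily figure in more familiar units, the sketch below converts 1.3 PiB per day into an average aggregate rate and an average rate per data drive. It is a back-of-the-envelope calculation with hypothetical names; the post does not say whether the "raw" figure includes erasure coding overhead, so the stated number is simply divided as given.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MIB_PER_PIB = 2 ** 30  # 1 PiB = 2**30 MiB


def ingest_rates(pib_per_day, drives):
    """Return (aggregate GiB/s, per-drive MiB/s) averaged over a full day."""
    aggregate_mib_s = pib_per_day * MIB_PER_PIB / SECONDS_PER_DAY
    return aggregate_mib_s / 1024, aggregate_mib_s / drives


aggregate_gib_s, per_drive_mib_s = ingest_rates(pib_per_day=1.3, drives=700)
print(f"Aggregate ingest: {aggregate_gib_s:.1f} GiB/s")
print(f"Per data drive:   {per_drive_mib_s:.1f} MiB/s")
```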

Concluding observations

In conclusion, below is our formative cost/benefit analysis of the above results summarizing this blog series.

  • Benefits, Spark-on-Ceph vs. Spark on traditional HDFS:
    1. Reduce CapEx by reducing duplication: Reduce PBs of redundant storage capacity purchased to store duplicate data sets in HDFS silos, when multiple analytics clusters need access to the same data sets.
    2. Reduce OpEx/risk: Eliminate costs of scripting/scheduling data set copies between HDFS silos, and reduce risk-of-human-error when attempting to maintain consistency between these duplicate data sets on HDFS silos, when multiple analytics clusters need access to the same data sets.
    3. Accelerate insight from new data science clusters: Reduce time-to-insight when spinning up new data science clusters by analyzing data in-situ within a shared data repository, as opposed to hydrating (copying data into) a new cluster before beginning analysis.
    4. Satisfy different tool/version needs of different data teams: While sharing data sets between teams, enable users within each cluster to choose the Spark/Hadoop tool sets and versions appropriate to their jobs, without disrupting users from other teams requiring different tools/versions.
    5. Right-size CapEx infrastructure costs: Reduce over-provisioning of either compute or storage common with provisioning traditional HDFS clusters, which grow by adding generic nodes (regardless of whether only more CPU cores or storage capacity is needed), by right-sizing compute needs (vCPU/RAM) independently from storage capacity needs (throughput/TB).
    6. Reduce CapEx by improving data durability efficiency: Reduce CapEx of storage capacity purchased by up to 50% due to Ceph erasure coding efficiency vs. HDFS default 3x replication.
  • Costs, Spark-on-Ceph vs. Spark on traditional HDFS:

    1. Query performance: Performance of Spark/Impala query jobs ranged from 0%-131% longer execution times (single-user and multi-user concurrency tests).
    2. Write-job performance: Performance of write-oriented jobs (loading, transformation, enrichment) ranged from 37%-200%+ longer execution times. [Note: Significant improvements in write-job performance are expected when downstream distributions adopt the following upstream enhancements to the Hadoop S3A client: HADOOP-13600, HADOOP-13786, HADOOP-12891.]
    3. Mixed-workload performance: Performance of multiple query and enrichment jobs concurrently executed resulted in 90% longer execution times.

For more details (and a hands-on chance to kick the tires of this solution yourself), stay tuned for the architect-level blog series in this same Red Hat Storage blog location. Thanks for reading.

Why Spark on Ceph? (Part 2 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph (Part 3 of 3)“)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store.  In other words, target data locations for their analytics jobs are something like “s3://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”.  Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible via the Hadoop S3A client.
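
As a rough illustration of what pointing a job at an object store involves, the Python sketch below builds the Spark properties that configure the Hadoop S3A client and rewrites an HDFS location as an object store location. The endpoint URL, credentials, and helper names are placeholders invented for this example, not values from our lab. Note that the S3A client addresses data with the s3a:// URI scheme.

```python
from urllib.parse import urlparse


def s3a_spark_conf(endpoint, access_key, secret_key):
    """Spark properties that point the Hadoop S3A client at an S3-compatible endpoint."""
    use_ssl = endpoint.startswith("https")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style requests avoid needing a DNS name per bucket.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def hdfs_to_s3a(hdfs_uri, bucket):
    """Rewrite an hdfs:// location as the same path inside an S3A bucket."""
    return f"s3a://{bucket}{urlparse(hdfs_uri).path}"


def spark_submit_args(conf):
    """Flatten the properties into --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder endpoint and credentials, for illustration only.
conf = s3a_spark_conf("http://rgw.example.com:8080", "ACCESS_KEY", "SECRET_KEY")
print(hdfs_to_s3a("hdfs:///path-to-file", "bucket-name"))
print(" ".join(spark_submit_args(conf)))
```

In practice, credentials would normally come from a credential provider or environment configuration rather than the command line.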

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs which were representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web.  Its schema has 10s of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for our tests. With partners in industry, we also supplemented the TPC-DS data set with simulated click-stream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain text data into Parquet or ORC columnar, compressed formats)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes.  1000s of the analytics jobs described above completed successfully.  SparkSQL, Hive, MapReduce, and Impala jobs all used the S3A client to read and write data directly to a shared Ceph object store.  The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line – what was the performance compared to traditional HDFS?  Stay tuned for part 3 of this series….

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series.  In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on-time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with their own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But, the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is cost prohibitive to many companies (both CapEx and OpEx).
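
To make the scale of that duplication concrete, the sketch below compares the raw capacity consumed by several HDFS silos, each holding its own copy of a data set, with one shared copy in an erasure coded object store. The 2 PB data set size is purely hypothetical, and the durability schemes (HDFS 3x replication, Ceph erasure coding 4:2) are assumptions taken from typical deployments rather than from any specific customer.

```python
def hdfs_silos_raw_pb(dataset_pb, silo_copies, replication=3):
    """Raw capacity when each silo keeps its own replicated copy of the data set."""
    return dataset_pb * silo_copies * replication


def shared_store_raw_pb(dataset_pb, data_chunks=4, coding_chunks=2):
    """Raw capacity for one shared, erasure coded copy of the data set."""
    return dataset_pb * (data_chunks + coding_chunks) / data_chunks


dataset_pb = 2.0  # hypothetical multi-PB data set
shared = shared_store_raw_pb(dataset_pb)
for copies in (5, 10, 20):
    silos = hdfs_silos_raw_pb(dataset_pb, copies)
    print(f"{copies:>2} silo copies: {silos:.0f} PB raw vs. {shared:.0f} PB raw shared")
```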

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems


Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client.  In AWS, you can spin up many analytics clusters on EC2 instances, and share data sets between them on Amazon S3 (e.g., see Cloudera CDH support for Amazon S3).  No more lengthy delays hydrating HDFS storage after spinning up new clusters, or de-staging HDFS data upon cluster termination.  With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.

Bottom line: more and more data scientists and analysts are accustomed to spinning up analytic clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-stage cycles, and expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage.  It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

Stay tuned for the next blog in the series, “Why Spark on Ceph? (Part 2 of 3)”, which asks: Will mainstream analytics jobs run directly against a Ceph object store?

Leverage your existing storage investments with container-native storage

By Sayandeb Saha, Director, Product Management

The Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen wide customer adoption in the past year or so. Customers are deploying it in a wide variety of environments that include bare metal, virtualized, and private and public clouds. It mimics the diverse spread of environments in which OpenShift itself gets deployed—which is also CNS’s key strength (i.e., being able to back OpenShift wherever it runs—see the following graphic).

During the past year of customer adoption of CNS, we’ve observed some key trends that are unique for OpenShift/Kubernetes storage and that we’ll highlight in a series of blogs. This blog series will also include business and technical solutions that have worked for our customers.

In this blog post, we examine a trend where customers have adopted CNS as a storage management fabric that sits in between the OpenShift Container Platform and their classic storage gear. This particular adoption pattern continues to have a really high uptake, and there are sound business and technical reasons for doing this, which we’ll explore here.

First the Solution (The What): We’ve seen a lot of customers deploying CNS to serve out storage from their existing storage arrays/SANs and other traditional storage, as illustrated in the following graphic. In this scenario, block devices from existing storage arrays are served out with our CNS software running in VMs or containers/pods to OpenShift. In this case, the storage for the VMs that run OpenShift is still served by the arrays.

Now the Why: Initially, it seemed backward as to why customers would be doing this; after all, software-defined storage solutions like CNS are meant to run on x86 bare metal (on premise) or in the public cloud, but further investigation revealed some interesting discoveries.

While OpenShift users and ops teams consume infrastructure, they typically do not manage infrastructure. In on-premise environments, OpenShift ops teams are highly dependent on other infrastructure teams for virtualization, storage, and operating systems for the infrastructure on which they run OpenShift. Similarly, in public clouds they consume the native compute and storage infrastructure available in these clouds.

As a consequence, they are highly dependent on storage infrastructure that is already in place. Typically, it’s very difficult to justify purchasing storage servers for a new use case (OpenShift storage in this case) when storage was already procured from a traditional storage vendor a year or more ago. The issue is that this traditional storage was neither designed nor intended for use with containers, and the storage budget has mostly been spent. This has driven OpenShift operations teams to adopt CNS, in effect, as a storage management fabric that sits between their OpenShift Container Platform deployment and their existing storage array. What is being leveraged here is the inherent flexibility of Red Hat Gluster Storage, delivered in the form of CNS, which can aggregate and pool block devices that are attached to a VM and serve them out to OpenShift workloads. OpenShift operations teams can now have the best of both worlds: They can repurpose the storage array that is already in place on premises while actually consuming CNS, which operates as a management fabric offering the latest features, functionality, and manageability, along with deep integration with the OpenShift platform.

In addition to business reasons, there are also various technical reasons that these OpenShift operations teams are adopting CNS. These include, but are not limited to:

  • Lack of deep integration of their existing storage arrays with OpenShift Container Platform
  • Even if their traditional storage array has rudimentary integration with OpenShift, very likely it has limited feature support (for example, no dynamic provisioning), which renders it unusable with many OpenShift workloads
  • The roadmap of their storage array vendor may not match their current (or future) OpenShift/Kubernetes storage feature support needs; for example, a Persistent Volume (PV) resize feature may not be available
  • Needing a fully featured OpenShift Storage solution for OpenShift workloads as well as the OpenShift infrastructure itself. Many existing storage platforms can support one or the other, but not both. For instance, a storage array serving out Fibre Channel LUNs (plain block storage) can’t back an OpenShift registry, because the registry needs shared storage access, which is usually provided by a file or object storage back end.
  • They seek a consistent storage consumption and management experience across hybrid and multiple clouds. Once they learn to implement and manage CNS from Red Hat in one environment, it’s repeatable in all other environments. They can’t use their storage array in the public cloud.

Using CNS from Red Hat is a win for OpenShift ops teams. They can get started with a state-of-the-art storage back end for OpenShift apps and infrastructure without needing to acquire new infrastructure for OpenShift Storage right away. They have the option to move to x86-based storage servers during the following budget cycle as they grow their OpenShift footprint and onboard more apps and customers to it. The experience with CNS serves them well if they choose to implement OpenShift and CNS in other environments like AWS, Azure, and Google Cloud.

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.