Anatomy of the S3A filesystem client

Amazon introduced its Simple Storage Service (S3) in March 2006, a watershed moment that ushered in the era of cloud computing services. It wasn’t long before folks started trying to use Amazon S3 in conjunction with Apache Hadoop; in fact, the first attempt, the S3 block filesystem, was completed before the end of that year. This early integration stored data in a way that facilitated fast renames and deletes, but at the expense of not being able to access data that had been written to S3 directly. Accessing data written by other applications was highly desirable, and by 2008 S3N, the S3 Native Filesystem, was merged into Apache Hadoop. The following year, Amazon introduced Elastic MapReduce, which included Amazon’s own closed-source S3 filesystem client.

Netflix was an early adopter of Elastic MapReduce, which it used to analyze data to improve streaming quality. This workload was one of three detailed in Amazon’s press release announcing Netflix’s intent to migrate a variety of applications to Amazon. Fast forward to 2013 and “any data set worth retaining” was stored in S3. To the best of my knowledge, this was the first public reference to a multi-cluster data platform that used S3 for shared storage. With all the chips on the table, Netflix and other cloud heavyweights started seriously thinking about how to better leverage the S3 API and improve S3 client performance. This led to the development of the successor to S3N, named S3A. The development of the S3A filesystem client has proceeded in a series of phases and is still seeing loads of active development from the likes of Netflix, Cloudera, Hortonworks, and a whole host of others. Each phase of development is tracked in the Hadoop JIRA.

Downstream Hadoop distributions have done a terrific job of ensuring that juicy features are expeditiously backported, so even if your vendor’s distribution is based on an older Hadoop version, it’s likely that it is ready to consume data stored in S3.

Working with Ceph

Back in the early days of Ceph, we quickly came to the realization that it would be useful to allow folks to use the S3 API to interact with their Ceph storage infrastructure. By doing so, we’d be able to leverage the might of the Amazon ecosystem, including all the SDKs and tools written to interact with Amazon S3. The Ceph object gateway was conceived for this application and is an essential ingredient in providing private infrastructure operators a means to extend cloud object storage modalities into their data center. The S3 API has an ever-expanding set of calls, and keeping up requires a lot of hard work. To keep us honest, we actively develop a functional test suite affectionately named s3-tests. The test suite is so useful that it’s even been adopted by other storage vendors to ensure their products have the same high level of fidelity with the S3 API. How flattering!

So Ceph and S3 are like two high school buddies, and that’s great. But what’s required to ensure maximum fidelity with Amazon S3?


The Amazon S3 API provides a high-level container for objects which, in S3 parlance, is called a bucket. All objects are stored in exactly one bucket, and each bucket is mapped to a single Amazon S3 region. If you send a GET request for an object that lives in a bucket in another region, you’ll get a 302 redirect to the proper region. PUT requests sent to the wrong region will fail and be provided with the endpoint of the region the bucket calls home. If you have multiple Ceph clusters and want to emulate this behavior, you can do so by having each cluster be a distinct zone, in a distinct zonegroup, but with a common realm. For more details, refer to Ceph multi-site documentation.

If data engineers have a level of understanding about different endpoints available to them, then they can set fs.s3a.endpoint in a properties file such that it directs requests to the correct Ceph cluster or Amazon S3 region.

If you want to be helpful and know that an analytics cluster will only be talking to a single Ceph cluster, and not Amazon S3, then you might set fs.s3a.endpoint in core-site.xml. If you plan on having multiple Ceph clusters, or communicating with a Ceph cluster and the public cloud, then you have a few options. One is to set a default fs.s3a.endpoint in the analytics cluster’s core-site.xml to either an Amazon S3 regional endpoint or a Ceph endpoint. Application owners can still override this default with a properties file.
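As a rough sketch of what such a cluster-wide default looks like on disk, the snippet below renders a dictionary of Hadoop properties into core-site.xml text. The endpoint hostname rgw.example.com is a placeholder, and the helper name render_core_site is invented for this illustration.

```python
import xml.etree.ElementTree as ET


def render_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical Ceph object gateway endpoint, used for illustration only.
core_site = render_core_site({"fs.s3a.endpoint": "rgw.example.com"})
print(core_site)
```

An application owner who needs a different endpoint would supply the same property name with another value in their own properties file.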

With S3A from Apache Hadoop 2.8.1, users can define different S3A properties for different buckets. This is handy for applications that might operate on data sets in one or more Ceph clusters or Amazon S3 regions.
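A minimal sketch of per-bucket configuration follows. It assumes the fs.s3a.bucket.BUCKETNAME.OPTION naming scheme described in the Hadoop S3A documentation; the bucket name ceph-data, the Ceph hostname, and the helper name are placeholders chosen for this example.

```python
def per_bucket_properties(bucket, overrides):
    """Prefix S3A option suffixes so they apply to a single bucket only."""
    return {
        f"fs.s3a.bucket.{bucket}.{option}": value
        for option, value in overrides.items()
    }


# Default endpoint: an Amazon S3 regional endpoint.
properties = {"fs.s3a.endpoint": "s3.us-west-2.amazonaws.com"}

# Override for one bucket that lives on a (hypothetical) Ceph cluster.
properties.update(
    per_bucket_properties("ceph-data", {"endpoint": "rgw.example.com"})
)

for name, value in sorted(properties.items()):
    print(f"{name}={value}")
```

With this in place, requests for s3a://ceph-data/ would go to the Ceph endpoint while every other bucket uses the default.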

Access control

Now that there is a level of understanding around endpoints, we can move on to authentication and access control. Amazon S3 has evolved over the years to provide more flexible control over who has access to what. Coarsely, the evolution can be broken down into three approaches:

  1. S3 ACLs
  2. Bucket policy with S3 users
  3. STS issued temporary credentials

The first approach is table stakes for any storage system that wants to allow applications to use the S3 API to interact with it. Each user has an access key and a secret key, which they use to sign requests. Buckets always belong to a single user, and buckets can be specified as either public or private. Public buckets are the only means of sharing data between users, which is pretty inflexible. Ceph has supported basic S3 ACLs since the Ceph object gateway was introduced, and it’s not uncommon to see this be the only means of providing access control with other systems that tout an S3-compatible API. These access and secret keys can be mapped to the fs.s3a.access.key and fs.s3a.secret.key parameters in core-site.xml, in a properties file submitted with an application, or, as the most secure option, stored in encrypted files using the Hadoop Credentials API.

Ceph doesn’t stop there. With Red Hat Ceph Storage 3.0, based on upstream Luminous, we added support for bucket policy, which allows cross user bucket sharing. This means one user’s private bucket can be shared with another user through bucket policy.

This is all good and well, but sometimes you want to manage access control for groups of users, instead of having to update a bunch of bucket policies whenever a user needs to be removed from a group. Amazon S3 doesn’t yet have its own notion of groups. Instead, Amazon has IAM, which is a means of enforcing role-based access control. IAM supports users, groups, and roles. This means you can create policies for a group, instead of having policy entries for each individual user. Unfortunately, IAM groups cannot be principals in bucket policies. IAM roles are similar to a user, but they do not have credentials. IAM roles delegate to some other authentication provider, and that authentication provider decides if the request is allowed to assume a particular role.

This is where Amazon STS comes in. STS issues temporary authentication tokens, which are mapped to IAM roles, and that mapping is attested by an external identity provider. As such, the external identity provider can attest to whether a particular user can receive a token that allows that user to assume an IAM role. The S3A filesystem client supports the use of these tokens by changing the S3A Credentials Provider and setting the fs.s3a.session.token parameter in addition to fs.s3a.access.key and fs.s3a.secret.key. The upstream community is currently working on STS and IAM support in the Ceph object gateway, which will bring all this goodness to folks interacting with Ceph object stores. If this is important to your organization, we’re interested in getting feedback that helps us prioritize STS actions like AssumeRoleWithSAML and AssumeRoleWithWebIdentity.
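The sketch below collects the S3A settings involved when temporary credentials are used. The three key and token parameters come from the text above; the provider property name and class (fs.s3a.aws.credentials.provider and TemporaryAWSCredentialsProvider) are taken from the Hadoop S3A documentation rather than from this post, so treat them as an assumption to verify against your Hadoop version. The credential values are dummies.

```python
def temporary_credential_properties(access_key, secret_key, session_token):
    """Build S3A properties for STS-issued temporary credentials."""
    for label, value in (
        ("access key", access_key),
        ("secret key", secret_key),
        ("session token", session_token),
    ):
        if not value:
            raise ValueError(f"missing {label}")
    return {
        # Assumed provider class, per the Hadoop S3A documentation.
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.session.token": session_token,
    }


# Dummy values for illustration only.
sts_properties = temporary_credential_properties(
    "EXAMPLEACCESSKEY", "example-secret", "example-session-token"
)
print(sorted(sts_properties))
```

Because the token expires, these values are better passed per job than written into a shared core-site.xml.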

Bucket prefix vs. path

By default, the AWS Java SDK for S3 uses bucket prefix notation when sending requests and, by extension, so does the S3A filesystem client. In this notation, the bucket name is prepended to the hostname of your Ceph object gateway endpoint as a subdomain. To ensure your Ceph storage infrastructure behaves the same way as Amazon S3, you’ll need to configure a few things on the infrastructure side. The first is including the rgw_dns_name parameter in the [rgw] or [global] block of your ceph.conf configuration file. The value of this parameter is the hostname of your gateway endpoint. Then, in order for the client to resolve the bucket subdomain, you’ll also need a wildcard DNS record for that hostname that resolves to your gateway virtual IP address. To support SSL with bucket prefix notation, you’ll need to use a certificate with a wildcard subject alternative name (SAN) wherever it is being used to terminate TLS.

If you’re not on the infrastructure side of things, and you want to consume Ceph object infrastructure where this machinery hasn’t been configured, then you can opt for path style access by enabling the S3A path style access setting. In this configuration, the bucket name is sent as the first component of the request path instead of as a subdomain of the endpoint hostname.

There is a third trick, which might be helpful in unusual scenarios, and that’s to create a bucket with uppercase characters. Because DNS isn’t case sensitive, the SDK automatically switches over to path style access.
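To make the difference between the two notations concrete, here is a small illustration of how the request URL changes. It is not the SDK’s actual implementation: the DNS-compatibility check is a simplified approximation, and the hostname rgw.example.com is a placeholder. On the S3A side, the Hadoop documentation names the path style switch fs.s3a.path.style.access.

```python
import re

# Simplified approximation of a DNS-compatible bucket name:
# 3 to 63 characters of lowercase letters, digits, dots, and hyphens.
_DNS_COMPATIBLE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def object_url(endpoint, bucket, key, path_style=False):
    """Build the request URL for an object in either notation."""
    if path_style or not _DNS_COMPATIBLE.match(bucket):
        # Path style: the bucket is the first path component.
        return f"https://{endpoint}/{bucket}/{key}"
    # Bucket prefix notation: the bucket becomes a subdomain.
    return f"https://{bucket}.{endpoint}/{key}"


print(object_url("rgw.example.com", "data", "a.csv"))
print(object_url("rgw.example.com", "data", "a.csv", path_style=True))
print(object_url("rgw.example.com", "Data", "a.csv"))
```

The third call shows the uppercase trick: the bucket name cannot be used as a DNS label, so the sketch falls back to path style.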


We talked a bit about SSL/TLS in the previous section, but only insofar as how to configure things on the infrastructure side. The S3A filesystem client will default to using SSL/TLS. Changing this behavior is done by switching the fs.s3a.connection.ssl.enabled parameter to false. This is useful in scenarios like testing, or when SSL isn’t yet configured for your Ceph storage infrastructure. In a production environment, the adage “dance like nobody’s watching, and encrypt like everyone is” still applies. Check out this blog post for a thorough walkthrough on configuring SSL/TLS for your Ceph object storage system.

In addition to transport encryption, objects can be encrypted. Amazon S3 provides two categories of object encryption: (1) encryption at the client and (2) server-side encryption. The S3A filesystem client does not support using client-side encryption, but it does support all three varieties of server-side encryption. The three server-side encryption options are:

  • SSE-S3: Keys managed internally by Amazon S3
  • SSE-C: Keys managed by the client and passed to Amazon S3 to encrypt/decrypt requests
  • SSE-KMS: Keys managed through Amazon’s Key Management Service (KMS)

With the advent of Red Hat Ceph Storage 3.0, we support the SSE-C flavor of server-side encryption. Configuring the S3A filesystem client to use it is a simple affair, involving only two parameters: (1) fs.s3a.server-side-encryption-algorithm should be set to SSE-C and (2) the value of fs.s3a.server-side-encryption.key should be set to your encryption key. I suspect most folks will want to do this by way of a properties file.
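A minimal sketch of preparing those two parameters is shown below. It assumes, following the Hadoop S3A documentation, that the key property holds a Base64-encoded 256-bit key; the helper name is invented here, and the randomly generated key is for demonstration only (lose the key and the objects encrypted with it cannot be read).

```python
import base64
import os


def sse_c_properties(raw_key):
    """Build the two S3A properties for SSE-C from a raw 32-byte key."""
    if len(raw_key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    return {
        "fs.s3a.server-side-encryption-algorithm": "SSE-C",
        # Assumption: S3A expects the key Base64-encoded.
        "fs.s3a.server-side-encryption.key":
            base64.b64encode(raw_key).decode("ascii"),
    }


# Demonstration key only; in practice, generate once and store securely.
sse_properties = sse_c_properties(os.urandom(32))
print(sse_properties["fs.s3a.server-side-encryption-algorithm"])
```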

Upstream Ceph also has support for SSE-KMS, which is intended to be integrated with OpenStack Barbican for key management. I imagine it’s only a matter of time before this is also supported in Red Hat Ceph Storage.

Fast upload

There are two main ways of performing writes with the S3A filesystem client. The default mode buffers uploads to disk before sending a request to Amazon S3, so it’s important to make sure that the directory you are buffering to is large and fast. I prefer to specify a directory backed by a local SSD. Even with fast media, this can be expensive from an IO perspective, so an alternative is to use in-memory buffers. The S3A filesystem client offers two in-memory buffer options: (1) one using arrays and (2) one using byte buffers. You have to be careful when using memory buffers, because it’s easy to run out of memory if you don’t properly size both your JVM and YARN containers. If the available memory permits you to use in-memory buffers, they can be much faster than their disk-based brethren.
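The buffering choices above can be sketched as a small property builder. The property names (fs.s3a.fast.upload, fs.s3a.fast.upload.buffer, fs.s3a.buffer.dir) and the values disk, array, and bytebuffer are taken from the Hadoop S3A documentation, not from this post, and the buffer directory path is a placeholder.

```python
BUFFER_MODES = ("disk", "array", "bytebuffer")


def fast_upload_properties(buffer_mode, buffer_dir=None):
    """Build S3A upload buffering properties for the chosen mode."""
    if buffer_mode not in BUFFER_MODES:
        raise ValueError(f"buffer_mode must be one of {BUFFER_MODES}")
    properties = {
        "fs.s3a.fast.upload": "true",
        "fs.s3a.fast.upload.buffer": buffer_mode,
    }
    # A buffer directory only matters when buffering to disk.
    if buffer_mode == "disk" and buffer_dir:
        properties["fs.s3a.buffer.dir"] = buffer_dir
    return properties


# Hypothetical SSD-backed directory for disk buffering.
disk_properties = fast_upload_properties("disk", "/mnt/ssd/s3a-buffer")
memory_properties = fast_upload_properties("bytebuffer")
print(disk_properties)
print(memory_properties)
```

Whichever in-memory mode is chosen, the JVM heap (or off-heap allowance) and the YARN container size need to account for the buffered blocks.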

Giving it a try

There’s an easy way to learn how to use the S3A filesystem client with Ceph, and this section will walk you through setting up a development environment using Minishift. If you’ve never used Minishift before, it’s relatively painless to set up. The guides for installation on Mac OSX, Windows, and Linux can be found here.

Once you’ve installed Minishift on your local system, drop into a terminal and fire it up.

minishift start

Now that we have a Minishift environment running, we’re going to get Ceph Nano running so we can use it as a S3A endpoint. To get started, you’ll need to download a couple of YAML files: ceph-nano.yml and ceph-rgw-keys.yml. Once those files are downloaded to your working directory, we can use the oc command line utility to deploy Ceph Nano:

oc --as system:admin adm policy add-scc-to-user anyuid \
oc create -f ceph-rgw-keys.yml
oc create -f ceph-nano.yml
oc expose pod ceph-nano-0 --type=NodePort

You should have a Ceph Nano service running at this point and may be wondering how you’re going to interact with it. Jupyter Notebooks have become wildly popular in the analytics and data-science communities because of their ability to create a single artifact, the notebook, which contains both code and documentation. As if that wasn’t enough, you can interact with them from the comfort of your web browser. Here at Red Hat, we’ve been fostering an awesome community called They’re hard at work empowering intelligent application development on OpenShift. They’ve provided a base notebook application that I used as a starting point for playing with S3A from Jupyter, because its container image is neatly bundled with Spark and PySpark libraries. You can get Jupyter up and running in your Minishift environment with just a few commands:

oc new-app \
  -e RGW_API_ENDPOINT=$(minishift openshift service ceph-nano-0 --url) \

oc env --from=secret/ceph-rgw-keys dc/ceph-notebook
oc expose svc/ceph-notebook
oc status

The “oc status” command will provide a http://ceph-notebook-myproject.$(IP) URL. Load that into your browser and follow along with your browser!
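Inside the notebook, the S3A settings discussed earlier have to reach Spark. The sketch below builds the corresponding Spark options using the spark.hadoop. prefix that Spark uses to pass properties through to Hadoop. RGW_API_ENDPOINT is the variable set in the commands above; the names RGW_ACCESS_KEY and RGW_SECRET_KEY and all default values are assumptions made for this example, since the contents of the ceph-rgw-keys secret are not shown here.

```python
import os


def s3a_spark_options(endpoint, access_key, secret_key):
    """Map S3A settings onto Spark's spark.hadoop.* pass-through options."""
    use_ssl = endpoint.startswith("https://")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # No wildcard DNS in a local test setup, so use path style access.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled":
            "true" if use_ssl else "false",
    }


# Hypothetical variable names and defaults, apart from RGW_API_ENDPOINT.
options = s3a_spark_options(
    os.environ.get("RGW_API_ENDPOINT", "http://rgw.example.com:8000"),
    os.environ.get("RGW_ACCESS_KEY", "example-access-key"),
    os.environ.get("RGW_SECRET_KEY", "example-secret-key"),
)
for name in sorted(options):
    print(name)
```

Each pair can then be handed to SparkSession.builder.config(name, value) before the session is created.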

HTTPS-ization of Ceph object storage public endpoint

Hypertext Transfer Protocol Secure (HTTPS) is the secure version of HTTP; it encrypts communication between the user and the server. HTTPS guards against man-in-the-middle attacks by relying on the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols to establish an encrypted connection that shuttles data securely between a client and a server.

This blog post takes you step by step through the process of adding SSL/TLS security to Ceph object storage endpoints. Ceph is a scalable, open-source object storage solution that provides Amazon S3 and SWIFT compatible APIs you can use to build your own public or private cloud object storage solution. Learn more about Ceph at Red Hat Ceph Storage.

Setting up domain record sets

Let’s begin by setting up a domain name and configuring its record sets. If you already have a domain name you’d like to use for the Ceph endpoint, great. If not, you can purchase one from any domain registrar. In this example, we’ll use the domain name

First, we need to add record sets to the domain such that the domain name can resolve into the IP address of the Ceph RADOS Gateway or the load balancer (LB) that is fronting your Ceph RADOS Gateways. To do this, you need to log in to your domain registrar dashboard and create two record sets:

  1. An A-type record set with the domain name and the IPv4 address of your Internet-accessible LB or Ceph RGW host. (If you’re using IPv6, choose the AAAA type.)
  2. A CNAME-type record set with the wildcard subdomain name, pointing to your Internet-accessible LB or Ceph RGW host.

Note: The choice of a wildcard subdomain name is important. We want all subdomains to resolve to the same IP address, because S3 treats the subdomain prefix as the bucket name.

Domain record set additions are highlighted in the following screenshot:

DNS changes usually take a few tens of minutes to propagate. Once the changes are synced, we should be able to ping the domain name, as well as any subdomain, from anywhere on the Internet, provided ICMP ingress traffic is allowed on the host.

karasing-OSX:~$ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=40 time=1095.231 ms
64 bytes from icmp_seq=1 ttl=40 time=1099.570 ms
64 bytes from icmp_seq=2 ttl=40 time=1199.266 ms
64 bytes from icmp_seq=2 ttl=40 time=1199.266 ms
--- ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1095.231/1131.356/1199.266/48.053 ms
karasing-OSX:~$ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=40 time=1491.105 ms
64 bytes from icmp_seq=1 ttl=40 time=1262.021 ms
64 bytes from icmp_seq=2 ttl=40 time=1205.943 ms
64 bytes from icmp_seq=2 ttl=40 time=1205.943 ms
--- ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1205.943/1319.690/1491.105/123.352 ms

SSL certificate: Installation and setup

An SSL certificate is a set of data files that binds together an organization’s identity, domain name, IP address, and cryptographic keys. Once these SSL certificates are installed on the host server, a secure connection can be established between the host server and the client machine. In a client’s Internet browser, a green padlock will appear next to the URL as a visual cue to users that traffic is protected.

Some organizations might desire a more advanced certificate that requires additional validation. These SSL certificates must be purchased from a trusted Certificate Authority (CA). For the sake of demonstration in this example, we’ll use Let’s Encrypt, a certificate authority that provides free X.509 certificates for Transport Layer Security (TLS) encryption. (Credits: Wikipedia, and thanks to Let’s Encrypt for the free service.)

Next, we’ll install epel-release and certbot, a CLI tool for requesting SSL certificates, from Let’s Encrypt CA.

yum install -y
yum install -y certbot

Request an SSL certificate using the certbot CLI client:

certbot certonly --manual -d -d * --agree-tos \
  --manual-public-ip-logging-ok --preferred-challenges dns-01 --server

Note: The first -d option in the certbot CLI represents the base domain; the subsequent -d options represent subdomains.

Important: Make sure you are using a wildcard (*) for the subdomain, because we are requesting a wildcard subdomain SSL certificate from Let’s Encrypt. If a certificate is issued only for the base domain, it will not be compatible with subdomain prefix notation.
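The reason a base-domain certificate is not enough can be illustrated with a simplified hostname matcher. This is a sketch of the usual wildcard rule (the asterisk stands for exactly one leftmost label), not a complete implementation of certificate name validation, and the domain objects.example.com is a placeholder.

```python
def certificate_name_matches(pattern, hostname):
    """Return True if hostname matches a certificate name.

    Simplified rule: '*' is only honored as the entire leftmost label
    and stands for exactly one label.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


bucket_host = "mybucket.objects.example.com"  # hypothetical bucket hostname
print(certificate_name_matches("objects.example.com", bucket_host))
print(certificate_name_matches("*.objects.example.com", bucket_host))
print(certificate_name_matches("*.objects.example.com", "objects.example.com"))
```

The last call is why both the base domain and the wildcard are requested: the wildcard name alone does not cover the base domain itself.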

The following snippet shows the output of certbot CLI command. The DNS challenge method will generate two DNS TXT records, which must be added as a TXT record set for your domain (from the domain registrar dashboard):

[root@ceph-admin ~]# certbot certonly --manual -d -d * --agree-tos
--manual-public-ip-logging-ok --preferred-challenges dns-01 --server
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1):
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for
dns-01 challenge for

Please deploy a DNS TXT record under the name with the following value:

BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4   <== This Value

Before continuing, verify the record is deployed.

Press Enter to continue

Please deploy a DNS TXT record under the name with the following value:

O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ    <== This Value

Before continuing, verify the record is deployed.


Keep the certbot CLI command running, and open your domain registrar dashboard. Deploy a new DNS TXT record under the name, and enter both these values (marked with <== in the preceding output) in two separate lines inside double quotes. The following screenshot shows how:

Keep the certbot CLI command running and, once you’ve successfully added the DNS TXT records to your domain record set, open another terminal. Now, we’ll verify that the DNS TXT records have been applied to your domain by running host -t txt against the record name given in the certbot output. You should see the same DNS TXT values that certbot printed. If you do not, wait a few minutes for DNS synchronization to occur.

karasing-OSX:~$ host -t txt descriptive text
"BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4" descriptive text

Once you’ve verified DNS TXT records are applied to your domain, return to certbot CLI and press Enter to continue. You will be notified that the system is waiting for verification and cleaning up challenges. You should then see the following message:

Congratulations! Your certificate and chain have been saved at:
Your key file has been saved at:
Your cert will expire on 2018-10-07. To obtain a new or tweaked
version of this certificate in the future, simply run certbot
again. To non-interactively renew *all* of your certificates, run
"certbot renew"

If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt:
Donating to EFF:          

[root@ceph-admin ~]#

At this point, SSL certificates have been issued to your domain/host, so now we must verify them.

[root@ceph-admin ~]# ls -l /etc/letsencrypt/live/
total 4
lrwxrwxrwx 1 root root  35 Jul 9 19:06 cert.pem -> ../../archive/
lrwxrwxrwx 1 root root  36 Jul 9 19:06 chain.pem -> ../../archive/
lrwxrwxrwx 1 root root  40 Jul 9 19:06 fullchain.pem -> ../../archive/
lrwxrwxrwx 1 root root  38 Jul 9 19:06 privkey.pem -> ../../archive/
-rw-r--r-- 1 root root 682 Jul  9 19:06 README
[root@ceph-admin ~]#

Installing LB for object storage service

Many like HAProxy, because it’s easy and does its job well. In this example, we will use HAProxy to perform SSL termination for our domain name (Ceph object storage endpoint).

Note: Starting with Red Hat Ceph Storage 2, Ceph RADOS Gateway natively supports TLS by relying on the OpenSSL library. You can get more information on native SSL/TLS configuration here. In this example, we specifically choose to terminate SSL at the HAProxy level. The advantage is that when we have multiple instances of Ceph RGW, we do not need a separate domain name and SSL certificate for each of them; one domain name with SSL termination at the LB does the job.

Next, we’ll install HAproxy on the same host whose public IP has been bound with your domain name, in this case,

yum install -y haproxy

Next, we’ll create a certs directory.

mkdir -p /etc/haproxy/certs

Next, combine the certificate files fullchain.pem and privkey.pem into a single file for our domain.

DOMAIN='' sudo -E bash -c 'cat /etc/letsencrypt/live/$DOMAIN/fullchain.pem \
/etc/letsencrypt/live/$DOMAIN/privkey.pem > /etc/haproxy/certs/$DOMAIN.pem'

The next step is to change the permission of the certs directory.

chmod -R go-rwx /etc/haproxy/certs
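For readers who prefer to script this step, here is a rough Python equivalent of the concatenate-and-restrict-permissions sequence. The function name is invented for this sketch, and the demonstration runs against placeholder files in a temporary directory rather than the real /etc/letsencrypt and /etc/haproxy paths.

```python
import os
import tempfile
from pathlib import Path


def combine_pem(fullchain, privkey, combined):
    """Concatenate chain and key into one PEM readable only by its owner."""
    data = Path(fullchain).read_bytes() + Path(privkey).read_bytes()
    # Create the file with mode 0600 so the key is never world-readable.
    fd = os.open(combined, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as out:
        out.write(data)
    os.chmod(combined, 0o600)  # also tighten a pre-existing file
    return combined


# Demonstration with placeholder contents instead of real certificates.
with tempfile.TemporaryDirectory() as tmp:
    chain = Path(tmp, "fullchain.pem")
    key = Path(tmp, "privkey.pem")
    chain.write_text("CHAIN\n")
    key.write_text("KEY\n")
    combined_text = Path(combine_pem(chain, key, Path(tmp, "site.pem"))).read_text()

print(combined_text, end="")
```

The order matters in the same way as in the shell command: the certificate chain comes first, followed by the private key.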

Optionally, you can move the haproxy.cfg original file and create a new config file with the following configuration settings:

mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.orig
vim /etc/haproxy/haproxy.cfg
global
    log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/
    maxconn     4000
    user        haproxy
    group       haproxy
    tune.ssl.default-dh-param 2048     
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                    global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except
    option                  redispatch
    option httpchk HEAD /
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
frontend www-http
   reqidel                      ^X-Forwarded-For:.*
   reqadd X-Forwarded-Proto:\ http
   default_backend www-backend
   option  forwardfor
frontend www-https
   bind ssl crt /etc/haproxy/certs/
   reqadd X-Forwarded-Proto:\ https
   acl letsencrypt-acl path_beg /.well-known/acme-challenge/
   use_backend letsencrypt-backend if letsencrypt-acl
   default_backend www-backend
backend www-backend
   redirect scheme https if !{ ssl_fc }
   server ceph-admin check inter 2000 rise 2 fall 5
backend letsencrypt-backend
   server letsencrypt

As you may have noted, we’re using a couple of non-default parameters in the haproxy config file, such as:

  • tune.ssl.default-dh-param is required to provide OpenSSL the necessary parameters for the SSL/TLS handshake.
  • frontend www-http binds haproxy to port 80 of the local machine and redirects traffic to the default backend www-backend. If a user uses HTTP protocol in the request, it should redirect to HTTPS.
  • frontend www-https binds haproxy to port 443 of the local machine. It also redirects traffic to the default backend www-backend and uses the SSL certificate path to encrypt/terminate the traffic.
  • frontend www-https also uses letsencrypt-backend if you want to auto-renew the SSL certificate from Let’s Encrypt CA.
  • backend www-backend simply forwards all the SSL-terminated traffic to the Ceph RGW node. For HA and performance, you should run multiple Ceph RGW instances and add their IPs to the same backend section so that HAProxy can load balance among them.
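When several Ceph RGW instances sit behind the same backend, the server lines become repetitive, so they can be generated. The sketch below reuses the health-check options from the configuration above; the server names, the addresses (taken from the 192.0.2.0/24 documentation range), and the helper name are placeholders, and port 8081 is the RGW port used in this walkthrough.

```python
def backend_section(name, servers, port):
    """Render an HAProxy backend with one server line per RGW instance."""
    lines = [
        f"backend {name}",
        "   redirect scheme https if !{ ssl_fc }",
    ]
    for label, address in servers:
        lines.append(
            f"   server {label} {address}:{port} check inter 2000 rise 2 fall 5"
        )
    return "\n".join(lines)


# Hypothetical RGW instances on documentation-only addresses.
rgw_instances = [("rgw-1", "192.0.2.11"), ("rgw-2", "192.0.2.12")]
section = backend_section("www-backend", rgw_instances, 8081)
print(section)
```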

Finally, we start HAProxy and verify it’s listening on ports 80 and 443:

systemctl start haproxy
systemctl status haproxy ; netstat -plunt | grep -i haproxy

Configuring Ceph RGW

Up to this point, we’ve configured the domain name, set the required record sets, generated the SSL certificate, and configured HAProxy to terminate SSL and forward traffic to the Ceph RGW instance. Now we need to configure Ceph RGW to listen on port 8081, as configured in HAProxy. To do so, on the Ceph RGW node, edit /etc/ceph/ceph.conf and update the client.rgw section as shown in the following:

host = ceph-admin
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-admin/keyring
log file = /var/log/ceph/ceph-rgw-ceph-admin.log
rgw frontends = civetweb port= num_threads=512
rgw resolve cname = true
rgw dns name =

Note: rgw resolve cname = true forces rgw to use the DNS CNAME record of the request hostname field (if the hostname is not equal to rgw dns name).

Note: rgw dns name = is the DNS name of the served domain.

Now, we’ll restart the Ceph RGW instance and verify it’s listening on port 8081.

systemctl restart ceph-radosgw@rgw.ceph-admin.service
netstat -plunt | grep -i rados

Accessing Ceph object storage secure endpoint

To test the HTTPS-enabled Ceph object storage URL, execute the following curl command or type in any web browser:


It should yield output like the following:

[student@ceph-admin ~]$ curl
[student@ceph-admin ~]$
[student@ceph-admin ~]$

Let’s try accessing Ceph object storage using S3cmd:

yum install -y s3cmd

Configure the s3cmd CLI by providing config options such as the access and secret keys, and the secure Ceph S3 endpoint in the host and host-bucket parameters.

s3cmd --access_key=S3user1 --secret_key=S3user1key \
  --host-bucket="%(bucket)" --dump-config > /home/student/.s3cfg

Note: By default, s3cmd uses HTTPS connection, so there is no need to explicitly specify that.

Next, we’ll interact with Ceph object storage using s3cmd ls, s3cmd mb commands:

[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
[student@ceph-admin ~]$
[student@ceph-admin ~]$ s3cmd mb s3://secure_bucket
Bucket 's3://secure_bucket/' created
[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
2018-07-09 19:55  s3://secure_bucket
[student@ceph-admin ~]$

Congratulations! You’ve successfully secured your Ceph object storage endpoint using the domain name of your choice and SSL certificates.


As you can see, acquiring and setting up SSL certificates involves some careful configuration, and how quick and easy it is to obtain a certificate depends on your chosen CA. With initiatives like “HTTPS Everywhere,” it’s no longer just web sites hosting deliverable content that must have SSL; API and service endpoints should also offer encrypted transport.

Note: HTTPS is designed to prevent eavesdropping and Man-in-the-Middle-Attacks. Always practice defense in depth. Multiple layers of security are needed to more fully secure your web site/service endpoints.

Why are customers choosing Red Hat’s Container-Native Storage in the public cloud with OpenShift?

By Sayandeb Saha, Director, Product Management, Storage Business Unit

In our last blog post in this series, we talked about how the Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen increased customer adoption in on-premises environments by offering a peaceful coexistence approach with classic storage arrays that are not deeply integrated with OpenShift. In this post, we’ll explore why many customers are deploying our CNS offering in the three big public clouds (AWS, Microsoft Azure, and the Google Cloud Platform) on top of the clouds’ native storage offerings, despite the good integration of Kubernetes with those offerings. Let’s examine some of the problems and constraints these customers run into in a bit more detail and describe how CNS addresses them.

Slow attach/detach, poor availability

The first issue stems from the fact that the native block storage offerings (EBS in AWS, Data Disk in Azure, Persistent Disk in Google Cloud) in the public cloud were designed and engineered to support virtual machine (VM) workloads. In such workloads, attaching and consequently detaching a block device to a machine image/instance is an infrequent occurrence at best, as these workloads are less dynamic compared to Platform-as-a-Service (PaaS) and DevOps workloads, which frequently run on OpenShift powering dynamic build and deploy CI/CD pipelines and other similar workloads and workflows.

Some of our customers found that attach and detach times for these block devices, when directly accessed from OpenShift workloads using the native Kubernetes storage provisioners, were unacceptable because they led to poor startup times for pods (slow attach) and limited or no high availability on a failover, which usually triggers a sequence that includes a detach operation, an attach operation, and a subsequent mount operation.

Each of these operations usually triggers a variety of API calls specific to the public cloud provider. Any or all of these intermediate steps can fail, causing users to lose access to storage persistent volumes (PVs) for their compute pods for an extended period. Overlaying Red Hat’s CNS offering as a storage management fabric to aggregate, pool, and serve out PVs expediently, without worrying about the status of individual cloud-native block storage devices (such as EBS or Azure Data Disk), can provide major relief, because it effectively isolates the lifecycle of cloud-native block storage devices from that of the application pods, which allocate and deallocate PVs dynamically as application teams work on OpenShift.

Block device limits per compute instance

The second issue some of our customers run into is the fact that there is a limit to the number of block devices that one can attach to the machine images or instances in various public cloud environments.

OpenShift supports a maximum of 250 containers per host. The maximum number of block devices that can be attached to a machine instance is far lower (for example, a maximum of 40 EBS devices per EC2 instance). Even though it is unusual to have a 1:1 mapping between containers and storage devices, this low maximum can lead to a lot of unintended behavior, and it also raises the total cost of ownership, because more hosts are needed than would otherwise be necessary.

For example, in a failover scenario during the detach, attach, and mount sequence, the API call to attach might fail because the maximum number of devices is already attached to the EC2 instance where the attempt is being made, which can cause a glitch or outage. Overlaying Red Hat’s CNS offering as a storage management fabric on cloud-based block devices mitigates the impact of hitting the maximum number of devices that can be attached to a machine image or instance, because storage is served out from a pool that is unencumbered by the per-instance device limit. Storage can continue to be served out until the entire pool is exhausted, at which point the pool can be expanded by adding new hosts and devices.

Cross-AZ storage availability

The third issue arises from the fact that cloud block storage devices are usually accessible within a specific Availability Zone (AZ) in AWS or Availability Sets in Azure. AZs are like failure domains in public clouds.

Most customers who deploy OpenShift in the public cloud do so to span more than one AZ for high availability. This is done so that when one AZ dies or goes offline, the OpenShift cluster remains operational. Using block devices constrained to an AZ for providing storage services to OpenShift workloads can defeat the purpose, because then containers must be scheduled within hosts that belong to the same AZ, and customers cannot leverage the full power of Kubernetes orchestration. This configuration could also lead to an outage when an AZ goes offline.

Our customers use CNS to mitigate this problem so that even when there is an AZ failure, a three-way replicated cross-AZ storage service (CNS) is available for containerized applications to avoid downtime. This also enables Kubernetes to schedule pods across AZs (instead of within an AZ), thereby preserving the spirit of the original fault-tolerant OpenShift deployment architecture that spans multiple AZs.

Cost-effective storage consolidation

Storage provided by CNS is efficiently allocated and offers performance with the first gigabyte provisioned, thereby enabling storage consolidation. For example, consider six MySQL database instances, each in need of 25 GiB of storage capacity and up to 1500 IOPS at peak load. With EBS in AWS, one would create six EBS volumes, each with at least 500 GiB capacity out of the gp2 (General Purpose SSD) EBS tier, in order to get 1500 IOPS. The level of performance is tied to provisioned capacity with EBS.

With CNS, one can achieve the same level of performance using only three 500 GiB EBS volumes from the gp2 tier, running GlusterFS on top of them. One would create six 25 GiB volumes and provide storage to many databases with high IOPS performance, provided they don’t all peak at the same time. Doing that, one would halve the EBS cost and still have capacity to spare for other services. Read IOPS performance is likely even higher because, with three-way replication in CNS, reads are distributed across three gp2 EBS volumes of 1500 IOPS each.
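The consolidation arithmetic can be sketched as follows. The ratio of 3 IOPS per provisioned GiB is the gp2 baseline implied by the example (1500 IOPS from 500 GiB); burst credits are ignored, and the sketch assumes each of the three EBS volumes holds one full replica. Variable and function names are illustrative.

```python
GP2_IOPS_PER_GIB = 3  # baseline implied by the example: 1500 IOPS / 500 GiB


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume size (GiB) whose baseline meets target_iops."""
    return -(-target_iops // GP2_IOPS_PER_GIB)  # ceiling division


DATABASES = 6
DB_CAPACITY_GIB = 25
PEAK_IOPS = 1500

volume_gib = gp2_size_for_iops(PEAK_IOPS)

# Direct EBS: one oversized volume per database, sized for IOPS not capacity.
direct_gib = DATABASES * volume_gib

# CNS: three EBS volumes, assumed to hold one replica each (three-way replication).
cns_gib = 3 * volume_gib
usable_gib = cns_gib // 3
spare_gib = usable_gib - DATABASES * DB_CAPACITY_GIB

print(f"direct EBS: {direct_gib} GiB provisioned")
print(f"CNS:        {cns_gib} GiB provisioned, {spare_gib} GiB usable capacity to spare")
```

The provisioned capacity, and therefore the EBS bill, is halved, while the replicated pool still has usable space left after the six 25 GiB database volumes are carved out.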

Check us out for more

As you can see, there’s a good case to be made for using CNS in various public clouds for a multitude of technical reasons our customers care about, besides the fact that Red Hat CNS provides a consistent storage consumption and management experience across hybrid and multi clouds (see the following figure).


Red Hat CNS runs anywhere and everywhere Red Hat OpenShift Container Platform runs.

In addition to the application portability that OpenShift already provides across hybrid and multi clouds, we’re working on multi cloud replication features that would enable CNS to effectively become the data fabric that enables data portability—another good reason to select and stay with CNS. Stay tuned for more information on that!

For hands-on experience now combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.

What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising: we’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze the advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of a data transfer rate of 50 MB/s. To determine the available bisectional bandwidth per host, we’ll divide the cluster-wide bisectional bandwidth by the number of hosts.

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained in contemporary Hadoop nodes with 12 SATA disks and a 10GbE network interface. After we leave the host and arrive at the network bisection, the challenge facing Google engineers is immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone led to the development of the MapReduce locality optimization.
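The speeds-and-feeds arithmetic above can be reproduced with a few lines. This is a sketch using only the cluster specifications quoted from the slides and the 50 MB/s per-disk assumption stated in the text; decimal units (1 Gb/s = 1000 Mb/s) are assumed throughout.

```python
HOSTS = 1800
DISKS_PER_HOST = 2
DISK_MB_PER_S = 50          # generous per-disk assumption from the text
NIC_GBIT_PER_S = 1          # Gigabit Ethernet per machine
BISECTION_GBIT_PER_S = 100  # cluster-wide bisection bandwidth

# Aggregate disk throughput per host, converted from MB/s to Mb/s.
disk_mbit_per_s = DISKS_PER_HOST * DISK_MB_PER_S * 8

# Host network interface throughput in Mb/s.
nic_mbit_per_s = NIC_GBIT_PER_S * 1000

# Each host's share of the bisection bandwidth in Mb/s.
bisection_share_mbit_per_s = BISECTION_GBIT_PER_S * 1000 / HOSTS

# How far the host NIC outruns its share of the bisection.
oversubscription = nic_mbit_per_s / bisection_share_mbit_per_s

print(f"disks per host:       {disk_mbit_per_s} Mb/s")
print(f"NIC per host:         {nic_mbit_per_s} Mb/s")
print(f"bisection per host:   {bisection_share_mbit_per_s:.1f} Mb/s")
print(f"oversubscription:     {oversubscription:.0f}:1")
```

The disks (800 Mb/s) and the NIC (1000 Mb/s) are roughly balanced, while each host's share of the bisection is more than an order of magnitude smaller.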

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.


When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.


Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

One key example is large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that even with these abilities it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components, for which an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest and greatest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload-dedicated clusters, with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.


Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these modalities has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.

Why Spark on Ceph? (Part 3 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we in the Red Hat Storage solutions architecture team wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (this blog post)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series, providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Findings summary

We did Ceph vs. HDFS testing with a variety of workloads (see blog Part 2 of 3 for general workload descriptions). As expected, the price/performance comparison varied based on a number of factors, summarized below.

Clearly, many factors contribute to overall solution price. As storage capacity is frequently a major component of big data solution price, we chose it as a simple proxy for price in our price/performance comparison.

The primary factor affecting storage capacity price in our comparison was the data durability scheme used. With 3x replication data durability, a customer needs to buy 3PB of raw storage capacity to get 1PB of usable capacity. With erasure coding 4:2 data durability, a customer only needs to buy 1.5PB of raw storage capacity to get 1PB of usable capacity. The primary data durability scheme used by HDFS is 3x replication (support for HDFS erasure coding is emerging, but is still experimental in several distributions). Ceph has supported either erasure coding or 3x replication data durability schemes for years. All Spark-on-Ceph early adopters we worked with are using erasure coding for cost efficiency reasons. As such, most of our tests were run with Ceph erasure coded clusters (we chose EC 4:2). We also ran some tests with Ceph 3x replicated clusters to provide apples-to-apples comparison for those tests.
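The raw-versus-usable capacity figures follow directly from the two durability schemes. The sketch below shows the arithmetic; the function names are illustrative, and "4:2" is read as four data chunks plus two coding chunks.

```python
def raw_per_usable_replication(copies: int) -> float:
    """Raw capacity needed per unit of usable capacity with N-way replication."""
    return float(copies)


def raw_per_usable_erasure_coding(data_chunks: int, coding_chunks: int) -> float:
    """Raw capacity needed per unit of usable capacity with k:m erasure coding."""
    return (data_chunks + coding_chunks) / data_chunks


USABLE_PB = 1

replicated_raw_pb = USABLE_PB * raw_per_usable_replication(3)
ec_raw_pb = USABLE_PB * raw_per_usable_erasure_coding(4, 2)

print(f"3x replication: {replicated_raw_pb} PB raw per {USABLE_PB} PB usable")
print(f"EC 4:2:         {ec_raw_pb} PB raw per {USABLE_PB} PB usable")
print(f"EC 4:2 relative capacity price: {ec_raw_pb / replicated_raw_pb:.0%}")
```

This is where the "50% less" storage capacity price used as the price proxy in the comparison comes from.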

Using the proxy for relative price noted above, Figure 1 provides an HDFS v. Ceph price/performance summary for the workloads indicated:

Figure 1: Relative price/performance comparison, based on results from eight different workloads

Figure 1 depicts price/performance comparisons based on eight different workloads. Each of the eight individual workloads was run with both HDFS and Ceph storage back-ends. The storage capacity price of the Ceph solution relative to the HDFS solution is either the same or 50% less. When the workload was run with Ceph 3x replicated clusters, the storage capacity price is shown as the same as HDFS. When the workload was run with Ceph erasure coded 4:2 clusters, the Ceph storage capacity price is shown as 50% less than HDFS. (See the previous discussion on how data durability schemes affect solution price.)

For example, workload 8 had similar performance with either Ceph or HDFS storage, but the Ceph storage capacity price was 50% of the HDFS storage capacity price, as Ceph was running an erasure coded 4:2 cluster. In other examples, workloads 1 and 2 had similar performance with either Ceph or HDFS storage and also had the same storage capacity price (workloads 1 and 2 were run with a Ceph 3x replicated cluster).

Findings details

A few details are provided here for the workloads tested with both Ceph and HDFS storage, as depicted in Figure 1.

  1. This workload was a simple test to compare aggregate read throughput via TestDFSIO. As shown in Figure 2, this workload performed comparably between HDFS and Ceph, when Ceph also used 3x replication. When Ceph used erasure coding 4:2, the workload performed better than either HDFS or Ceph 3x for lower numbers of concurrent clients (<300). With more client concurrency, however, the workload performance on Ceph 4:2 dropped due to spindle contention (a single read with erasure coded 4:2 storage requires 4 disk accesses, vs. a single disk access with 3x replicated storage.)

    Figure 2: TestDFSIO read results
  2. This workload compared the SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2.

    Figure 3: Single-user Spark query set results
  3. This workload compared Impala query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was 57% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  4. This mixed workload featured concurrent execution of a single user running SparkSQL queries (54), 10 users each running Impala queries (54 each), and a data set merge/join job enriching TPC-DS web sales data with synthetic clickstream logs. As illustrated in Figure 1, the aggregate execution time of this mixed workload on Ceph EC4:2 was 48% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  5. This workload was a simple test to compare aggregate write throughput via TestDFSIO. As depicted in Figure 1, this workload performed, on average, 50% slower on Ceph EC4:2 compared to HDFS, across a range of concurrent client/writers. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  6. This workload compared SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2. However, price/performance was nearly comparable when running against Ceph EC4:2, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  7. This workload featured enrichment (merge/join) of TPC-DS web sales data with synthetic clickstream logs, and then writing the updated web sales data. As depicted in Figure 4, this workload was 37% slower on Ceph EC4:2 compared to HDFS. However, price/performance was favorable for Ceph, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.

    Figure 4: Data set enrichment (merge/join/update) job results
  8. This workload compared the SparkSQL query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was roughly comparable to that of HDFS, despite requiring only 50% of the storage capacity costs. Price/performance for this workload thus favors Ceph by 2x. For more insight into this workload performance, see Figure 5. In this box-and-whisker plot, each dot reflects a single SparkSQL query execution time. As each of the 10 users concurrently executes 54 queries, there are 540 dots per series. The three series shown are Ceph EC4:2 (green), Ceph 3x (red), and HDFS 3x (blue). The Ceph EC4:2 box shows comparable median execution times to HDFS 3x, and shows more consistent query times in the middle 2 quartiles.

    Figure 5: Multi-user Spark query set results

Bonus results section: 24-hour ingest

One of our prospective Spark-on-Ceph customers recently asked us to illustrate Ceph cluster sustained ingest rate over a 24-hour time period. For these tests, we used variations of the lab as described in blog 2 of 3. As noted in Figure 6, we measured a raw ingest rate of approximately 1.3 PiB per day into a Ceph EC4:2 cluster configured with 700 HDD data drives (Ceph OSDs).
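To put the daily figure in per-second and per-drive terms, here is a small sketch of the unit conversion. It uses only the two numbers quoted above (1.3 PiB per day, 700 HDD data drives) and assumes ingest is spread evenly across the day and across drives; it does not add any erasure coding overhead, since the text does not say whether the quoted rate is measured before or after encoding.

```python
INGEST_PIB_PER_DAY = 1.3
DATA_DRIVES = 700
SECONDS_PER_DAY = 24 * 60 * 60

# 1 PiB = 1024 TiB = 1024 * 1024 GiB
ingest_gib_per_s = INGEST_PIB_PER_DAY * 1024 * 1024 / SECONDS_PER_DAY

# Even spread across all data drives, expressed in MiB/s per drive.
per_drive_mib_per_s = ingest_gib_per_s * 1024 / DATA_DRIVES

print(f"cluster-wide: {ingest_gib_per_s:.1f} GiB/s sustained")
print(f"per drive:    {per_drive_mib_per_s:.1f} MiB/s sustained")
```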

Figure 6: Daily data ingest rate into Ceph clusters of various sizes

Concluding observations

In conclusion, below is our formative cost/benefit analysis of the above results summarizing this blog series.

  • Benefits, Spark-on-Ceph vs. Spark on traditional HDFS:
    1. Reduce CapEx by reducing duplication: Reduce PBs of redundant storage capacity purchased to store duplicate data sets in HDFS silos, when multiple analytics clusters need access to the same data sets.
    2. Reduce OpEx/risk: Eliminate costs of scripting/scheduling data set copies between HDFS silos, and reduce risk-of-human-error when attempting to maintain consistency between these duplicate data sets on HDFS silos, when multiple analytics clusters need access to the same data sets.
    3. Accelerate insight from new data science clusters: Reduce time-to-insight when spinning-up new data science clusters by analyzing data in-situ within a shared data repository, as opposed to hydrating (copying data into) a new cluster before beginning analysis.
    4. Satisfy different tool/version needs of different data teams: While sharing data sets between teams, enable users within each cluster to choose the Spark/Hadoop tool sets and versions appropriate to their jobs, without disrupting users from other teams requiring different tools/versions.
    5. Right-size CapEx infrastructure costs: Reduce over-provisioning of either compute or storage common with provisioning traditional HDFS clusters, which grow by adding generic nodes (regardless if only more CPU cores or storage capacity is needed), by right-sizing compute needs (vCPU/RAM) independently from storage capacity needs (throughput/TB).
    6. Reduce CapEx by improving data durability efficiency: Reduce CapEx of storage capacity purchased by up to 50% due to Ceph erasure coding efficiency vs. HDFS default 3x replication.
  • Costs, Spark-on-Ceph vs. Spark on traditional HDFS:

    1. Query performance: Performance of Spark/Impala query jobs ranged from 0%-131% longer execution times (single-user and multi-user concurrency tests).
    2. Write-job performance: Performance of write-oriented jobs (loading, transformation, enrichment) ranged from 37%-200%+ longer execution times. [Note: Significant improvements in write-job performance are expected when downstream distributions adopt the following upstream enhancements to the Hadoop S3A client: HADOOP-13600, HADOOP-13786, HADOOP-12891.]
    3. Mixed-workload Performance: Performance of multiple query and enrichment jobs concurrently executed resulted in 90% longer execution times.

For more details (and a hands-on chance to kick the tires of this solution yourself), stay tuned for the architect-level blog series in this same Red Hat Storage blog location. Thanks for reading.

Why Spark on Ceph? (Part 2 of 3)


A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store. In other words, target data locations for their analytics jobs are something like “s3a://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”. Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible via the Hadoop S3A client.
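As a rough illustration of what pointing the S3A client at a Ceph object store involves, the sketch below builds the Spark properties one would typically set. The `fs.s3a.*` keys are standard Hadoop S3A settings (prefixed with `spark.hadoop.` so Spark passes them through to Hadoop); the endpoint URL, credentials, and helper names are hypothetical placeholders, and a real deployment will need tuning beyond this.

```python
def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = False) -> dict:
    """Build Spark properties that direct the Hadoop S3A client at an
    S3-compatible endpoint such as a Ceph object gateway."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Address buckets as endpoint/bucket rather than bucket.endpoint.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def to_s3a_uri(bucket: str, key: str) -> str:
    """Form the URI an analytics job would use in place of an hdfs:// path."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


# Hypothetical gateway address and placeholder credentials.
conf = s3a_spark_conf("http://rgw.example.com:8080", "ACCESS_KEY", "SECRET_KEY")
for name, value in sorted(conf.items()):
    print(f"--conf {name}={value}")
print(to_s3a_uri("bucket-name", "path-to-file-in-bucket"))
```

The printed `--conf` pairs could be passed to `spark-submit`, or the same dictionary could be applied when building a Spark session; jobs then read and write `s3a://` paths where they would otherwise use `hdfs://` paths.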

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs that were representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web. Its schema has 10s of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for our tests. With partners in industry, we also supplemented the TPC-DS data set with simulated click-stream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain text data into Parquet or ORC columnar, compressed formats)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes. 1000s of the analytics jobs described above completed successfully. SparkSQL, Hive, MapReduce, and Impala jobs all used the S3A client to read and write data directly to a shared Ceph object store. The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line – what was the performance compared to traditional HDFS? For the answer, continue to Part 3 of this series….

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series. In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios, and we recorded this podcast with Intel, in which we talk about our focus on making Spark, Hadoop, and Ceph work better on Intel hardware and helping enterprises scale efficiently.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with their own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But, the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is cost prohibitive to many companies (both CapEx and OpEx).
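To get a feel for the scale of that duplication, here is a back-of-the-envelope sketch. The 2 PB data set size is a hypothetical value chosen only for illustration; the 3x replication factor is the HDFS default, and the shared-store comparison assumes a single erasure coded 4:2 copy, the scheme used in the tests described in this series.

```python
HDFS_REPLICATION = 3    # HDFS default durability scheme
EC_DATA, EC_CODING = 4, 2


def raw_pb_hdfs_silos(dataset_pb: float, silo_count: int) -> float:
    """Raw capacity consumed when every HDFS silo keeps its own replicated copy."""
    return dataset_pb * HDFS_REPLICATION * silo_count


def raw_pb_shared_store(dataset_pb: float) -> float:
    """Raw capacity consumed by one shared, erasure coded copy."""
    return dataset_pb * (EC_DATA + EC_CODING) / EC_DATA


DATASET_PB = 2  # hypothetical data set size

for silos in (5, 10, 20):
    print(f"{silos:>2} HDFS silos: {raw_pb_hdfs_silos(DATASET_PB, silos):>5} PB raw "
          f"vs shared store: {raw_pb_shared_store(DATASET_PB)} PB raw")
```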

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems


Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client. In AWS, you can spin-up many analytics clusters on EC2 instances, and share data sets between them on Amazon S3 (e.g. see Cloudera CDH support for Amazon S3). No more lengthy delays hydrating HDFS storage after spinning-up new clusters, or de-staging HDFS data upon cluster termination. With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.  

Bottom line: more and more data scientists and analysts are accustomed to spinning up analytics clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-stage cycles, and expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage. It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

To learn more, continue to the next post in this series, “Why Spark on Ceph? (Part 2 of 3)”: Will mainstream analytics jobs run directly against a Ceph object store?

Leverage your existing storage investments with container-native storage

By Sayandeb Saha, Director, Product Management

The Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen wide customer adoption in the past year or so. Customers are deploying it in a wide variety of environments that include bare metal, virtualized, and private and public clouds. It mimics the diverse spread of environments in which OpenShift itself gets deployed—which is also CNS’s key strength (i.e., being able to back OpenShift wherever it runs—see the following graphic).

During the past year of customer adoption of CNS, we’ve observed some key trends that are unique to OpenShift/Kubernetes storage and that we’ll highlight in a series of blogs. This blog series will also include business and technical solutions that have worked for our customers.

In this blog post, we examine a trend where customers have adopted CNS as a storage management fabric that sits in between the OpenShift Container Platform and their classic storage gear. This particular adoption pattern continues to have a really high uptake, and there are sound business and technical reasons for doing this, which we’ll explore here.

First the Solution (The What): We’ve seen a lot of customers deploying CNS to serve out storage from their existing storage arrays/SANs and other traditional storage, as illustrated in the following graphic. In this scenario, block devices from existing storage arrays are served out with our CNS software running in VMs or containers/pods to OpenShift. In this case, the storage for the VMs that runs OpenShift is still served by the arrays.

Now the Why: Initially, it seemed backward as to why customers would be doing this; after all, software-defined storage solutions like CNS are meant to run on x86 bare metal (on premise) or in the public cloud, but further investigation revealed some interesting discoveries.

While OpenShift users and ops teams consume infrastructure, they typically do not manage infrastructure. In on-premise environments, OpenShift ops teams are highly dependent on other infrastructure teams for virtualization, storage, and operating systems for the infrastructure on which they run OpenShift. Similarly, in public clouds they consume the native compute and storage infrastructure available in these clouds.

As a consequence, they are highly dependent on storage infrastructure that is already in place. Typically, it’s very difficult to justify purchasing storage servers for a new use case (OpenShift storage in this case) when storage was already procured from a traditional storage vendor a year or more ago. The issue is that this traditional storage was neither designed nor intended for use with containers, and the budget for storage has mostly been spent. This has driven OpenShift operations teams to adopt CNS effectively as a storage management fabric that sits between their OpenShift Container Platform deployment and their existing storage array. CNS is built on Red Hat Gluster Storage, whose inherent flexibility enables it to aggregate and pool block devices that are attached to a VM and serve them out to OpenShift workloads. OpenShift operations teams can now have the best of both worlds: they can repurpose the storage array that is already in place on premise but actually consume CNS, which operates as a management fabric offering the latest features, functionality, and manageability, with deep integration with the OpenShift platform.

In addition to business reasons, there are also various technical reasons that these OpenShift operations teams are adopting CNS. These include, but are not limited to:

  • Lack of deep integration of their existing storage arrays with OpenShift Container Platform
  • Even if their traditional storage array has rudimentary integration with OpenShift, very likely it has limited feature support (such as no dynamic provisioning), which renders it unusable with many OpenShift workloads
  • The roadmap of their storage array vendor may not match their current (or future) OpenShift/Kubernetes storage feature needs; for example, a Persistent Volume (PV) resize feature may not be available
  • Needing a fully featured OpenShift Storage solution for OpenShift workloads as well as the OpenShift infrastructure itself. Many existing storage platforms can support one or the other, but not both. For instance, a storage array serving out Fibre Channel LUNs (plain block storage) can’t back an OpenShift registry, which needs shared storage access that is usually provided by a file or object storage back end.
  • They seek a consistent storage consumption and management experience across hybrid and multiple clouds. Once they learn to implement and manage CNS from Red Hat in one environment, it’s repeatable in all other environments. They can’t use their storage array in the public cloud.
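
To make the first few points concrete, here is a minimal sketch in Python of the two Kubernetes objects involved in dynamic provisioning and shared access. It assumes the in-tree `kubernetes.io/glusterfs` provisioner that shipped with Kubernetes releases of this era. The class name, claim name, size, and heketi URL are hypothetical placeholders, not values from any customer deployment, and a real CNS installation would typically set more parameters than shown here.

```python
import json


def storage_class(name, rest_url):
    """Build a StorageClass that provisions Gluster volumes on demand.

    `rest_url` is the address of the volume management service
    (hypothetical in this sketch).
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/glusterfs",
        "parameters": {"resturl": rest_url},
        # Lets claims against this class be grown later (PV resize).
        "allowVolumeExpansion": True,
    }


def shared_claim(name, storage_class_name, size):
    """Build a PersistentVolumeClaim that many pods can mount read-write.

    ReadWriteMany is the kind of shared access a registry needs and that a
    plain block LUN cannot offer.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All names and the URL below are illustrative assumptions.
    sc = storage_class("cns-gluster", "http://heketi.example.com:8080")
    pvc = shared_claim("registry-storage", "cns-gluster", "20Gi")
    print(json.dumps([sc, pvc], indent=2))
```

The printed JSON could be saved to a file and handed to `oc create -f` or `kubectl create -f`. The point of the sketch is only that the claim names a class rather than a pre-created volume, which is what dynamic provisioning means in practice.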

Using CNS from Red Hat is a win for OpenShift ops teams. They can get started with a state-of-the-art storage back end for OpenShift apps and infrastructure without needing to acquire new infrastructure for OpenShift Storage right away. They have the option to move to x86-based storage servers during the following budget cycle as they grow their OpenShift footprint and onboard more apps and customers to it. The experience with CNS serves them well if they choose to implement OpenShift and CNS in other environments like AWS, Azure, and Google Cloud.

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.

Red Hat Summit 2018—It’s a wrap!

By Will McGrath, Product Marketing, Red Hat Storage

Wowzer! Red Hat Summit 2018 was a blur of activity. The quality and quantity of conversations with customers, partners, industry analysts, community members, and Red Hatters were unbelievable. This event has grown steadily over the past few years to more than 7,000 registrants this year. From a Storage perspective, this was our largest presence ever in terms of content and customer interaction.

Key announcements

For Storage, we made two key announcements during Red Hat Summit. The first was around Red Hat Storage One, a pre-configured offering engineered with our server partners, announced last week. If you didn’t catch Dustin Black’s blog post that goes into the detail of the solution, check it out.

The second announcement, which occurred this week, highlighted the momentum in building a storage offering that provides a seamless developer experience and unified orchestration for containers. There are now more than 150 customers worldwide that have adopted Red Hat’s container-native storage solutions to enable their transition to the hybrid cloud, including Vorwerk and innovation award winner Lufthansa Technik.  

We featured a number of customer success stories, including Massachusetts Open Cloud, which worked with Boston Children’s Hospital to redefine medical-image processing using Red Hat Ceph Storage.

If you’d like to keep up on the containers news, check out our blog post from Tuesday and this week’s news around CoreOS integration into Red Hat OpenShift. You might also like to check out the news around customers deploying OpenShift on Red Hat infrastructure, including OpenStack, through container-based application development and tightly integrated cloud technologies.

Storage expertise on display

On the morning of the first day of Summit, Burr Sutter and team demoed a number of technologies, including Red Hat Storage, to showcase application portability across the open hybrid cloud. This morning, Erin Boyd and team ran some way cool live demos that showed the power of microservices and functions with OpenShift, Storage, OpenWhisk, TensorFlow, and a host of technologies across the hybrid cloud.

Those who had the opportunity to attend any of the 20+ Red Hat Summit storage sessions were able to learn how our Red Hat Gluster Storage and Red Hat Ceph Storage products appeal to both traditional and modern users. The roadmap presentations by both Neil Levine (Ceph) and Sayan Saha (Gluster and container-native storage) were very popular. Sage Weil, the creator of Ceph, gave a standing-room-only talk on the future of storage. Some of these storage sessions will be available on the Red Hat Summit YouTube channel in the coming weeks.

We also had several partners demoing their combined solutions with Red Hat Storage, including Intel, Mellanox, Penguin Computing, QCT, and Supermicro. Commvault had a guest appearance during Sean Murphy’s Red Hat Hyperconverged Infrastructure talk, explaining what led them to decide to include it in their HyperScale Appliances and Software offerings.

This year, we conducted an advanced Ceph users’ group meeting the day before the conference with marquee customers participating in deep-dive discussions with product and community leaders. During the conference, the storage lockers have been a hit. We had great presence on the show floor, including the community booths. Our breakfast was well attended with over a hundred people registered and featured a panel of customers and partners.

Continue the conversation

During his appearance on The Cube by Silicon Angle, Red Hat Storage VP/GM Ranga Rangachari talked about his point of view on “UnStorage.” This idea, triggered by his original blog post on the subject, made quite a few waves at the event. Customers and analysts are responding positively to the idea of a new approach to storage in the age of hybrid cloud, hyperconvergence, and containers. Today is the last day to win prizes by tweeting @RedHatStorage with the hashtag #UnStorage.

If you missed us in San Francisco, we’ll be at OpenStack Summit in Vancouver from May 21-24. Red Hat is a headline sponsor at Booth A19. If you’re attending, come check out our OpenStack and Ceph demo, and check back on our blog page for news from the event. We’ll also be hosting the “Craft Your Cloud” event on Tuesday, May 22, from 6-9 pm at Steamworks in Vancouver. For more information and to register, click here. For more fun and networking opportunities, join the Ceph and RDO communities for a happy hour on May 23 from 6-8 pm at The Portside Pub in Vancouver. For more information and to register for that event, click here.

On to Red Hat Summit 2019

You can check out the videos and keynotes from Red Hat Summit 2018 on demand. Next year, Red Hat Summit is being held in Boston again (it’s been rotating between San Francisco and Boston), so if you couldn’t attend San Francisco this year, we urge you to plan to visit us in Boston next year. We hope you enjoyed our coverage of Red Hat Summit 2018, and hope to see you in 2019.

More accolades for Red Hat Ceph Storage

By Daniel Gilfix, Product Marketing, Red Hat Storage

Once again, an independent analytic news source has confirmed what many of you already know: that Red Hat Ceph Storage stands alone in its commitment to technical excellence for the customers it serves. In the latest IT Brand Pulse survey covering Networking & Storage products, IT professionals from around the world have selected Red Hat Ceph Storage as the “Scale-out Object Storage Software” leader in all categories. This includes price, performance, reliability, service and support, and innovation. The honors follow a pattern of recognition from IT Brand Pulse, which bestowed the leadership tag on Red Hat Ceph Storage in 2017, 2015, and 2014, and named Red Hat the “Service and Support” leader in 2016.

The report documented the results of the independent annual survey, conducted in March 2018, that polled vendors on their perception of excellence in eleven different categories. Red Hat Ceph Storage earned ratings that were visibly head and shoulders above the competition, including more than a 2X differential over Scality and VMware.

Source: IT Brand Pulse

It feels like just yesterday!

This latest third-party validation comes on the heels of Red Hat Ceph Storage being named as a finalist in Storage Magazine and SearchStorage’s 2017 Products of the Year competition in late January 2018. Here, the evaluation was based on Red Hat Ceph Storage v2.3, a release that made great strides in the areas of connectivity and containerization, including an NFS gateway to an S3-compatible object interface and compatibility with the Hadoop S3A plugin.

Red Hat Ceph Storage 3 carries the baton

IT professionals voting in this year’s IT Brand Pulse survey were able to consider newer features in the important Red Hat Ceph Storage 3 release that addressed a series of major customer challenges in object storage and beyond. We delivered full support for file-based access via CephFS, expanded ties to legacy storage environments through iSCSI, pumped fuel into our containerization options with containerized storage daemons (CSDs) for 25% hardware deployment savings, and introduced an easier monitoring interface and additional layers of automation for more self-maintaining deployments.

See you at Red Hat Summit!

Ceph booth at Red Hat Summit 2018

As usual, the real testament to our success is the continued satisfaction of our customer base, the ones who are increasingly choosing Red Hat Ceph Storage for modern use cases like AI and ML, rich media, data lakes, hybrid cloud infrastructure based on OpenStack, and traditional backup and restore.

Ceph user group at Red Hat Summit 2018

We look forward to discussing deployment options and whether Red Hat Ceph Storage might be right for you this week at Red Hat Summit. There’s still so much more to go! Catch us at one of the following sessions in Moscone West:

Today (Wednesday, May 9)

Tomorrow (Thursday, May 10)

Container-native storage from Red Hat is on a roll at Red Hat Summit 2018!

By Steve Bohac, Product Marketing, Red Hat Storage

It’s Red Hat Summit week, with this year’s edition taking place in San Francisco! As always, Red Hat has a plethora of announcements this week.

If you haven’t already heard the news, yesterday we announced substantial customer adoption momentum with container-native storage from Red Hat. Customers such as Lufthansa Technik, Aragonesa de Servicios Telemáticos (AST), Generali Switzerland, IHK-GfI, and Vorwerk (amongst many more) are using Red Hat OpenShift Container Platform for cloud-native applications and are representative of how organizations are seeking out scalable, fully integrated, developer-friendly storage for containers.

Based on Red Hat Gluster Storage, container-native storage from Red Hat offers these organizations scalable, persistent storage for containers across hybrid clouds with increased application portability. Tightly integrated with Red Hat OpenShift Container Platform, container-native storage from Red Hat can be used to persist not only application data but data for logging, metrics, and the container registry. The deep integration with Red Hat OpenShift Container Platform helps developers easily provision and manage elastic storage for applications and offers a single point of support. Customers use container-native storage to persist data for a variety of applications, including SQL and NoSQL databases, CI/CD tools, web serving, and messaging applications.

Organizations using container-native storage from Red Hat can benefit from simplified management, rapid deployment, and a single point of support. The versatility of container-native storage from Red Hat can enable customers to run cloud-native applications in containers, on bare metal, in virtualized environments, or in the public cloud.

For those of you attending Red Hat Summit this week, we know you love breakout sessions to learn more about Red Hat solutions, and we have a bunch covering container-native storage from Red Hat! Don’t forget to get your raffle tickets at each of the storage sessions you attend. Here’s what the lineup of sessions on container-native storage from Red Hat looks like:

(All in Moscone West unless otherwise noted)

Tuesday, May 8

Thursday, May 10

Want to learn more?

For hands-on experience combining OpenShift and container-native storage, check out our test drive, a free, in-browser lab experience that walks you through using both.

Happy Red Hat Summit! Hope to see you this week!




Five ways to experience UnStorage at Red Hat Summit

Welcome to Red Hat Summit 2018 in San Francisco! The Storage team has been hard at work to make this the best possible showcase of technology and customers—and have fun while doing it. This year our presence is built around the theme: UnStorage for the modern enterprise.

What is UnStorage?

Today’s users need their data so accessible, so scalable, so versatile that the old concept of “storing” it seems counterintuitive. Perhaps a better way of describing the needs of the modern enterprise is UnStorage, as outlined in this blog post by Red Hat Storage VP and GM, Ranga Rangachari.

Five ways to experience UnStorage at Red Hat Summit

  1. Content is king: We have 24 sessions packed with storage knowledge, best practices, and success stories. Over 21 Red Hat Storage customers will be featured at the event, including on a panel at our breakfast (open to all attendees) on Wednesday at 7 am at the Marriott Marquis. Learn how some of the most innovative enterprises leverage the power of UnStorage to solve their scale and agility challenges.
  2. Without hardware partners, it’s like clapping with one hand: By definition, the success of software-defined storage hinges on the strength of the hardware ecosystem. Since the storage controller software is only half the solution, it’s important to have deep engineering investment with hardware and component vendors to build rock-solid solutions for customers. With partners like Supermicro, Mellanox, Penguin Computing, Intel, Commvault, and QCT, all featured at the conference, Red Hat Storage enables greater customer choice and openness—a key tenet of UnStorage.
  3. Explore your storage curiosity: UnStorage is all about breaking the rules to make things better. You’ll find a lot of creative ideas that are off the beaten track. Just as UnStorage is ubiquitous—it stretches across private, public, and hybrid cloud boundaries—it’s hard to miss Storage at the conference. You can find storage lockers near the expo entrance where you can drop off backpacks and charge phones while you attend sessions. Or enter to win one of two Star Wars collector edition drones by attending sessions or visiting the booth. Stop by the Storage Launch Pad to play online games, take surveys, and pick up a ton of giveaways, including two golden tickets handed out every day, which will afford you a special set of prizes.
  4. Test drive storage: Kick the tires on UnStorage with one of three test drives for Ceph, Gluster, and OpenShift Ops. As the name suggests, software-defined storage is completely decoupled from hardware, making it easy to test and deploy in the cloud. On the other side of the deployment spectrum, you can also try out the sizing tool for Red Hat Storage One, our single-SKU, pre-configured system announced last week. Stop by one of four Storage pods on the expo floor for demos and conversations with Storage experts.
  5. The proof of the pudding: Stop by Thursday’s keynote with CTO Chris Wright and live demos by Burr Sutter and team featuring container-native storage baked into Red Hat platforms such as OpenShift. UnStorage is as invisible as it is pervasive. Modern enterprises demand that storage be fully integrated into compute platforms for easier management and scale. With container-native storage surpassing 150 customers in the last year alone, learn how customers such as Schiphol, FICO, and Macquarie Bank are building next-generation hybrid clouds with Red Hat technologies.

We’re not all-work-all-the-time at Red Hat Storage, though. Join us at the community happy hour or the hybrid cloud infrastructure party on Tuesday to blow off some steam during a long week. Our social media strategist, Colleen Corrice, is running a way cool Twitter contest: All you have to do is post a picture at a Storage session or booth @RedHatStorage with the hashtag #UnStorage to receive a T-shirt and be included in a drawing for a personal planetarium.

Finally, check out this infographic on all things UnStorage @ Red Hat Summit. Please check back for a daily blog through this week. We hope to see you at Red Hat Summit 2018.