Red Hat OpenShift Container Platform 3.10 with Container-Native Storage 3.9

This post documents how to install Container-Native Storage 3.9 (CNS 3.9) with OpenShift Container Platform 3.10 (OCP 3.10). CNS provides persistent storage both for general application consumption and for the OCP registry.

CNS 3.9 installation with OCP 3.10 advanced installer

CNS 3.9 can be deployed using the openshift-ansible playbooks and specific inventory file options. The first group of hosts, glusterfs, specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes in the general-purpose application cluster, glusterfs. The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for exclusive use by a hosted registry that can scale. With the options and values as they are configured below, this cluster will not offer a StorageClass for file-based PersistentVolumes.

Following is an example of a partial inventory file with selected options concerning deployment of CNS 3.9 for applications and the registry. Options that specify sizes (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors (e.g., node-role.kubernetes.io/infra=true) should be adjusted for your particular deployment needs.

[OSEv3:children]
...
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
...      
# registry
openshift_hosted_registry_storage_kind=glusterfs       
openshift_hosted_registry_storage_volume_size=10Gi   
openshift_hosted_registry_selector="node-role.kubernetes.io/infra=true"

# Container image to use for glusterfs pods
openshift_storage_glusterfs_image="registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.9"
# Container image to use for gluster-block-provisioner pod
openshift_storage_glusterfs_block_image="registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.9"
# Container image to use for heketi pods
openshift_storage_glusterfs_heketi_image="registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.9"
 
# CNS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=false

# CNS storage cluster for OpenShift infrastructure
openshift_storage_glusterfs_registry_namespace=infra-storage  
openshift_storage_glusterfs_registry_storageclass=false       
openshift_storage_glusterfs_registry_block_deploy=false   
openshift_storage_glusterfs_registry_block_host_vol_create=false    
openshift_storage_glusterfs_registry_block_host_vol_size=100  
openshift_storage_glusterfs_registry_block_storageclass=false
openshift_storage_glusterfs_registry_block_storageclass_default=false   

...
[nodes]

ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-infra-node01.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node02.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node03.ocpgluster.com openshift_node_group_name="node-config-infra"

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
ose-infra-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'

CNS 3.9 uninstall

With this release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing a CNS installation that is currently being used by any applications, you should remove those applications before removing CNS, because they will lose access to storage. This includes infrastructure applications like the registry.

If you have the registry using a glusterfs PersistentVolume, remove it with the following command:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If you’re running uninstall.yml because a deployment failed, run the playbook with the following variables to wipe the storage devices for both the glusterfs and glusterfs_registry clusters before attempting the CNS installation again:

ansible-playbook -i <path_to_inventory_file> \
  -e "openshift_storage_glusterfs_wipe=true" \
  -e "openshift_storage_glusterfs_registry_wipe=true" \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

CNS 3.9 post installation for applications and registry

You can add CNS clusters and resources to an existing OCP installation using the following command. The same process can be used if CNS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml

After the new cluster(s) have been created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-hosted/config.yml

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also watch this short video explaining why to use CNS with OpenShift.

Storing tables in Ceph object storage

Introduction

In one of our previous posts, Anatomy of the S3A filesystem client, we showed how Spark can interact with data stored in Ceph object storage in the same fashion it would interact with Amazon S3. This is all well and good if you plan on exclusively writing applications in PySpark or Scala, but wouldn’t it be great to allow anyone who is familiar with SQL to interact with data stored in Ceph?

That’s what SparkSQL is for, and while Spark has the ability to infer schema, it’s a lot easier if the data is already described in a metadata service like the Hive Metastore. The Hive Metastore stores table schema information and statistics on tables and partitions, and it generally aids the query planners of various SQL engines in constructing efficient query plans. So, regardless of whether you’re using good ol’ Hive, SparkSQL, Presto, or Impala, you’ll still be storing and retrieving metadata from a centralized store. Even if your organization has standardized on a single query engine, it still makes sense to have a centralized metadata service, because you’ll likely have distinct workload clusters that will want to share at least some data sets.

Architecture

The Hive Metastore can be housed in a local Apache Derby database for development and experimentation, but a more production-worthy approach would be to use a relational database like MySQL, MariaDB, or Postgres. In the public cloud, a best practice is to store the database tables on a distinct volume to get features like snapshots, and the ability to detach and reattach it to a different instance. In the private cloud, where OpenStack reigns supreme, most folks have turned to Ceph to provide block storage. To learn more about how to leverage Ceph block storage for database workloads, I suggest taking a look at the MySQL reference architecture we authored in conjunction with the open source database experts over at Percona.

While you can configure Hive, Spark, or Presto to interact directly with the MySQL database containing the Metastore, interacting with the Hive Server 2 Thrift service provides better concurrency and an improved security posture. Overall, the general idea is depicted in the following diagram:

Storing tabular data as objects

In a greenfield environment where all data will be stored in the object store, you could simply set hive.metastore.warehouse.dir to an S3A location a la s3a://hive/warehouse. If you haven’t already had a chance to read our Anatomy of the S3A filesystem client post, you should take a look if you’re interested in learning how to configure S3A to interact with a local Ceph cluster instead of Amazon S3. When an S3A location is used as the Metastore warehouse directory, all tables that are created will default to being stored in that particular bucket, under the warehouse pseudo directory. A better approach is to utilize external locations to map databases, tables, or simply partitions to different buckets, perhaps so they can be secured with distinct access controls or other bucket policy features. An example of including an external location specification during table creation might be:

create external table inventory
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
row format delimited fields terminated by '|'
location 's3a://tpc/inventory';

That’s it. When you interact with this inventory table, data is read directly from the object store by way of the S3A filesystem client. One of the cool aspects of this approach is that the location is abstracted away: you can write queries that scan tables with different locations, or even scan a single table with multiple locations. In this fashion, you might keep recent data in partitions with a MySQL external location, and data older than the current week in partitions with external locations that point to object storage. Cool stuff!
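
To illustrate (a sketch assuming a hypothetical sales table partitioned by sale_date, and a hypothetical archive bucket), mapping an older partition to object storage might look like:

alter table sales add partition (sale_date='2018-01-01')
location 's3a://tpc-archive/sales/sale_date=2018-01-01';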

Serialization, partitions, and statistics

We all want to be able to analyze data sets quickly, and there are a number of tools available to help realize this goal. The first is using different serialization formats. In my discussions with customers, the two most common serialization formats are the columnar formats ORC and Parquet. The gist of these formats is that instead of requiring complete scans of entire files, columns of data are separated into stripes, and metadata describing each column’s stripe offsets is stored in a file header or footer. When a query is planned, requests can read in only the stripes that are relevant to that particular query. For more on different serialization formats and their relative performance, I highly suggest this analysis by our friends over at Silicon Valley Data Science.

We have seen great performance with both Parquet and ORC when used in conjunction with a Ceph object store. Parquet tends to be slightly faster, while ORC tends to use slightly less disk space. This small delta might simply be the result of these formats using different compression algorithms by default (Snappy vs. ZLIB). Speaking of compression, it’s really easy to think you’re using it when you are in fact not. Make sure to verify that your tables are actually being compressed. I suggest including the compression specification in table creation statements instead of hoping the engine you are using has the defaults configured the way you want.
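
For example, here’s a sketch of pinning the ORC compression codec in the table definition itself (the inventory_orc table name is hypothetical):

create table inventory_orc
stored as orc
tblproperties ('orc.compress'='ZLIB')
as select * from inventory;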

In addition to serialization formats, it’s important to consider how your tables are partitioned and how many files you have per partition. All S3 API calls are RESTful, which means they are heavier weight than HDFS RPC calls. Having fewer, larger partitions, with fewer files per partition, will definitely translate into higher throughput and reduced query latency. If you already have tables with loads of partitions and many files per partition, it might be worthwhile to consolidate them into larger partitions with fewer files each as you move them into object storage.

With data serialized and partitioned intelligently, queries can be much more efficient, but there is a third way you can help the query planner of your execution engine do its job better: table and column statistics. Table statistics can be collected with ANALYZE TABLE <table> COMPUTE STATISTICS statements, which count the number of rows for a particular table and its partitions. The row counts are stored in the Metastore and can be used by other engines that interrogate the Metastore during query planning.
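
For example, in Hive or SparkSQL, collecting table-level and then column-level statistics for the inventory table might look like this:

analyze table inventory compute statistics;
analyze table inventory compute statistics for columns;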

To the cloud!

Getting cloudy

Many modern enterprises have initiatives underway to modernize their IT infrastructures, and today that means moving workloads to cloud environments, whether they be public or private. On the surface, moving data platforms to a cloud environment shouldn’t be a difficult undertaking: Leverage cloud APIs to provision instances, and use those instances like their bare-metal brethren. For popular analytics workloads, this means running storage services in those instances that are specific to analytics, continuing a siloed approach to storage. This is the equivalent of lift and shift for data-intensive apps, a shortcut approach undertaken by some organizations when migrating an enterprise app to a cloud when they don’t have the luxury of adopting a more contemporary application architecture.

The following data platform principles pertain to moving legacy data platforms to either a public cloud or a private cloud. The private cloud storage platform discussed is Ceph, of course, a popular open-source storage platform used in building private clouds for a variety of data-intensive workloads, including MySQL DBaaS and Spark/Hadoop analytics-as-a-service.

Elasticity

Elasticity is one of the key benefits of cloud infrastructure, and running storage services inside your instances definitely cramps your ability to take advantage of it. For example, let’s say you have an analytics cluster consisting of 100 instances and the resident HDFS cluster has a utilization of 80 percent. Even though you could terminate 10 of those instances and still have sufficient storage space, you would need to rebalance the data, which is often undesirable. You will also be out of luck if, months later, you realize that you’re only using half the compute resources of that cluster. If the infrastructure teams make a new instance flavor available, say with fancy GPUs for your hungry machine-learning applications, it’ll be much harder to start consuming them if it entails the migration of storage services.

This is why companies like Netflix decided to use object storage as the source of truth for their analytics applications, as detailed in my previous post, What about locality? It enables them to expand, and contract, workload-specific clusters as dictated by their resource requirements. Need a quick cluster with lots of nodes to chew through a one-time ETL? No problem. Need transient data labs for data scientists by day, only to relinquish those resources for reporting after hours? Easy peasy.

Data infrastructure

Before departing on the journey to cloudify an organization’s data platform architecture, an important first step is assessing the capabilities of the cloud infrastructure, public or private, you intend to consume, to make sure it provides the features that are most important to data-intensive applications. World-class data infrastructure provides its tenants with a number of fundamental building blocks that lend power and flexibility to the applications that will sit atop them.

Persistent block storage

Not all data is big, and it’s important to provide persistent block storage for data sets that are well served by database workhorses like MySQL and Postgres. An obvious example is the database used by the Hive metastore. With all these workload clusters being provisioned and deprovisioned, it’s often desirable to have them interact with a common metadata service. For more details about how persistent block storage fits into the dizzying array of architectural decisions facing database administrators, I suggest a read of our MySQL reference architecture.

I also suggest infrastructure teams learn how to collapse persistent block storage performance and spatial capacity into a single dimension, all while providing deterministic performance. For this, I recommend watching the session I gave with several of my colleagues at Red Hat Summit last year.

Local SSD

Sometimes we need to access data fast, really fast, and the best way to realize that is with locally attached SSDs. Most clouds make this possible with special instance flavors, modeled after the i3 instances provided by Amazon EC2. In OpenStack, the equivalent would be instances where the hypervisor uses PCIe passthrough for NVMe devices. These devices are best leveraged by applications that handle their own replication and fault tolerance, good examples being Cassandra and clustered Elasticsearch. Fast local devices are also useful as scratch space for intermediate shuffle data that doesn’t fit in memory, or even for S3A buffer files.

GPUs

Machine learning frameworks like TensorFlow, Torch, and Caffe can all benefit from GPU acceleration. With the burgeoning popularity of these frameworks, it’s important that infrastructure cater to them by providing instance flavors infused with GPU goodness. In OpenStack, this can be accomplished by passing through entire GPU devices in a fashion similar to that detailed in the Local SSD section, or by using GPU virtualization technologies like Intel GVT-g or NVIDIA GRID vGPU. OpenStack developers have been diligently integrating these technologies, and I’d recommend operations folks understand how to deploy them once these features mature.

Object storage

In both public and private clouds, deploying multiple analytics clusters backed by object storage is becoming increasingly popular. In the private cloud, a number of things are important to prepare a Ceph object store for data-intensive applications.

Bucket sharding

Bucket sharding has been enabled by default since Red Hat Ceph Storage 3.0. This feature spreads a bucket’s index across multiple shards, each with a corresponding RADOS object. Sharding is great for increasing the write throughput of a particular bucket, but it comes at the expense of LIST operations, because index entries are interleaved across shards and must be gathered before replying to a LIST request. Today, the S3A filesystem client performs many LIST requests, so it is advantageous to disable bucket sharding by setting rgw_override_bucket_index_max_shards to 1.
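
A minimal ceph.conf sketch for this setting (exactly which section it lives in varies by deployment; this assumes gateway settings under a [client.rgw] section):

[client.rgw]
rgw_override_bucket_index_max_shards = 1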

Bucket indexes on SSD

The Ceph object gateway uses distinct pools for objects and indexes, and as such those pools can be mapped to different device classes. Due to the S3A filesystem client’s heavy usage of LIST operations, it’s highly recommended that index pools be mapped to OSDs sitting on SSDs for fast access. In many cases, this can be achieved even on existing hardware by using the remaining space on devices that house OSD journals.
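
As a sketch, assuming a Luminous-based cluster with SSD device classes and the default zone’s pool names, the index pool could be steered to SSD OSDs with a dedicated CRUSH rule:

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd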

Erasure coding

Due to the immense storage requirements of data platforms, erasure coding for the data section of objects is a no-brainer. Compared to the 3x replication that’s common with HDFS, erasure coding reduces the required storage by 50%. When tens of petabytes are involved, that amounts to big savings! Most folks will probably end up using either 4+2 or 8+3 and spreading chunks across hosts using ruleset-failure-domain=host.
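
As a sketch (the profile name is arbitrary, and the PG counts are placeholders to be sized for your cluster), a 4+2 erasure-coded data pool with a host failure domain might be created like this:

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create default.rgw.buckets.data 256 256 erasure ec42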

Anatomy of the S3A filesystem client

Amazon introduced their Simple Storage Service (S3) in March 2006, which proved to be a watershed moment that ushered in the era of cloud computing services. It wasn’t long before folks started trying to use Amazon S3 in conjunction with Apache Hadoop; in fact, the first attempt was the S3 block filesystem, which was completed before the end of the year. This early integration stored data in a way that facilitated fast renames and deletes, but came at the expense of not being able to access data that had been written to S3 directly. Accessing data written by other applications was highly desirable, and by 2008 S3N, the S3 Native Filesystem, was merged into Apache Hadoop. The following year, Amazon introduced Elastic MapReduce, which included Amazon’s own closed-source S3 filesystem client.

Netflix was an early adopter of Elastic MapReduce, which was used to analyze data to improve streaming quality. This workload was one of three detailed in Amazon’s press release announcing Netflix’s intent to migrate a variety of applications to Amazon. Fast forward to 2013, and “any data set worth retaining” was stored in S3. To the best of my knowledge, this was the first public reference of a multi-cluster data platform that used S3 for shared storage. With all the chips on the table, Netflix and other cloud heavyweights started seriously thinking about how to better leverage the S3 API and improve S3 client performance. This led to the development of the successor to S3N, named S3A. The development of the S3A filesystem client has progressed as a series of phases and is still seeing loads of active development from the likes of Netflix, Cloudera, Hortonworks, and a whole host of others. Each phase of development is tracked in the Hadoop JIRA.

Downstream Hadoop distributions have done a terrific job of ensuring that juicy features are expeditiously backported, so even if your vendor’s distribution is based on an older Hadoop version, it’s likely ready to consume data stored in S3.

Working with Ceph

Back in the early days of Ceph, we quickly came to the realization that it would be useful to allow folks to use the S3 API to interact with their Ceph storage infrastructure. By doing so, we’d be able to leverage the might of the Amazon ecosystem, including all the SDKs and tools written to interact with Amazon S3. The Ceph object gateway was conceived for this application and is an essential ingredient in providing private infrastructure operators a means to extend cloud object storage modalities into their data center. The S3 API has an ever-expanding set of calls, and keeping up requires a lot of hard work. To keep us honest, we actively develop a functional test suite affectionately named s3-tests. The test suite is so useful that it’s even been adopted by other storage vendors to ensure their products can have the same high level of fidelity with the S3 API—How flattering!

So Ceph and S3 are like two high school buddies, and that’s great. But what’s required to ensure maximum fidelity with Amazon S3?

Endpoints

The Amazon S3 API provides a high-level container for objects which, in S3 parlance, is called a bucket. All objects are stored in exactly one bucket, and each bucket is mapped to a single Amazon S3 region. If you send a GET request for an object that lives in a bucket in another region, you’ll get a 302 redirect to the proper region. PUT requests sent to the wrong region will fail and be provided with the endpoint of the region the bucket calls home. If you have multiple Ceph clusters and want to emulate this behavior, you can do so by having each cluster be a distinct zone, in a distinct zonegroup, but with a common realm. For more details, refer to Ceph multi-site documentation.

If data engineers have a level of understanding about different endpoints available to them, then they can set fs.s3a.endpoint in a properties file such that it directs requests to the correct Ceph cluster or Amazon S3 region.

If you want to be helpful and know that an analytics cluster will only be talking to a single Ceph cluster, and not Amazon S3, then you might set fs.s3a.endpoint in core-site.xml. If you plan on having multiple Ceph clusters, or on communicating with both a Ceph cluster and the public cloud, then you have a few options. One is to set a default fs.s3a.endpoint in the analytics cluster’s core-site.xml to either an Amazon S3 regional endpoint or a Ceph endpoint. Application owners can still override this default with a properties file.

With S3A from Apache Hadoop 2.8.1, users can define different S3A properties for different buckets. This is handy for applications that might operate on data sets in more than one Ceph cluster or Amazon S3 region.
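
For instance, here’s a core-site.xml sketch pinning the endpoint for a single bucket (the bucket and endpoint names are hypothetical):

<property>
  <name>fs.s3a.bucket.tpc.endpoint</name>
  <value>https://ceph.example.com</value>
</property>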

Access control

Now that there is a level of understanding around endpoints, we can move on to authentication and access control. Amazon S3 has evolved over the years to provide more flexible control over who has access to what. Coarsely, the evolution can be broken down into multiple approaches.

  1. S3 ACLs
  2. Bucket policy with S3 users
  3. STS issued temporary credentials

The first approach is table stakes for any storage system that wants to allow applications to use the S3 API. Each user has an access key and a secret key, which they use to sign requests. Buckets always belong to a single user, and buckets can be specified as either public or private. Public buckets are the only means of sharing data between users, which is pretty inflexible. Ceph has supported basic S3 ACLs since the Ceph object gateway was introduced, and it’s not uncommon to see this be the only means of access control in other systems that tout an S3-compatible API. These access and secret keys can be mapped to the fs.s3a.access.key and fs.s3a.secret.key parameters in core-site.xml, in a properties file submitted with an application, or, the most secure option, in encrypted files using the Hadoop Credentials API.
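
As an illustration of the Hadoop Credentials API route (the keystore path here is hypothetical), the keys can be stored in a JCEKS keystore that core-site.xml then references via hadoop.security.credential.provider.path:

hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/alice/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/alice/s3.jceks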

Ceph doesn’t stop there. With Red Hat Ceph Storage 3.0, based on upstream Luminous, we added support for bucket policy, which allows cross-user bucket sharing. This means one user’s private bucket can be shared with another user through bucket policy.

This is all well and good, but sometimes you want to manage access control for groups of users instead of having to update a bunch of bucket policies whenever a user needs to be removed from a group. Amazon S3 doesn’t yet have its own notion of groups. Instead, Amazon has IAM, which is a means of enforcing role-based access control. IAM supports users, groups, and roles. This means you can create policies for a group instead of having policy entries for each individual user. Unfortunately, IAM groups cannot be principals in bucket policies. IAM roles are similar to users, but they do not have credentials. IAM roles delegate to some other authentication provider, and that authentication provider decides whether the request is allowed to assume a particular role.

This is where Amazon STS comes in. STS issues temporary authentication tokens, which are mapped to IAM roles, and that mapping is attested by an external identity provider. As such, the external credentials provider can attest to whether a particular user can receive a token that allows that user to assume an IAM role. The S3A filesystem client supports the use of these tokens by changing the S3A credentials provider and setting the fs.s3a.session.token parameter in addition to fs.s3a.access.key and fs.s3a.secret.key. The upstream community is currently working on STS and IAM support in the Ceph object gateway, which will bring all this goodness to folks interacting with Ceph object stores. If this is important to your organization, we’re interested in feedback that helps us prioritize STS actions like AssumeRoleWithSAML and AssumeRoleWithWebIdentity.

Bucket prefix vs. path

The AWS Java SDK for S3 defaults to using bucket prefix notation when sending requests and, by extension, so does the S3A filesystem client. If your Ceph object gateway endpoint is s3.example.com, then requests are sent to bucket.s3.example.com. To ensure your Ceph storage infrastructure behaves the same way as Amazon S3, you’ll need to configure a few things on the infrastructure side. The first is including the rgw_dns_name parameter in the [rgw] or [global] block of your ceph.conf configuration file. The value of this parameter in this scenario would be s3.example.com. Now, in order for the client to resolve the bucket subdomain, you’ll also need a wildcard DNS record in the form of *.s3.example.com that resolves to your gateway virtual IP address. To support SSL with bucket prefix notation, you’ll need a certificate with a wildcard subject alternative name (SAN) wherever it is being used to terminate TLS.

If you’re not on the infrastructure side of things, and you want to consume Ceph object infrastructure where this machinery hasn’t been configured, then you can opt for path style access by setting fs.s3a.path.style.access to true. In this configuration, requests will be sent to s3.example.com/bucket instead of bucket.s3.example.com.
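
In core-site.xml or a per-application properties file, that looks something like:

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>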

There is a third trick, which might be helpful in unusual scenarios, and that’s to create a bucket with uppercase characters. Because DNS isn’t case sensitive, the SDK automatically switches over to path style access.

Encryption

We talked a bit about SSL/TLS in the previous section, but only insofar as how to configure things on the infrastructure side. The S3A filesystem client defaults to using SSL/TLS. You can change this behavior by switching the fs.s3a.connection.ssl.enabled parameter to false, which is useful in scenarios like testing, or when SSL isn’t yet configured for your Ceph storage infrastructure. In a production environment, the adage “dance like nobody’s watching, and encrypt like everyone is” still applies. Check out this blog post for a thorough walk-through of configuring SSL/TLS for your Ceph object storage system.

In addition to transport encryption, objects can be encrypted. Amazon S3 provides two categories of object encryption: (1) encryption at the client and (2) server-side encryption. The S3A filesystem client does not support using client-side encryption, but it does support all three varieties of server-side encryption. The three server-side encryption options are:

  • SSE-S3: Keys managed internally by Amazon S3
  • SSE-C: Keys managed by the client and passed to Amazon S3 to encrypt/decrypt requests
  • SSE-KMS: Keys managed through Amazon’s Key Management Service (KMS)

With the advent of Red Hat Ceph Storage 3.0, we support the SSE-C flavor of server-side encryption. Configuring the S3A filesystem client to use it is a simple affair, involving only two parameters: (1) fs.s3a.server-side-encryption-algorithm should be set to SSE-C, and (2) fs.s3a.server-side-encryption.key should be set to your secret key. I suspect most folks will want to do this by way of a properties file.
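
A sketch of those two parameters in property form (the key value is a placeholder):

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-C</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>your-base64-encoded-secret-key</value>
</property>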

Upstream Ceph also has support for SSE-KMS, which is intended to be integrated with OpenStack Barbican for key management. I imagine it’s only a matter of time before this is also supported in Red Hat Ceph Storage.

Fast upload

There are two main ways of performing writes with the S3A filesystem client. The default mode buffers uploads to disk before sending a request to Amazon S3; it’s important to make sure that the directory you’re buffering to is large and fast. I prefer to specify a directory backed by a local SSD. Even with fast media, this can be expensive from an I/O perspective, so an alternative is to use in-memory buffers. The S3A filesystem client offers two in-memory buffer options: (1) one using arrays and (2) one using byte buffers. You have to be careful when using memory buffers, because it’s easy to run out of memory if you don’t properly size both your JVM and YARN containers. If the available memory permits, in-memory buffers can be much faster than their disk-based brethren.
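
A sketch of the relevant properties (the buffer directory is hypothetical; fs.s3a.fast.upload.buffer accepts disk, array, or bytebuffer):

<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/mnt/ssd/s3a</value>
</property>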

Giving it a try

There’s an easy way to learn how to use the S3A filesystem client with Ceph, and this section will walk you through setting up a development environment using Minishift. If you’ve never used Minishift before, it’s relatively painless to set up. The guides for installation on macOS, Windows, and Linux can be found here.

Once you’ve installed Minishift on your local system, drop into a terminal and fire it up.

minishift start

Now that we have a Minishift environment running, we’re going to get Ceph Nano running so we can use it as a S3A endpoint. To get started, you’ll need to download a couple of YAML files: ceph-nano.yml and ceph-rgw-keys.yml. Once those files are downloaded to your working directory, we can use the oc command line utility to deploy Ceph Nano:

oc --as system:admin adm policy add-scc-to-user anyuid \
  system:serviceaccount:myproject:default
oc create -f ceph-rgw-keys.yml
oc create -f ceph-nano.yml
oc expose pod ceph-nano-0 --type=NodePort

You should have a Ceph Nano service running at this point and may be wondering how you’re going to interact with it. Jupyter Notebooks have become wildly popular in the analytics and data-science communities because of their ability to create a single artifact, the notebook, which contains both code and documentation. As if that wasn’t enough, you can interact with them from the comfort of your web browser. Here at Red Hat, we’ve been fostering an awesome community called radanalytics.io. They’re hard at work empowering intelligent application development on OpenShift. They’ve provided a base notebook application that I used as a starting point for playing with S3A from Jupyter, because its container image is neatly bundled with Spark and PySpark libraries. You can get Jupyter up and running in your Minishift environment with just a few commands:

oc new-app https://github.com/mmgaggle/ceph-notebook \
  -e JUPYTER_NOTEBOOK_PASSWORD=developer \
  -e RGW_API_ENDPOINT=$(minishift openshift service ceph-nano-0 --url) \
  -e JUPYTER_NOTEBOOK_X_INCLUDE=http://mmgaggle-bd.s3.amazonaws.com/ceph-notebook.ipynb

oc env --from=secret/ceph-rgw-keys dc/ceph-notebook
oc expose svc/ceph-notebook
oc status

The oc status command will provide an http://ceph-notebook-myproject.$(IP).nip.io URL. Load that URL into your browser and follow along!

HTTPS-ization of Ceph object storage public endpoint

Hypertext Transfer Protocol Secure (HTTPS), the secure version of HTTP, uses encrypted communication between the user and the server. HTTPS avoids man-in-the-middle attacks by relying on the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols to establish an encrypted connection that shuttles data securely between a client and a server.

This blog post takes you step by step through the process of adding SSL/TLS security to Ceph object storage endpoints. Ceph is a scalable, open-source object storage solution that provides Amazon S3 and SWIFT compatible APIs you can use to build your own public or private cloud object storage solution. Learn more about Ceph at Red Hat Ceph Storage.

Setting up domain record sets

Let’s begin by setting up a domain name and configuring its record sets. If you already have a domain name you’d like to use for the Ceph endpoint, great. If not, you can purchase one from any domain registrar. In this example, we’ll use the domain name ceph-s3.com.

First, we need to add record sets to the domain so that the domain name resolves to the IP address of the Ceph RADOS Gateway or of the load balancer (LB) fronting your Ceph RADOS Gateways. To do this, log in to your domain registrar dashboard and create two record sets:

  1. An A-type record set mapping the domain name to the IPv4 address of your Internet-accessible LB or Ceph RGW host. (If you’re using IPv6, choose the AAAA type.)
  2. A CNAME-type record set mapping a wildcard subdomain name to your Internet-accessible LB or Ceph RGW host.

Note: The wildcard subdomain is important. We want all subdomains (e.g., bucket.ceph-s3.com) to resolve to the same IP address, because S3 treats a subdomain prefix as the bucket name.

Domain record set additions are highlighted in the following screenshot:

DNS changes usually take a few tens of minutes to propagate. Once the changes have synced, we should be able to ping the domain name, as well as any subdomain, from anywhere on the Internet, provided ICMP ingress traffic is allowed on the host.

karasing-OSX:~$ ping ceph-s3.com
PING ceph-s3.com (54.209.4.196): 56 data bytes
64 bytes from 54.209.4.196: icmp_seq=0 ttl=40 time=1095.231 ms
64 bytes from 54.209.4.196: icmp_seq=1 ttl=40 time=1099.570 ms
64 bytes from 54.209.4.196: icmp_seq=2 ttl=40 time=1199.266 ms
^C
--- ceph-s3.com ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1095.231/1131.356/1199.266/48.053 ms
karasing-OSX:~$
karasing-OSX:~$ ping anynameintheuniverse.ceph-s3.com
PING 54.209.4.196 (54.209.4.196): 56 data bytes
64 bytes from 54.209.4.196: icmp_seq=0 ttl=40 time=1491.105 ms
64 bytes from 54.209.4.196: icmp_seq=1 ttl=40 time=1262.021 ms
64 bytes from 54.209.4.196: icmp_seq=2 ttl=40 time=1205.943 ms
^C
--- 54.209.4.196 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1205.943/1319.690/1491.105/123.352 ms
karasing-OSX:~$

SSL certificate: Installation and setup

An SSL certificate is a set of files that binds an organization’s identity, domain name, IP address, and cryptographic keys. Once the SSL certificate is installed on the host server, a secure connection can be established between the host server and the client machine. In the client’s Internet browser, a green padlock appears next to the URL as a visual cue that traffic is protected.

Some organizations might desire a more advanced certificate that requires additional validation; these SSL certificates must be purchased from a trusted Certificate Authority (CA). For the sake of demonstration in this example, we’ll use Let’s Encrypt, a certificate authority that provides free X.509 certificates for Transport Layer Security (TLS) encryption. (Credit: Wikipedia, https://en.wikipedia.org/wiki/Let%27s_Encrypt; and thanks to Let’s Encrypt for the free service.)

Next, we’ll install epel-release and certbot, a CLI tool for requesting SSL certificates from the Let’s Encrypt CA.

yum install -y http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y certbot

Request an SSL certificate using the certbot CLI client

certbot certonly --manual -d ceph-s3.com -d '*.ceph-s3.com' --agree-tos \
  --manual-public-ip-logging-ok --preferred-challenges dns-01 \
  --server https://acme-v02.api.letsencrypt.org/directory

Note: The first -d option in the certbot CLI represents the base domain; subsequent -d options represent subdomains.

Important: Make sure you use a wildcard (*) for the subdomain, because we are requesting a wildcard subdomain SSL certificate from Let’s Encrypt. If a certificate is issued only for the base domain, it will not be compatible with subdomain (bucket prefix) notation.

The following snippet shows the output of certbot CLI command. The DNS challenge method will generate two DNS TXT records, which must be added as a TXT record set for your domain (from the domain registrar dashboard):

[root@ceph-admin ~]# certbot certonly --manual -d ceph-s3.com -d *.ceph-s3.com --agree-tos
--manual-public-ip-logging-ok --preferred-challenges dns-01 --server
https://acme-v02.api.letsencrypt.org/directory
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for ceph-s3.com
dns-01 challenge for ceph-s3.com

-------------------------------------------------------------------------------
Please deploy a DNS TXT record under the name
_acme-challenge.ceph-s3.com with the following value:

BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4   <== This Value

Before continuing, verify the record is deployed.

-------------------------------------------------------------------------------
Press Enter to continue

-------------------------------------------------------------------------------
Please deploy a DNS TXT record under the name
_acme-challenge.ceph-s3.com with the following value:

O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ    <== This Value

Before continuing, verify the record is deployed.

-------------------------------------------------------------------------------

Keep the certbot CLI command running and open your domain registrar dashboard. Deploy a new DNS TXT record under the name _acme-challenge.yourdomain.com, and enter both of these values (marked with <== in the preceding output) on two separate lines inside double quotes. The following screenshot shows how:

Keep the certbot CLI command running and, once you’ve successfully added the DNS TXT records to your domain record set, open another terminal. Now we’ll verify that the DNS TXT records have been applied to your domain by running host -t txt _acme-challenge.yourdomain.com. You should see the same TXT values as shown in the certbot CLI. If you don’t, wait a few minutes for DNS synchronization to occur.

karasing-OSX:~$ host -t txt _acme-challenge.ceph-s3.com
_acme-challenge.ceph-s3.com descriptive text
"BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4"
_acme-challenge.ceph-s3.com descriptive text
"O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ"
karasing-OSX:~$

Once you’ve verified DNS TXT records are applied to your domain, return to certbot CLI and press Enter to continue. You will be notified that the system is waiting for verification and cleaning up challenges. You should then see the following message:

Congratulations! Your certificate and chain have been saved at:
/etc/letsencrypt/live/ceph-s3.com/fullchain.pem
Your key file has been saved at:
/etc/letsencrypt/live/ceph-s3.com/privkey.pem
Your cert will expire on 2018-10-07. To obtain a new or tweaked
version of this certificate in the future, simply run certbot
again. To non-interactively renew *all* of your certificates, run
"certbot renew"

If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
Donating to EFF:                    https://eff.org/donate-le

[root@ceph-admin ~]#

At this point, SSL certificates have been issued to your domain/host, so now we must verify them.

[root@ceph-admin ~]# ls -l /etc/letsencrypt/live/ceph-s3.com/
total 4
lrwxrwxrwx 1 root root  35 Jul 9 19:06 cert.pem -> ../../archive/ceph-s3.com/cert1.pem
lrwxrwxrwx 1 root root  36 Jul 9 19:06 chain.pem -> ../../archive/ceph-s3.com/chain1.pem
lrwxrwxrwx 1 root root  40 Jul 9 19:06 fullchain.pem -> ../../archive/ceph-s3.com/fullchain1.pem
lrwxrwxrwx 1 root root  38 Jul 9 19:06 privkey.pem -> ../../archive/ceph-s3.com/privkey1.pem
-rw-r--r-- 1 root root 682 Jul  9 19:06 README
[root@ceph-admin ~]#

Installing LB for object storage service

Many people like HAProxy because it’s easy to use and does its job well. In this example, we’ll use HAProxy to perform SSL termination for our domain name (the Ceph object storage endpoint).

Note: Starting with Red Hat Ceph Storage 2, the Ceph RADOS Gateway natively supports TLS by relying on the OpenSSL library. You can get more information on native SSL/TLS configuration here. In this example, we specifically chose to terminate SSL at the HAProxy level. This has an advantage: when we have multiple Ceph RGW instances, we don’t need separate domain names and SSL certificates for each of them; one domain name with SSL termination at the LB does the job.

Next, we’ll install HAProxy on the same host whose public IP is bound to your domain name, in this case ceph-s3.com.

yum install -y haproxy

Next, we’ll create a certs directory.

mkdir -p /etc/haproxy/certs

Next, combine the certificate files fullchain.pem and privkey.pem into a single file for our domain:

DOMAIN='ceph-s3.com' sudo -E bash -c 'cat /etc/letsencrypt/live/$DOMAIN/fullchain.pem /etc/letsencrypt/live/$DOMAIN/privkey.pem > /etc/haproxy/certs/$DOMAIN.pem'

The next step is to change the permission of the certs directory.

chmod -R go-rwx /etc/haproxy/certs

Optionally, you can move the original haproxy.cfg file aside and create a new config file with the following configuration settings:

mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.orig
vim /etc/haproxy/haproxy.cfg
global
    log        127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    tune.ssl.default-dh-param 2048     
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                    global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    option httpchk HEAD /
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
frontend www-http
   bind 10.100.0.10:80
   reqidel                      ^X-Forwarded-For:.*
   reqadd X-Forwarded-Proto:\ http
   default_backend www-backend
   option  forwardfor
frontend www-https
   bind 10.100.0.10:443 ssl crt /etc/haproxy/certs/ceph-s3.com.pem
   reqadd X-Forwarded-Proto:\ https
   acl letsencrypt-acl path_beg /.well-known/acme-challenge/
   use_backend letsencrypt-backend if letsencrypt-acl
   default_backend www-backend
backend www-backend
   redirect scheme https if !{ ssl_fc }
   server ceph-admin 10.100.0.10:8081 check inter 2000 rise 2 fall 5
backend letsencrypt-backend
   server letsencrypt 127.0.0.1:54321

As you may have noted, we’re using a couple of non-default parameters in the haproxy config file, such as:

  • tune.ssl.default-dh-param provides OpenSSL the necessary Diffie-Hellman parameters for the SSL/TLS handshake.
  • frontend www-http binds HAProxy to port 80 of the local machine and sends traffic to the default backend, www-backend, which redirects plain HTTP requests to HTTPS.
  • frontend www-https binds HAProxy to port 443 of the local machine. It also sends traffic to the default backend, www-backend, and uses the SSL certificate to terminate/decrypt the traffic.
  • frontend www-https also uses letsencrypt-backend, which is useful if you want to auto-renew the SSL certificate from the Let’s Encrypt CA.
  • backend www-backend simply forwards all the SSL-terminated traffic to the Ceph RGW node at 10.100.0.10:8081. For HA and performance, you should run multiple Ceph RGW instances and add their IPs to the same backend section so that HAProxy can load balance among them.

Finally, we restart HAProxy and verify it’s listening on ports 80 and 443:

systemctl start haproxy
systemctl status haproxy ; netstat -plunt | grep -i haproxy

Configuring Ceph RGW

Up to this point, we’ve configured the domain name, set the required record sets, generated an SSL certificate, and configured HAProxy to terminate SSL and redirect traffic to the Ceph RGW instance. Now we need to configure Ceph RGW to listen on port 8081, as configured in HAProxy. To do so, on the Ceph RGW node, edit /etc/ceph/ceph.conf and update the client.rgw section as shown in the following:

[client.rgw.ceph-admin]
host = ceph-admin
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-admin/keyring
log file = /var/log/ceph/ceph-rgw-ceph-admin.log
rgw frontends = civetweb port=10.100.0.10:8081 num_threads=512
rgw resolve cname = true
rgw dns name = ceph-s3.com

Note: rgw resolve cname = true forces rgw to use the DNS CNAME record of the request hostname field (if the hostname is not equal to rgw dns name).

Note: rgw dns name = ceph-s3.com is the DNS name of the served domain.

Now, we’ll restart the Ceph RGW instance and verify it’s listening on port 8081.

systemctl restart ceph-radosgw@rgw.ceph-admin.service
netstat -plunt | grep -i rados

Accessing Ceph object storage secure endpoint

To test the HTTPS-enabled Ceph object storage URL, execute the following curl command or type https://ceph-s3.com in any web browser:

curl https://ceph-s3.com

It should yield output like the following:

[student@ceph-admin ~]$ curl https://ceph-s3.com
anonymous
[student@ceph-admin ~]$
[student@ceph-admin ~]$

Let’s try accessing Ceph object storage using S3cmd:

yum install -y s3cmd

Configure the S3cmd CLI by providing config options like access/secret keys and the Ceph S3 secure endpoint in the host/host-bucket parameters.

s3cmd --access_key=S3user1 --secret_key=S3user1key --host=ceph-s3.com \
  --host-bucket="%(bucket)s.ceph-s3.com" --dump-config > /home/student/.s3cfg

Note: By default, s3cmd uses an HTTPS connection, so there is no need to specify it explicitly.

Next, we’ll interact with Ceph object storage using the s3cmd ls and s3cmd mb commands:

[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
[student@ceph-admin ~]$
[student@ceph-admin ~]$ s3cmd mb s3://secure_bucket
Bucket 's3://secure_bucket/' created
[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
2018-07-09 19:55  s3://secure_bucket
[student@ceph-admin ~]$

Congratulations! You’ve successfully secured your Ceph object storage endpoint using a domain name of your choice and SSL certificates.

Closing

As you can see, acquiring and setting up SSL certificates involves some careful configuration, and how easy and fast it is depends on your chosen CA. With initiatives like “HTTPS Everywhere,” it’s no longer just web sites hosting deliverable content that must have SSL; API and service endpoints should also offer encrypted transport.

Note: HTTPS is designed to prevent eavesdropping and man-in-the-middle attacks. Always practice defense in depth: multiple layers of security are needed to more fully secure your web site and service endpoints.

What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising; we’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze its advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004, showing how to build reliable data processing platforms from commodity components. The hardware landscape then was vastly different from what we see in contemporary datacenters. The specifications of the cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with the speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of a data transfer rate of 50 MB/s. To determine the available bisection bandwidth per host, we divide the cluster-wide bisection bandwidth by the number of hosts.
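
Back-of-the-envelope numbers under those assumptions:

  • Per-host disk throughput: 2 disks x 50 MB/s = 100 MB/s (~0.8 Gb/s)
  • Per-host network interface: 1 Gb/s (~125 MB/s)
  • Per-host share of bisection bandwidth: 100 Gb/s / 1800 hosts ≈ 55 Mb/s (~7 MB/s)
  • Oversubscription: (1800 x 1 Gb/s) / 100 Gb/s = 18:1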

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained by contemporary Hadoop nodes with 12 SATA disks and a 10GbE network interface. Once we leave the host and arrive at the network bisection, the challenge facing Google’s engineers is immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone led to the development of the MapReduce locality optimization.

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.

Advantages

When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.

Disadvantages

Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

One key example is large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that, even with these abilities, it’s not uncommon to see workloads interfere with each other. The result? Compromised service-level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components for which an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest and greatest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload-dedicated clusters with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.

Closing

Datacenter network fabrics are vastly different from what they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat-tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these fabrics has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.

Why Spark on Ceph? (Part 3 of 3)

Introduction

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we in the Red Hat Storage solutions architecture team wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (this blog post)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series in mid-July, providing detailed descriptions, test data, and configuration scenarios.

Findings summary

We did Ceph vs. HDFS testing with a variety of workloads (see blog Part 2 of 3 for general workload descriptions). As expected, the price/performance comparison varied based on a number of factors, summarized below.

Clearly, many factors contribute to overall solution price. As storage capacity is frequently a major component of big data solution price, we chose it as a simple proxy for price in our price/performance comparison.

The primary factor affecting storage capacity price in our comparison was the data durability scheme used. With 3x replication data durability, a customer needs to buy 3PB of raw storage capacity to get 1PB of usable capacity. With erasure coding 4:2 data durability, a customer only needs to buy 1.5PB of raw storage capacity to get 1PB of usable capacity. The primary data durability scheme used by HDFS is 3x replication (support for HDFS erasure coding is emerging, but is still experimental in several distributions).  Ceph has supported either erasure coding or 3x replication data durability schemes for years. All Spark-on-Ceph early adopters we worked with are using erasure coding for cost efficiency reasons. As such, most of our tests were run with Ceph erasure coded clusters (we chose EC 4:2). We also ran some tests with Ceph 3x replicated clusters to provide apples-to-apples comparison for those tests.
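The capacity arithmetic behind that comparison can be sketched in a few lines of Python; this is a generic illustration of replication vs. erasure-coding overhead, not a model of any specific cluster:

def raw_capacity(usable_pb, scheme):
    """Raw capacity needed for a given usable capacity under a durability scheme."""
    if scheme == "3x":
        return usable_pb * 3.0     # three full copies of every byte
    k, m = scheme                  # erasure coding: k data chunks + m coding chunks
    return usable_pb * (k + m) / k

print(raw_capacity(1.0, "3x"))     # 3.0 PB raw for 1 PB usable
print(raw_capacity(1.0, (4, 2)))   # 1.5 PB raw for 1 PB usable with EC 4:2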

Using the proxy for relative price noted above, Figure 1 provides an HDFS v. Ceph price/performance summary for the workloads indicated:

Figure 1: Relative price/performance comparison, based on results from eight different workloads

Figure 1 depicts price/performance comparisons based on eight different workloads. Each of the eight individual workloads was run with both HDFS and Ceph storage back-ends. The storage capacity price of the Ceph solution relative to the HDFS solution is either the same or 50% less. When the workload was run with Ceph 3x replicated clusters, the storage capacity price is shown as the same as HDFS. When the workload was run with Ceph erasure coded 4:2 clusters, the Ceph storage capacity price is shown as 50% less than HDFS. (See the previous discussion on how data durability schemes affect solution price.)

For example, workload 8 had similar performance with either Ceph or HDFS storage, but the Ceph storage capacity price was 50% of the HDFS storage capacity price, as Ceph was running an erasure coded 4:2 cluster. In other examples, workloads 1 and 2 had similar performance with either Ceph or HDFS storage and also had the same storage capacity price (workloads 1 and 2 were run with a Ceph 3x replicated cluster).

Findings details

A few details are provided here for the workloads tested with both Ceph and HDFS storage, as depicted in Figure 1.

  1. This workload was a simple test to compare aggregate read throughput via TestDFSIO. As shown in Figure 2, this workload performed comparably between HDFS and Ceph when Ceph also used 3x replication. When Ceph used erasure coding 4:2, the workload performed better than either HDFS or Ceph 3x for lower numbers of concurrent clients (<300). With more client concurrency, however, the workload performance on Ceph 4:2 dropped due to spindle contention (a single read with erasure coded 4:2 storage requires 4 disk accesses, vs. a single disk access with 3x replicated storage).

    Figure 2: TestDFSIO read results
  2. This workload compared the SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2.

    Figure 3: Single-user Spark query set results
  3. This workload compared Impala query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was 57% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  4. This mixed workload featured concurrent execution of a single user running SparkSQL queries (54), 10 users each running Impala queries (54 each), and a data set merge/join job enriching TPC-DS web sales data with synthetic clickstream logs. As illustrated in Figure 1, the aggregate execution time of this mixed workload on Ceph EC4:2 was 48% slower compared to HDFS. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  5. This workload was a simple test to compare aggregate write throughput via TestDFSIO. As depicted in Figure 1, this workload performed, on average, 50% slower on Ceph EC4:2 compared to HDFS, across a range of concurrent clients/writers. However, price/performance was nearly comparable, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  6. This workload compared SparkSQL query performance of a single user executing a series of queries (the 54 TPC-DS queries, as described in blog 2 of 3). As illustrated in Figure 3, the aggregate query time was comparable when running against either HDFS or Ceph 3x replicated storage. The aggregate query time doubled when running against Ceph EC4:2. However, price/performance was nearly comparable when running against Ceph EC4:2, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.
  7. This workload featured enrichment (merge/join) of TPC-DS web sales data with synthetic clickstream logs, and then writing the updated web sales data. As depicted in Figure 4, this workload was 37% slower on Ceph EC4:2 compared to HDFS. However, price/performance was favorable for Ceph, as the HDFS storage capacity costs were 2x those of Ceph EC4:2.

    Figure 4: Data set enrichment (merge/join/update) job results
  8. This workload compared the SparkSQL query performance of 10 users, each executing a series of queries concurrently (the 54 TPC-DS queries were executed by each user in a random order). As illustrated in Figure 1, the aggregate execution time of this workload on Ceph EC4:2 was roughly comparable to that of HDFS, despite requiring only 50% of the storage capacity costs. Price/performance for this workload thus favors Ceph by 2x. For more insight into this workload’s performance, see Figure 5. In this box-and-whisker plot, each dot reflects a single SparkSQL query execution time. As each of the 10 users concurrently executes 54 queries, there are 540 dots per series. The three series shown are Ceph EC4:2 (green), Ceph 3x (red), and HDFS 3x (blue). The Ceph EC4:2 box shows median execution times comparable to HDFS 3x, and more consistent query times in the middle two quartiles.
Figure 5: Multi-user Spark query set results

Bonus results section: 24-hour ingest

One of our prospective Spark-on-Ceph customers recently asked us to illustrate Ceph cluster sustained ingest rate over a 24-hour time period. For these tests, we used variations of the lab as described in blog 2 of 3. As noted in Figure 6, we measured a raw ingest rate of approximately 1.3 PiB per day into a Ceph EC4:2 cluster configured with 700 HDD data drives (Ceph OSDs).
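As a rough plausibility check on that figure, the sketch below assumes the 1.5x write amplification of EC 4:2 and an even spread across the 700 data drives; the actual per-drive load in our tests would have varied:

PIB = 2**50
ingest_bytes_s = 1.3 * PIB / 86_400        # ~17 GB/s of client data, sustained
raw_bytes_s = ingest_bytes_s * 1.5         # EC 4:2 writes 1.5 raw bytes per usable byte
per_drive_mb_s = raw_bytes_s / 700 / 1e6   # spread across 700 data drives

print(round(per_drive_mb_s))               # ~36 MB/s per drive: well within HDD write rates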

Figure 6: Daily data ingest rate into Ceph clusters of various sizes

Concluding observations

In conclusion, below is our formative cost/benefit analysis of the above results summarizing this blog series.

  • Benefits, Spark-on-Ceph vs. Spark on traditional HDFS:
    1. Reduce CapEx by reducing duplication: Reduce PBs of redundant storage capacity purchased to store duplicate data sets in HDFS silos, when multiple analytics clusters need access to the same data sets.
    2. Reduce OpEx/risk: Eliminate costs of scripting/scheduling data set copies between HDFS silos, and reduce risk-of-human-error when attempting to maintain consistency between these duplicate data sets on HDFS silos, when multiple analytics clusters need access to the same data sets.
    3. Accelerate insight from new data science clusters: Reduce time-to-insight when spinning up new data science clusters by analyzing data in-situ within a shared data repository, as opposed to hydrating (copying data into) a new cluster before beginning analysis.
    4. Satisfy different tool/version needs of different data teams: While sharing data sets between teams, enable users within each cluster to choose the Spark/Hadoop tool sets and versions appropriate to their jobs, without disrupting users from other teams requiring different tools/versions.
    5. Right-size CapEx infrastructure costs: Reduce the over-provisioning of either compute or storage common with traditional HDFS clusters, which grow by adding generic nodes (regardless of whether only more CPU cores or storage capacity is needed), by right-sizing compute needs (vCPU/RAM) independently from storage capacity needs (throughput/TB).
    6. Reduce CapEx by improving data durability efficiency: Reduce CapEx of storage capacity purchased by up to 50% due to Ceph erasure coding efficiency vs. HDFS default 3x replication.
  • Costs, Spark-on-Ceph vs. Spark on traditional HDFS:

    1. Query performance: Performance of Spark/Impala query jobs ranged from 0% to 131% longer execution times (single-user and multi-user concurrency tests).
    2. Write-job performance: Performance of write-oriented jobs (loading, transformation, enrichment) ranged from 37% to 200%+ longer execution times. [Note: Significant improvements in write-job performance are expected when downstream distributions adopt the following upstream enhancements to the Hadoop S3A client: HADOOP-13600, HADOOP-13786, HADOOP-12891.]
    3. Mixed-workload performance: Concurrent execution of multiple query and enrichment jobs resulted in 90% longer execution times.

For more details (and a hands-on chance to kick the tires of this solution yourself), stay tuned for the architect-level blog series in this same Red Hat Storage blog location. Thanks for reading.

Why Spark on Ceph? (Part 2 of 3)

Introduction

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (see “Why Spark on Ceph? (Part 1 of 3)”)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (this blog post)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph (Part 3 of 3)“)

For those wanting more depth, we’ll cross-link to a separate architect-level blog series providing detailed descriptions, test data, and configuration scenarios.

Basic analytics pipeline using a Ceph object store

Our early adopter customers are ingesting, querying, and transforming data directly to and from a shared Ceph object store.  In other words, target data locations for their analytics jobs are something like “s3://bucket-name/path-to-file-in-bucket” within Ceph, instead of something like “hdfs:///path-to-file”.  Direct access to S3-compatible object stores via analytics tools like Spark, Hive, and Impala is made possible via the Hadoop S3A client.
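As an illustration of what this access pattern looks like from Spark, here is a minimal PySpark sketch. The endpoint, credentials, and bucket names are placeholders rather than values from our labs; the fs.s3a.* keys are the standard Hadoop S3A client options, and the hadoop-aws JARs are assumed to be on the classpath:

from pyspark.sql import SparkSession

# Placeholder endpoint/credentials/bucket; substitute values for your environment.
spark = (SparkSession.builder
         .appName("ceph-s3a-sketch")
         .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.com:8080")  # Ceph RGW
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")  # RGW commonly uses path-style URLs
         .getOrCreate())

# Query a (hypothetical) table directly in the object store via the s3a:// scheme, no HDFS involved.
web_sales = spark.read.parquet("s3a://tpcds/web_sales/")
web_sales.createOrReplaceTempView("web_sales")
spark.sql("SELECT COUNT(*) FROM web_sales").show()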

Jointly with several customers, we successfully ran 1000s of analytics jobs directly against a Ceph object store using the following analytics tools:

Figure 1: Analytics tools tested with shared Ceph object store

In addition to running simplistic tests like TestDFSIO, we wanted to run analytics jobs representative of real-world workloads. To do that, we based our tests on the TPC-DS benchmark for ingest, transformation, and query jobs. TPC-DS generates synthetic data sets and provides a set of sample queries intended to model the analytics environment of a large retail company with sales operations from stores, catalogs, and the web. Its schema has tens of tables, with billions of records in some tables. It defines 99 pre-configured queries, from which we selected the 54 most IO-intensive for our tests. With partners in industry, we also supplemented the TPC-DS data set with simulated click-stream logs, 10x larger than the TPC-DS data set size, and added SparkSQL jobs to join these logs with TPC-DS web sales data.

In summary, we ran the following directly against a Ceph object store:

  • Bulk Ingest (bulk load jobs – simulating high volume streaming ingest at 1PB+/day)
  • Ingest (MapReduce jobs)
  • Transformation (Hive or SparkSQL jobs which convert plain-text data into Parquet or ORC columnar, compressed formats; a sketch of this pattern follows the list)
  • Query (Hive or SparkSQL jobs – frequently run in batch/non-interactive mode, as these tools automatically restart failed jobs)
  • Interactive Query (Impala or Presto jobs)
  • Merge/join (Hive or SparkSQL jobs joining semi-structured click-stream data with structured web sales data)
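As a sketch of the transformation pattern in the list above (reusing the hypothetical SparkSession from the earlier snippet; the bucket names and CSV input format are again assumptions for illustration):

# Read plain-text (CSV) source data straight from the object store...
raw = (spark.read
       .option("header", "true")
       .csv("s3a://tpcds/raw-text/store_sales/"))      # hypothetical bucket/prefix

# ...and rewrite it as compressed, columnar Parquet for faster queries.
(raw.write
 .mode("overwrite")
 .option("compression", "snappy")
 .parquet("s3a://tpcds/parquet/store_sales/"))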

Architecture overview

We ran variations of the tests outlined above with 4 large customers over the past year. Generally speaking, our architecture looked something like this:

Figure 2: High-level lab architecture

Did it work?

Yes. 1000s of the analytics jobs described above completed successfully. SparkSQL, Hive, MapReduce, and Impala jobs all ran using the S3A client to read and write data directly to a shared Ceph object store. The related architect-level blog series will document detailed lessons learned and configuration techniques.

In the final episode of this blog series, we’ll get to the punch line: what was the performance compared to traditional HDFS? Stay tuned for part 3 of this series…

Why Spark on Ceph? (Part 1 of 3)

A couple years ago, a few big companies began to run Spark and Hadoop analytics clusters using shared Ceph object storage to augment and/or replace HDFS.

We set out to find out why they were doing it and how it performs.

Specifically, we wanted to know first-hand answers to the following three questions:

  1. Why would companies do this? (this blog post)
  2. Will mainstream analytics jobs run directly against a Ceph object store? (see “Why Spark on Ceph? (Part 2 of 3)”)
  3. How much slower will it run than natively on HDFS? (see “Why Spark on Ceph? (Part 3 of 3)”)

We’ll provide summary-level answers to these questions in a 3-part blog series.  In addition, for those wanting more depth, we’ll cross-link to a separate reference architecture blog series providing detailed descriptions, test data, and configuration scenarios.

Part 1: Why would companies do this?

Agility of many, the power of one.
The agility of many analytics clusters, with the power of one shared data store.
(Ok … enough with the simplistic couplets.)

Here are a few common problems that emerged from speaking with 30+ companies:

  • Teams that share the same analytics cluster are frequently frustrated because someone else’s job often prevents their job from finishing on-time.
  • In addition, some teams want the stability of older analytic tool versions on their clusters, while their peer teams need to load the latest-and-greatest tool releases.
  • As a result, many teams demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own needs.
  • However, each separate analytics cluster typically has its own, non-shared HDFS data store – creating data silos.
  • And to provide access to the same data sets across the silos, the data platform team frequently copies datasets between the HDFS silos, trying to keep them consistent and up-to-date.
  • As a result, companies end up maintaining many separate, fixed analytics clusters (50+ in one case), each with its own HDFS data silo containing redundant copies of PBs of data, while maintaining an error-prone maze of scripts to keep data sets updated across silos.
  • But the resulting cost of maintaining 5, 10, or 20 copies of multi-PB datasets on the various HDFS silos is prohibitive for many companies (both CapEx and OpEx).

In pictures, their core problems and resulting options look something like this:

Figure 1. Core problems

Figure 2. Resulting Options

Turns out that the AWS ecosystem built a solution for choice #3 (see Figure 2 above) years ago through the Hadoop S3A filesystem client. In AWS, you can spin up many analytics clusters on EC2 instances and share data sets between them on Amazon S3 (e.g., see Cloudera CDH support for Amazon S3). No more lengthy delays hydrating HDFS storage after spinning up new clusters, or de-staging HDFS data upon cluster termination. With the Hadoop S3A filesystem client, Spark/Hadoop jobs and queries can run directly against data held within a shared S3 data store.

Bottom line: more and more data scientists and analysts are accustomed to spinning up analytics clusters quickly on AWS with access to shared data sets, without time-consuming HDFS data-hydration and de-staging cycles, and they expect the same capability on-premises.

Ceph is the #1 open-source, private-cloud object storage platform, providing S3-compatible object storage.  It was (and is) the natural choice for these companies looking to provide an S3-compatible shared data lake experience to their analysts on-premises.

Stay tuned for the next blog in the series, ‘Why Spark on Ceph? (Part 2 of 3): Will mainstream analytics jobs run directly against a Ceph object store?’

Introducing Red Hat Storage One

More than a year ago, our Storage Architecture team set out to answer the question of how we can overcome the last barriers to software-defined storage (SDS) adoption. We know from our thousands of test cycles and hundreds of hours of data analysis that a properly deployed Gluster or Ceph system can easily compete with, and often surpass, the feature and performance capabilities of any proprietary storage appliance, usually at a fraction of the cost, based on our experience and rigorous study. We have many customer success stories to back up these claims with real-world deployments for enterprise workloads. However, one piece of feedback is consistent and clear: Not every customer is willing or prepared to commit the resources necessary to architect a system to the best standards for their use case. The barrier to entry is simply higher than for a comparable proprietary appliance that is sold in units of storage with specific workload-based performance metrics. The often-lauded flexibility of SDS is, in these cases, its Achilles’ heel.

Continue reading “Introducing Red Hat Storage One”

Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform

By Annette Clewett, Humble Chirammal, Daniel Messer, and Sudhir Prasad

With today’s release of Red Hat OpenShift Container Platform 3.9, you will now have the convenience of deploying Container-Native Storage (CNS) 3.9 built on Red Hat Gluster Storage as part of the normal OpenShift deployment process in a single step. At the same time, major improvements in ease of operation have been introduced to give you the ability to monitor provisioned storage consumption, expand persistent volume (PV) capacity without downtime, and use a more intuitive naming convention for persistent volume names.

Continue reading “Container-native storage 3.9: Enhanced day-2 management and unified installation within OpenShift Container Platform”

The third one’s a charm

By Federico Lucifredi, Red Hat Storage

Red Hat Ceph Storage 3 is our annual major release of Red Hat Ceph Storage, and it brings great new features to customers in the areas of containers, usability, and raw technology horsepower. It includes support for CephFS, giving us a comprehensive, all-in-one storage solution in Ceph spanning block, object, and file alike. It introduces iSCSI support to provide storage to platforms like VMWare ESX and Windows Server that currently lack native Ceph drivers. And we are introducing support for client-side caching with dm-cache.

On the usability front, we’re introducing new automation to manage the cluster with less user intervention (dynamic bucket sharding), a troubleshooting tool to analyze and flag invalid cluster configurations (Ceph Medic), and a new monitoring dashboard (Ceph Metrics) that brings enhanced insight into the state of the cluster.

Last, but definitely not least, containerized storage daemons (CSDs) drive a significant improvement in TCO through better hardware utilization.

Containers, containers, never enough containers!

We graduated to fully supporting our Ceph distribution running containerized in Docker application containers in June 2017 with the 2.3 release, after more than a year of open testing of tech-preview images.

Red Hat Ceph Storage 3 raises the bar by introducing colocated CSDs as a supported configuration. CSDs drive a significant TCO improvement through better hardware utilization; the baseline object store cluster we recommend to new users spans 10 OSD storage nodes, 3 MON controllers, and 3 RGW S3 gateways. By allowing colocation, the smaller MON and RGW nodes can now run on the OSDs’ hardware, allowing users to avoid not only the capital expense of those servers but also the ongoing operational cost of managing them. Pricing that configuration using a popular hardware vendor, we estimate that users could see a 24% hardware cost reduction or, alternatively, add 30% more raw storage for the same initial hardware invoice.
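To show how an estimate of that shape can arise, here is a purely illustrative Python sketch. The per-server prices are invented for the example and are not Red Hat or vendor pricing; only the node counts come from the baseline configuration above:

osd_nodes, small_nodes = 10, 6            # 10 OSD nodes + 6 smaller MON/RGW nodes (from the text)
osd_price, small_price = 15_000, 8_000    # hypothetical per-server prices, not real pricing

dedicated = osd_nodes * osd_price + small_nodes * small_price   # 198,000
colocated = osd_nodes * osd_price                               # 150,000: MON/RGW share OSD hosts
print(f"{1 - colocated / dedicated:.0%}")                       # 24%, in line with the estimate above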

“All nodes are storage nodes now!”

We are accomplishing this improvement by colocating any of the Ceph scale-out daemons on the storage servers, one per host. Containers reserve RAM and CPU allocations that protect both the OSD and the co-located daemon from resource starvation during rebalancing or recovery load spikes. We can currently colocate all the scale-out daemons except the new iSCSI gateway, but we expect that in the short term MON, MGR, RGW, and the newly supported MDS should take the lion’s share of these configurations.

As my marketing manager is fond of saying, all nodes are storage nodes now! Just as important, we can field a containerized deployment using the very same ceph-ansible playbooks our customers are familiar with and have come to love. Users can conveniently learn how to operate with containerized storage while still relying on the same tools, and we continue to support RPM-based deployments. So if you would rather see others cross the chasm first, that is totally okay as well: you can continue operating with RPMs and Ansible as you are accustomed to.

CephFS: now fully supportawesome

The Ceph filesystem, CephFS for friends, is the Ceph interface providing the abstraction of a POSIX-compliant filesystem backed by the storage of a RADOS object storage cluster. CephFS already achieved reliability and stability last year, but with this new version the MDS directory metadata service is fully scale-out, eliminating our last remaining concern about its production use. In Sage Weil’s own words, it is now fully awesome!

“CephFS is now fully awesome!” —Sage Weil

With this new version, CephFS is now fully supported by Red Hat. For details about CephFS, see the Ceph File System Guide for Red Hat Ceph Storage 3. While I am on the subject, I’d like to give a shout-out to the unsung heroes in our awesome storage documentation team: They have steadily introduced high-quality guides with every release, and our customers are surely taking notice.

iSCSI and NFS: compatibility trifecta

Earlier this year, we introduced the first version of our NFS gateway, allowing a user to mount an S3 bucket as if it were an NFS folder, for quick bulk import and export of data from the cluster, as literally every device out there speaks NFS natively. In this release, we’re enhancing the NFS gateway with support for NFS v3 alongside the existing NFS v4 support.

The remaining leg of our legacy compatibility plan is iSCSI. While iSCSI is not ideally suited to a scale-out system like Ceph, the use of multipathing for failover makes the fit smoother than one would expect, as no explicit HA is needed to manage failover.

With Red Hat Ceph Storage 3, we’re bringing to GA the iSCSI gateway service that we have been previewing during the past year. While we continue to favor the LibRBD interface as it is more featureful and delivers better performance, iSCSI makes perfect sense as a fall-back to connect VMWare and Windows servers to Ceph storage, and generally anywhere a native Ceph block driver is not yet available. With this initial release, we are supporting VMWare ESX 6.5, Windows Server 2016, and RHV 4.x over an iSCSI interface, and you can expect to see us adding more platforms to the list of supported clients next year as we plan to increase the reach of our automated testing infrastructure.

¡Arriba, arriba! ¡Ándale, ándale!

Red Hat’s famous Performance and Scale team has revisited client-side caching tuning with the new codebase and blessed an optimized configuration for dm-cache that can now be easily set up with ceph-volume, the new up-and-coming tool that is slated by the community to eventually give the aging ceph-disk a well-deserved retirement.

Making things faster is important, but equally important is insight into performance metrics. The new dashboard is well deserving of a blog post in its own right, but let’s just say that it plainly makes available a significant leap in performance monitoring for Ceph environments, starting with the cluster as a whole and drilling into individual metrics or individual nodes as needed to track down performance issues. Select users have been patiently testing our early builds with Luminous this summer, and their consistently positive feedback makes us confident you will love the results.

Linux monitoring has many flavors, and while we supply tools as part of the product, customers often want to integrate their existing infrastructure, whether it is Nagios alerting in very binary tones that something seems to be wrong, or another tool. For this reason, we joined forces with our partners at Datadog to introduce a joint configuration for SaaS monitoring of Ceph using Datadog’s impressive tools.

Get the stats

More than 30 features are landing in this release alongside our rebasing of the enterprise product to the Luminous codebase. These map to almost 500 bugs in our downstream tracking system and hundreds more upstream in the Luminous 12.2.1 release we started from. I’d like to briefly call attention to about 20 of them that our very dedicated global support team prioritized for us as the most impactful way to further smooth out the experience of new users and move forward on our march toward making Ceph evermore enterprise-ready and easy to use. This is our biggest release yet, and its timely delivery 3 months after the upstream freeze is an impressive achievement for our Development and Quality Assurance engineering teams.

As always, those of you with an insatiable thirst for detail should read the release notes next—and feel free to ping me on Twitter if you have any questions!

The third one’s a charm

Better economics through improved hardware utilization; great value-add for customers in the form of new access modes in file, iSCSI, and NFS compatibility; and improved usability alongside across-the-board technological advancement: these are the themes we tried to hit with this release. I think we delivered… but we aren’t done yet. We plan to send more stuff your way this year! Stay tuned.

But if you can’t wait to hear more about the new object storage features in Red Hat Ceph Storage 3, read this blog post by Uday Boppana.
