Running OpenShift Container Storage 3.10 with Red Hat OpenShift Container Platform 3.10

By Annette Clewett and Jose A. Rivera

With the release of Red Hat OpenShift Container Platform 3.10, we’ve officially rebranded what used to be referred to as Red Hat Container-Native Storage (CNS) as Red Hat OpenShift Container Storage (OCS). Versioning remains sequential (i.e., OCS 3.10 is the follow-on to CNS 3.9). You’ll continue to have the convenience of deploying OCS 3.10 as part of the normal OpenShift deployment process in a single step, and an OpenShift Container Platform (OCP) evaluation subscription now includes access to OCS evaluation binaries and subscriptions.

OCS 3.10 introduces an important feature for container-based storage with OpenShift. Arbiter volume support will provide the same level of resiliency as regular GlusterFS volumes but with 30% less storage required. This release also hardens block support for backing OpenShift infrastructure services. Detailed information on the value and use of OCS 3.10 features can be found here.

OCS 3.10 installation with OCP 3.10 Advanced Installer

Let’s now take a look at the installation of OCS with the OCP Advanced Installer. OCS can provide persistent storage for both OCP’s infrastructure applications (e.g., integrated registry, logging, and metrics) and general application data. Typically, both options are used in parallel, resulting in two separate OCS clusters being deployed in a single OCP environment. It’s also possible to use a single OCS cluster for both purposes.

Following is an example of a partial inventory file with selected options concerning deployment of OCS for applications and an additional OCS cluster for infrastructure workloads like registry, logging, and metrics storage. When using these options for your deployment, values with specific sizes (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors (e.g., node-role.kubernetes.io/infra=true) should be adjusted for your particular deployment needs.

If you’re planning to use gluster-block volumes for logging and metrics, they can now be installed when OCP is installed. (Of course, they can also be installed later.)

[OSEv3:children]
...
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
...      
# registry
openshift_hosted_registry_storage_kind=glusterfs       
openshift_hosted_registry_storage_volume_size=10Gi   
openshift_hosted_registry_selector="node-role.kubernetes.io/infra=true"

# logging
openshift_logging_install_logging=true
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_cluster_size=3
openshift_logging_es_pvc_storage_class_name='glusterfs-registry-block'
openshift_logging_kibana_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_curator_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}

# metrics
openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=dynamic
openshift_metrics_storage_volume_size=20Gi
openshift_metrics_cassandra_pvc_storage_class_name='glusterfs-registry-block'
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/infra": "true"}
# Container image to use for glusterfs pods
openshift_storage_glusterfs_image="registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.10"

# Container image to use for gluster-block-provisioner pod
openshift_storage_glusterfs_block_image="registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.10"

# Container image to use for heketi pods
openshift_storage_glusterfs_heketi_image="registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.10"
 
# OCS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=false   
# OCS storage cluster for OpenShift infrastructure
openshift_storage_glusterfs_registry_namespace=infra-storage  
openshift_storage_glusterfs_registry_storageclass=false       
openshift_storage_glusterfs_registry_block_deploy=true   
openshift_storage_glusterfs_registry_block_host_vol_create=true    
openshift_storage_glusterfs_registry_block_host_vol_size=200   
openshift_storage_glusterfs_registry_block_storageclass=true
openshift_storage_glusterfs_registry_block_storageclass_default=false

...
[nodes]
ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-infra-node01.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node02.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node03.ocpgluster.com openshift_node_group_name="node-config-infra"
[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
ose-infra-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'

Inventory file options explained

The first section of the inventory file defines the host groups the installation will use. We’ve defined two new groups: (1) glusterfs and (2) glusterfs_registry. The settings for these groups start with openshift_storage_glusterfs_ and openshift_storage_glusterfs_registry_, respectively. In each group, the nodes that will make up the OCS cluster are listed, and the devices ready for exclusive use by OCS are specified (glusterfs_devices=).

The first group of hosts in glusterfs specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes for the general-purpose application cluster, glusterfs.

The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for use exclusively by a hosted registry that can scale. This cluster will not offer a StorageClass for file-based PersistentVolumes with the options and values as they are currently configured (openshift_storage_glusterfs_registry_storageclass=false). This cluster will also support gluster-block (openshift_storage_glusterfs_registry_block_deploy=true). PersistentVolume creation can be done via StorageClass glusterfs-registry-block (openshift_storage_glusterfs_registry_block_storageclass=true). Special attention should be given to choosing the size for openshift_storage_glusterfs_registry_block_host_vol_size. This is the hosting volume for gluster-block devices that will be created for logging and metrics. Make sure that the size can accommodate all these block volumes and that you have sufficient storage if another hosting volume must be created.

If you want to tune the installation, more options are available in the Advanced Installation documentation. To automate generation of the required inventory file options shown previously, check out the newly available red-hat-storage tool called “CNS Inventory file Creator,” or CIC (alpha version at this time). The CIC tool creates CNS or OCS inventory file options for OCP 3.9 and OCP 3.10, respectively. CIC asks a series of questions about the OpenShift hosts, the storage devices, and the sizes of PersistentVolumes for registry, logging, and metrics, and it has baked-in checks to help make sure the OCP installation will be successful. This tool is currently in an alpha state, and we’re looking for feedback. Download it from the github repository openshift-cic.

Single OCS cluster installation

Again, it is possible to support both general-application storage and infrastructure storage in a single OCS cluster. To do this, the inventory file options change slightly for logging and metrics, because when there is only one cluster, the gluster-block StorageClass is glusterfs-storage-block. The registry PV will be created on this single cluster if the second cluster, [glusterfs_registry], does not exist. For high availability, it’s very important to have four nodes for this cluster. Also, special attention should be given to choosing the size for openshift_storage_glusterfs_block_host_vol_size. This is the hosting volume for gluster-block devices that will be created for logging and metrics. Make sure that the size can accommodate all these block volumes and that you have sufficient storage if another hosting volume must be created.

[OSEv3:children]
...
nodes
glusterfs

[OSEv3:vars]
...      
# registry
...

# logging
openshift_logging_install_logging=true
...
openshift_logging_es_pvc_storage_class_name='glusterfs-storage-block'
... 

# metrics
openshift_metrics_install_metrics=true
...
openshift_metrics_cassandra_pvc_storage_class_name='glusterfs-storage-block'
...

# OCS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=true
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_block_storageclass=true
openshift_storage_glusterfs_block_storageclass_default=false
...

[nodes]

ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

OCS 3.10 uninstall

With the OCS 3.10 release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing an OCS installation that is currently being used by any applications, you should remove those applications before removing OCS, because they will lose access to storage. This includes infrastructure applications like registry, logging, and metrics that have PV claims created using the glusterfs-storage and glusterfs-storage-block Storage Class resources.

You can remove logging and metrics resources by re-running the deployment playbooks like this:

ansible-playbook -i <path_to_inventory_file> -e
"openshift_logging_install_logging=false"
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

ansible-playbook -i <path_to_inventory_file> -e
"openshift_logging_install_metrics=false"
/usr/share/ansible/openshift-ansible/playbooks/openshift-metrics/config.yml

Make sure to manually remove any logging or metrics PersistentVolumeClaims. The associated PersistentVolumes will be deleted automatically.
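
As a sketch of that cleanup (assuming the default logging and openshift-infra namespaces; adjust names to match your deployment), the leftover claims can be listed and removed with oc:

oc get pvc -n logging
oc delete pvc <logging_es_pvc_name> -n logging
oc get pvc -n openshift-infra
oc delete pvc <metrics_cassandra_pvc_name> -n openshift-infra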

If you have the registry using a glusterfs PersistentVolume, remove it with the following commands:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If you’re running uninstall.yml because a deployment failed, run the playbook with the following variables to wipe the storage devices for both glusterfs and glusterfs_registry before trying the OCS installation again:

ansible-playbook -i <path_to_inventory file> -e
"openshift_storage_glusterfs_wipe=True" -e
"openshift_storage_glusterfs_registry_wipe=true"
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

OCS 3.10 post installation for applications, registry, logging and metrics

You can add OCS clusters and resources to an existing OCP install using the following command. This same process can be used if OCS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml

After the new cluster(s) is created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-hosted/config.yml

You can now deploy logging and metrics resources by re-running these deployment playbooks:

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-metrics/config.yml

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, watch this short video explaining why to use OCS with OCP. Detailed information on the value and use of OCS 3.10 features can be found here.

Improved volume management for Red Hat OpenShift Container Storage 3.10

By Annette Clewett and Husnain Bustam

Hopefully by now you’ve seen that with the release of Red Hat OpenShift Container Platform 3.10 we’ve rebranded our container-native storage (CNS) offering to be called Red Hat OpenShift Container Storage (OCS). Versioning remains sequential (i.e, OCS 3.10 is the follow on to CNS 3.9).

OCS 3.10 introduces important features for container-based storage with OpenShift. Arbiter volume support will provide the same level of resiliency as regular GlusterFS volumes but with 30% less storage required. This release also hardens block support for backing OpenShift infrastructure services. In addition to supporting arbiter volumes, major improvements to ease operations are available to give you the ability to monitor provisioned storage consumption, expand persistent volume (PV) capacity without downtime to the application, and use a more intuitive naming convention for PVs.

For easy evaluation of these features, an OpenShift Container Platform evaluation subscription now includes access to OCS evaluation binaries and subscriptions.

New features

Now let’s dive deeper into the new features of the OCS 3.10 release:

  • Prometheus OCS volume metrics: Volume consumption metrics (e.g., volume capacity, available space, number of inodes in use, number of inodes free) are now available in Prometheus for OCS. These metrics let you monitor storage capacity and consumption trends and take timely action to ensure applications are not impacted.
  • Heketi topology and configuration metrics: Available from the Heketi HTTP metrics service endpoint, these metrics can be viewed using Prometheus or curl http://<heketi_service_route>/metrics. These metrics can be used to query heketi health, number of nodes, number of devices, device usage, and cluster count.
  • Online expansion of provisioned storage: You can now expand the OCS-backed PVs within OpenShift by editing the corresponding claim (oc edit pvc <claim_name>) with the new desired capacity (spec → requests → storage: new value).
  • Custom volume naming: Before this release, the names of dynamically provisioned GlusterFS volumes were auto-generated with a random UUID. Now, by adding a custom volume name prefix, the GlusterFS volume name includes the namespace (project) as well as the claim name, making it much easier to map a volume to a particular workload.
  • Arbiter volumes: Arbiter volumes allow for reduced storage consumption and better performance across the cluster while still providing the redundancy and reliability expected of GlusterFS.

Volume and Heketi metrics

As of OCP 3.10 and OCS 3.10, the following metrics are available in Prometheus (the heketi_* metrics can also be retrieved by executing curl http://<heketi_service_route>/metrics):

kubelet_volume_stats_available_bytes: Number of available bytes in the volume
kubelet_volume_stats_capacity_bytes: Capacity in bytes of the volume
kubelet_volume_stats_inodes: Maximum number of inodes in the volume
kubelet_volume_stats_inodes_free: Number of free inodes in the volume
kubelet_volume_stats_inodes_used: Number of used inodes in the volume
kubelet_volume_stats_used_bytes: Number of used bytes in the volume
heketi_cluster_count: Number of clusters
heketi_device_brick_count: Number of bricks on device
heketi_device_count: Number of devices on host
heketi_device_free: Amount of free space available on the device
heketi_device_size: Total size of the device
heketi_device_used: Amount of space used on the device
heketi_nodes_count: Number of nodes on the cluster
heketi_up: Verifies if heketi is running
heketi_volumes_count: Number of volumes on cluster
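
As an illustration of how these metrics might be used, the following Prometheus alerting rule is a sketch (the 10% threshold and the wiring of the rule into your Prometheus deployment are assumptions, not product defaults) that flags OCS-backed PVs running low on space:

groups:
- name: ocs-volume-capacity
  rules:
  - alert: PersistentVolumeSpaceLow
    # Fire when less than 10% of a volume's capacity remains free
    expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has less than 10% free space"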

Populating Heketi metrics in Prometheus requires additional configuration of the Heketi service. You must add the prometheus.io/scheme and prometheus.io/scrape annotations using the following commands:

# oc annotate svc heketi-storage prometheus.io/scheme=http
# oc annotate svc heketi-storage prometheus.io/scrape=true
# oc describe svc heketi-storage
Name:           heketi-storage
Namespace:      app-storage
Labels:         glusterfs=heketi-storage-service
                heketi=storage-service
Annotations:    description=Exposes Heketi service
                prometheus.io/scheme=http
                prometheus.io/scrape=true
Selector:       glusterfs=heketi-storage-pod
Type:           ClusterIP
IP:             172.30.90.87
Port:           heketi  8080/TCP
TargetPort:     8080/TCP

Populating Heketi metrics in Prometheus also requires additional configuration of the Prometheus configmap. As shown in the following, you must modify the Prometheus configmap with the namespace of the Heketi service and restart the prometheus-0 pod:

# oc get svc --all-namespaces | grep heketi
app-storage      heketi-storage       ClusterIP 172.30.90.87  <none>  8080/TCP
# oc get cm prometheus -o yaml -n openshift-metrics
....
- job_name: 'kubernetes-service-endpoints'
   ...
   relabel_configs:
     # only scrape infrastructure components
     - source_labels: [__meta_kubernetes_namespace]
       action: keep
       regex: 'default|logging|metrics|kube-.+|openshift|openshift-.+|app-storage'
# oc scale --replicas=0 statefulset.apps/prometheus
# oc scale --replicas=1 statefulset.apps/prometheus

Online expansion of GlusterFS volumes and custom naming

First, let’s discuss what’s needed to allow expansion of GlusterFS volumes. This opt-in feature is enabled by configuring the StorageClass for OCS with the parameter allowVolumeExpansion set to “true” and by enabling the ExpandPersistentVolumes feature gate. You can then dynamically resize storage volumes attached to containerized applications without needing to first detach and then reattach a storage volume with increased capacity, which enhances application availability and uptime.

Enable the ExpandPersistentVolumes feature gate on all master nodes:

# vim /etc/origin/master/master-config.yaml
kubernetesMasterConfig:
  apiServerArguments:
    feature-gates:
    - ExpandPersistentVolumes=true
# /usr/local/bin/master-restart api
# /usr/local/bin/master-restart controllers

This release also adds support for a custom volume name prefix. Parameterizing the StorageClass (volumenameprefix: myPrefix) allows easier identification of volumes in the GlusterFS backend.

The new OCS PVs will be created with the volume name prefix, project name/namespace, claim name, and UUID (<myPrefix>_<namespace>_<claimname>_UUID), making it easier for you to automate day-2 admin tasks like backup and recovery, applying policies based on pre-ordained volume nomenclature, and other day-2 housekeeping tasks.

In this StorageClass, support for both online expansion of OCS/GlusterFS PVs and custom volume naming has been added.

# oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
parameters:
  resturl: http://heketi-storage-storage.apps.ose-master.example.com
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: storage
  volumenameprefix: gf ❶
allowVolumeExpansion: true ❷
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

❶ Custom volume name support: <volumenameprefixstring>_<namespace>_<claimname>_UUID
❷ Parameter needed for online expansion or resize of GlusterFS PVs

Be aware that PV expansion is not supported for block volumes, only for file volumes.

Expanding a volume starts with editing the PVC field “requests: storage” with the new desired size for the PersistentVolume. For example, if we have a 1GiB PV and want to expand it to 2GiB, we simply edit the PVC field “requests: storage” with the new value, and the PV is automatically resized to 2GiB. The new 2GiB size will be reflected in OCP, heketi-cli, and gluster commands. The expansion process creates another replica set, converting the 3-way replicated volume into a distributed-replicated volume (2×3 bricks instead of 1×3).
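
For example (the claim name db-claim is hypothetical), the same resize can be done from the command line instead of oc edit, and the result verified afterward:

# oc patch pvc db-claim -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
# oc get pvc db-claim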

GlusterFS arbiter volumes

Arbiter volume support is new to OCS 3.10 and has the following advantages:

  • An arbiter volume is still a 3-way replicated volume for highly available storage.
  • Arbiter bricks do not store file data; they only store file names, structure, and metadata.
  • Arbiter uses client quorum to compare this metadata with metadata of other nodes to ensure consistency of the volume and prevent split brain conditions.
  • Using Heketi commands, it is possible to control arbiter brick placement using tagging so that all arbiter bricks are on the same node.
  • With control of arbiter brick placement, the ‘arbiter’ node can have limited storage compared to other nodes in the cluster.

The following example has two gluster volumes configured across 5 nodes to create two 3-way arbitrated replicated volumes, with the arbiter bricks on a dedicated arbiter node.
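
A sketch of the tagging workflow mentioned above (node IDs are placeholders; look them up with heketi-cli topology info): setting arbiter:required on the dedicated node steers arbiter bricks to it, while arbiter:disabled keeps the data nodes free for full bricks.

# heketi-cli node settags <arbiter_node_id> arbiter:required
# heketi-cli node settags <data_node_id> arbiter:disabled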

In order to use arbiter volumes with OCP workloads, an additional parameter must be added to the GlusterFS StorageClass: user.heketi.arbiter true. In this StorageClass, support for the online expansion of GlusterFS PVs, custom volume naming, and arbiter volumes has been added.

# oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
parameters:
  resturl: http://heketi-storage-storage.apps.ose-master.example.com
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: storage
  volumenameprefix: gf ❶
  volumeoptions: user.heketi.arbiter true ❸
allowVolumeExpansion: true ❷
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

❶ Custom volume name support: <volumenameprefixstring>_<namespace>_<claimname>_UUID
❷ Parameter needed for online expansion or resize of GlusterFS volumes
❸ Enable arbiter volume support in the StorageClass. All the PVs created from this StorageClass will be 3-way arbitrated replicated volumes.

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, check out this short video explaining why using OCS with OpenShift is the right choice for your container storage infrastructure. For details on running OCS 3.10 with OCP 3.10, click here.

Breaking down data silos with Red Hat infrastructure

By Brent Compton, Senior Director, Technical Marketing, Red Hat Cloud Storage and HCI

Breaking down barriers to innovation.
Breaking down data silos.

These are arguably two of the top items on many enterprises’ wish lists. In the world of analytics infrastructure, people have described a solution to these needs as “multi-tenant workload isolation with shared storage.” Several public-cloud-based analytics solutions exist to provide this. However, many large Red Hat customers are doing large-scale analytics in their own data centers and were unable to solve these problems with their on-premises analytic infrastructure solutions. They turned to Red Hat private cloud platforms as their analytics infrastructure and achieved just this: multi-tenant workload isolation with shared storage. To be clear, Red Hat is not providing these customers with analytics tools. Instead, it is welcoming these analytics tools onto the same Red Hat infrastructure platforms running much of the rest of their other enterprise workloads.

Traditional on-premises analytics infrastructures do not provide on-demand provisioning for short-running analytics workloads, frequently needed by data scientists. In addition, traditional HDFS-based infrastructures do not share storage between analytics clusters. As such, traditional analytics infrastructures often don’t meet the competing needs of multiple teams needing different types of clusters, all with access to common data sets. Individual teams can end up competing for the same set of cluster resources, causing congestion in busy analytics clusters, leading to frustration and delays in getting insights from their data.

As a result, a team may demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own workload needs. Without a shared storage repository, this can lead to multiple analytic cluster silos, each with its own copy of data. Net result? Cost duplication and the burden of maintaining and tracking multiple data set copies.

An answer to these challenges? Bring your analytics workloads onto a common, scalable infrastructure.

Red Hat has seen customers solve these challenges by breaking down traditional Hadoop silos and bringing analytics workloads onto a common, private cloud infrastructure running in today’s enterprise datacenters. At its core is Red Hat Ceph Storage, our massively scalable, software-defined object storage platform, which enables organizations to more easily share large-scale data sets between analytics clusters. The on-demand provisioning of virtualized analytics clusters is enabled through Red Hat OpenStack Platform. Additionally, early adopters are deploying Apache Spark in kubernetes-orchestrated, container-based clusters via Red Hat OpenShift Container Platform. Delivery and support are provided by the IT experts at Red Hat Consulting based on documented leading practices to help establish an optimal architecture for our clients’ unique requirements.

Key benefits to customers

Agility

  • Get answers faster. By enabling teams to elastically provision their own dedicated analytics compute resources via Red Hat OpenStack Platform, teams have avoided cluster resource competition in order to better meet service-level agreements (SLAs). And teams can spin up these new analytics clusters without lengthy data-hydration delays (made possible by accessing shared data sets on Red Hat Ceph Storage).
  • Remove roadblocks. Empower teams of data scientists to use the analytics tools/versions they need through dynamically provisioned data labs and workload clusters (while still accessing shared data sets).
  • Hybrid cloud versatility. Enable your query authors to use the same S3 syntax in their queries, whether running on a private cloud or public cloud. Spark and other popular analytics tools can use the Hadoop S3A client to access data in S3-compatible object storage, in place of native HDFS. Ceph is the most popular S3-compatible open-source object storage backend for OpenStack.

Cost/risk reduction

  • Cut costs associated with data set duplication. In traditional Hadoop/Spark HDFS clusters, data is not shared. If a data scientist wants to analyze data sets that exist in two different clusters, they may need to copy data sets from one cluster to the other. This can result in duplicate costs for multi-PB data sets that must be copied among many analytics clusters.
  • Reduce risks of maintaining duplicate data sets. Duplicate data-set maintenance can be time-consuming and prone to error, but it can also result in incomplete or inaccurate insights being derived from stale data.
  • Scale costs based on requirements. In traditional Hadoop/Spark HDFS clusters, capacity is added by procuring more HDFS nodes with a fixed ratio of CPU and storage capacity. With Red Hat data analytics infrastructure, customers can provision compute servers separately from a common storage pool and thus can scale each resource according to need. By freeing storage capacity from compute cores previously locked together, companies can scale storage capacity costs independently of compute costs according to need.

Innovation for today’s data needs

As data continues to grow, organizations should have a supporting infrastructure that can break down data silos and enable teams to access and use information in more agile ways. Red Hat platforms can foster greater agility, efficiency, and savings–a nice combination for today’s data-driven organizations looking to build analytics applications across the open hybrid cloud.

You can also find our blog post that covers other news from the Strata conference and upstream community projects here. For more details on empirical test results, see here. For a video whiteboard of these topics, see here. Finally, to learn more, visit www.redhat.com/bigdata.

Introducing Red Hat Gluster Storage 3.4: Feature overview

By Anand Paladugu, Principal Product Manager

We’re pleased to announce that Red Hat Gluster Storage 3.4 is now Generally Available!

Since this release is a full rebase with the upstream, it consolidates many bug fixes, thus giving you a greater degree of overall stability for both container storage and traditional file serving use cases. Given that Red Hat OpenShift Container Storage is based on Red Hat Gluster Storage, these fixes will also be embedded in the 3.10 release of OpenShift Container Storage. To enable you to refresh your Red Hat Enterprise Linux (RHEL) 6-based Red Hat Gluster Storage installations, this release supports upgrading your Red Hat Gluster Storage servers from RHEL 6 to RHEL 7. Last, you can now deploy Red Hat Gluster Storage Web Administrator with minimal resources, which also offers robust and feature-rich monitoring capabilities.

Here is an overview of the new features delivered in Red Hat Gluster Storage 3.4:

Support for upgrading Red Hat Gluster Storage from RHEL 6 to RHEL 7

Many customers like to ensure they’re on the latest and greatest RHEL in their infrastructures. Two scenarios are now supported for upgrading RHEL servers in a Red Hat Gluster Storage deployment from RHEL 6 to RHEL 7:

  1. Red Hat Gluster Storage version is <= 3.3.x and the underlying RHEL version is <= latest version of 6.x. The upgrade process updates Red Hat Gluster Storage to version 3.4 and the underlying RHEL version to the latest version of RHEL 7.
  2. Red Hat Gluster Storage version is 3.4 and the underlying RHEL version is the latest version of 6.x. The upgrade process keeps the Red Hat Gluster Storage version at 3.4 and upgrades the underlying RHEL version to the latest version of RHEL 7.

MacOS client support

Mac workstations continue to make inroads into corporate infrastructures. Red Hat Gluster Storage 3.4 supports macOS as a Server Message Block (SMB) client, allowing customers to map SMB shares backed by Red Hat Gluster Storage in the macOS Finder.

Punch hole support for third-party applications

The “punch hole” feature provides the benefit of freeing up physical disk space when portions of a file are de-referenced. For example, suppose a backup file occupies 20GB of disk space, and portions of it become de-referenced due to data duplication. Without punch hole support, those 20GB remain occupied on the underlying physical hard disk. With support for punch holes, however, third-party applications can “punch a hole” corresponding to the de-referenced portions, thereby freeing up physical disk space. This helps reduce the storage costs associated with backing up and archiving virtual machines (VMs).
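
As a rough illustration of what such a third-party application does under the covers (the path, offset, and length are made up), the fallocate utility can punch a hole in a file on a Gluster mount, and du shows the allocated space shrinking while the file size stays the same:

# fallocate --punch-hole --offset 4G --length 2G /mnt/glustervol/backups/vm-image.qcow2
# du -h /mnt/glustervol/backups/vm-image.qcow2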

Subdirectory exports using the Gluster Fuse protocol now fully supported

Beginning with Red Hat Gluster Storage 3.4, subdirectory export using Fuse is now fully supported. This feature provides namespace isolation: a single Gluster volume can be shared with many clients, each mounting only a subset of the volume (namespace), i.e., a subdirectory. You can also export a subdirectory of an already exported volume to use the space left in the volume for a different project.
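
A minimal sketch of what a client mount then looks like (host, volume, and subdirectory names are hypothetical): each client mounts only the subdirectory it has been given, not the whole volume.

# mount -t glusterfs gluster-node1:/projects/teamA /mnt/teamA
# mount -t glusterfs gluster-node1:/projects/teamB /mnt/teamB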

Red Hat Gluster Storage web admin enhancements

The Web Administration tool delivers browser-based graphing, trending, monitoring, and alerting for Red Hat Gluster Storage in the enterprise. This latest Red Hat Gluster Storage release optimizes this web admin tool to consume fewer resources and allow greater scaling to monitor larger clusters than in the past.

Faster directory lookups using the Gluster NFS-Ganesha server

In Red Hat Gluster Storage 3.4, the Readdirp API is extended and enhanced to return handles along with directory stats as part of its reply, thereby reducing NFS operations latency.

In internal testing, performance gains were noticed for all directory operations when compared to Red Hat Gluster Storage 3.3.1. For example, make-directory operations improved by up to 31%, file create operations by up to 42%, and file read operations by up to 150%.

Want to learn more?

For hands-on experience with Red Hat Gluster Storage, check out our test drive.

Red Hat OpenShift Container Platform 3.10 with Container-Native Storage 3.9

This post documents how to install Container-Native Storage 3.9 (CNS 3.9) with OpenShift Container Platform 3.10 (OCP 3.10). CNS provides persistent storage for OCP’s general-application consumption and for the registry.

CNS 3.9 installation with OCP 3.10 advanced installer

The deployment of CNS 3.9 can be accomplished using openshift-ansible playbooks and specific inventory file options. The first group of hosts in glusterfs specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes for the general-purpose application cluster, glusterfs. The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for use exclusively by a hosted registry that can scale. This cluster will not offer a StorageClass for file-based PersistentVolumes with the options and values as they are currently configured.

Following is an example of a partial inventory file with selected options concerning deployment of CNS 3.9 for applications and registry. When using options for deployment with values of specific sizes (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors (e.g., node-role.kubernetes.io/infra=true), they should be adjusted for your particular deployment needs.

[OSEv3:children]
...
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
...      
# registry
openshift_hosted_registry_storage_kind=glusterfs       
openshift_hosted_registry_storage_volume_size=10Gi   
openshift_hosted_registry_selector="node-role.kubernetes.io/infra=true"

# Container image to use for glusterfs pods
openshift_storage_glusterfs_image="registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.9"
# Container image to use for gluster-block-provisioner pod
openshift_storage_glusterfs_block_image="registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.9"
# Container image to use for heketi pods
openshift_storage_glusterfs_heketi_image="registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.9"
 
# CNS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=false

# CNS storage cluster for OpenShift infrastructure
openshift_storage_glusterfs_registry_namespace=infra-storage  
openshift_storage_glusterfs_registry_storageclass=false       
openshift_storage_glusterfs_registry_block_deploy=false   
openshift_storage_glusterfs_registry_block_host_vol_create=false    
openshift_storage_glusterfs_registry_block_host_vol_size=100  
openshift_storage_glusterfs_registry_block_storageclass=false
openshift_storage_glusterfs_registry_block_storageclass_default=false   

...
[nodes]

ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-infra-node01.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node02.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node03.ocpgluster.com openshift_node_group_name="node-config-infra"

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
ose-infra-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'

CNS 3.9 uninstall

With this release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing a CNS installation that is currently being used by any applications, you should remove those applications before removing CNS, because they will lose access to storage. This includes infrastructure applications like registry.

If you have the registry using a glusterfs PersistentVolume, remove it with the following commands:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If running the uninstall.yml because a deployment failed, run the uninstall.yml playbook with the following variables to wipe the storage devices for both glusterfs and glusterfs_registry clusters before trying the CNS installation again:

ansible-playbook -i <path_to_inventory file> -e
"openshift_storage_glusterfs_wipe=True" -e 
"openshift_storage_glusterfs_registry_wipe=true" 
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

CNS 3.9 post installation for applications and registry

You can add CNS clusters and resources to an existing OCP install using the following command. This same process can be used if CNS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file> 
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml

After the new cluster(s) is created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file> 
/usr/share/ansible/openshift-ansible/playbooks/openshift-hosted/config.yml

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, watch this short video explaining why to use CNS with OpenShift.

Introducing OpenShift Container Storage: Meet the new boss, same as the old boss!

By Steve Bohac, Product Marketing

Today, we’re introducing Red Hat OpenShift Container Storage 3.10.

Is this product new to you? It surely is—that’s because with the announcement today of Red Hat OpenShift Container Platform 3.10, we’ve rebranded our container-native storage (CNS) offering to now be referred to as Red Hat OpenShift Container Storage. This is still the same product with the strong customer momentum we announced a few months ago during Red Hat Summit week.

Why the new name? “Red Hat OpenShift Container Storage” better reflects the product offering and its strong affinity with Red Hat OpenShift Container Platform. Not only does it install with OpenShift (via Red Hat Ansible), it’s developed, qualified, tested, and versioned coincident with OpenShift Container Platform releases. This product name best reflects that strong integration. Again, the product itself didn’t change in any way—all that’s changed is the product name.

Red Hat OpenShift Container Storage enables application portability and a consistent user experience across the hybrid cloud.

This new release, Red Hat OpenShift Container Storage 3.10, is the follow-on to Container-Native Storage 3.9 and introduces three important features for container-based storage with OpenShift: (1) arbiter volume support enabling high availability with efficient storage utilization and better performance, (2) enhanced storage monitoring and configuration visibility using the OpenShift Prometheus framework, and (3) block-backed persistent volumes (PVs) now supported for general application workloads in addition to supporting OCP infrastructure workloads.

If you haven’t already bookmarked our Red Hat Storage blog, now would be a great time! Over the coming weeks, we will be publishing deeper discussions on OpenShift Container Storage. In the meantime, though, for a more thorough understanding of OpenShift Container Storage, check out these recent technical blogs describing in depth the value of our approach to storage for containers:

Want to learn more?

For more information on OpenShift Container Storage, click here. Also, you can find the new Red Hat OpenShift Container Storage datasheet here.

For hands-on experience combining OpenShift and OpenShift Container Storage, check out our test drive, a free, in-browser lab experience that walks you through using both.

For more general information around storage for containers, check out our Container Storage for Dummies book.

Storing tables in Ceph object storage

Introduction

In one of our previous posts, Anatomy of the S3A filesystem client, we showed how Spark can interact with data stored in a Ceph object storage in the same fashion it would interact with Amazon S3. This is all well and good if you plan on exclusively writing applications in PySpark or Scala, but wouldn’t it be great to allow anyone who is familiar with SQL to interact with data stored in Ceph?

That’s what SparkSQL is for, and while Spark has the ability to infer schema, it’s a lot easier if the data is already described in a metadata service like the Hive Metastore. The Hive Metastore stores table schema information and statistics on tables and partitions, and it generally aids the query planners of various SQL engines in constructing efficient query plans. So, regardless of whether you’re using good ol’ Hive, SparkSQL, Presto, or Impala, you’ll still be storing and retrieving metadata from a centralized store. Even if your organization has standardized on a single query engine, it still makes sense to have a centralized metadata service, because you’ll likely have distinct workload clusters that will want to share at least some data sets.

Architecture

The Hive Metastore can be housed in a local Apache Derby database for development and experimentation, but a more production-worthy approach would be to use a relational database like MySQL, MariaDB, or Postgres. In the public cloud, a best practice is to store the database tables on a distinct volume to get features like snapshots, and the ability to detach and reattach it to a different instance. In the private cloud, where OpenStack reigns supreme, most folks have turned to Ceph to provide block storage. To learn more about how to leverage Ceph block storage for database workloads, I suggest taking a look at the MySQL reference architecture we authored in conjunction with the open source database experts over at Percona.

While you can configure Hive, Spark, or Presto to interact directly with the MySQL database containing the Metastore, interacting with the Hive Server 2 Thrift service provides better concurrency and an improved security posture. Overall, the general idea is depicted in the following diagram:

Storing tabular data as objects

In a greenfield environment where all data will be stored in the object store, you could simply set hive.metastore.warehouse.dir to an S3A location a la s3a://hive/warehouse. If you haven’t already had a chance to read our Anatomy of the S3A filesystem client post, you should take a look if you’re interested in learning how to configure S3A to interact with a local Ceph cluster instead of Amazon S3. When an S3A location is used as the Metastore warehouse directory, all tables that are created will default to being stored in that particular bucket, under the warehouse pseudo directory. A better approach is to utilize external locations to map databases, tables, or simply partitions to different buckets – perhaps so they can be secured with distinct access controls or other bucket policy features. An example of including an external location specification during table creation might be:

create external table inventory
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
row format delimited fields terminated by '|'
location 's3a://tpc/inventory';

That’s it. When you interact with this inventory table, data will be read directly from the object store by way of the S3A filesystem client. One of the cool aspects of this approach is that the location is abstracted away: you can write queries that scan tables with different locations, or even scan a single table with multiple locations. In this fashion, you might have recent data partitions with a MySQL external location, and data older than the current week in partitions with external locations that point to object storage. Cool stuff!
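
For the greenfield case mentioned earlier, a minimal hive-site.xml sketch pointing the Metastore warehouse at an S3A location (using the s3a://hive/warehouse example from above) would look like this:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3a://hive/warehouse</value>
</property>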

Serialization, partitions, and statistics

We all want to be able to analyze data sets quickly, and there are a number of tools available to help realize this goal. The first is using different serialization formats. In my discussions with customers, the two most common serialization formats are the columnar formats ORC and Parquet. The gist of these formats is that instead of requiring complete scans of entire files, columns of data are separated into stripes, and metadata describing each column’s stripe offsets is stored in a file header or footer. When a query is planned, requests can read in only the stripes that are relevant to that particular query. For more on different serialization formats and their relative performance, I highly suggest this analysis by our friends over at Silicon Valley Data Science. We have seen great performance with both Parquet and ORC when used in conjunction with a Ceph object store. Parquet tends to be slightly faster, while ORC tends to use slightly less disk space. This small delta might simply be the result of these formats using different compression algorithms by default (Snappy vs. ZLIB).

Speaking of compression, it’s really easy to think you’re using it when you are in fact not. Make sure to verify that your tables are actually being compressed. I suggest including the compression specification in table creation statements instead of hoping the engine you are using has the defaults configured the way you want.
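
To make that compression choice explicit rather than relying on engine defaults, declare it at table creation time. Here's a sketch in Hive SQL (the table name is illustrative; the ORC equivalent would use tblproperties ('orc.compress'='ZLIB')):

create table inventory_parquet
stored as parquet
tblproperties ('parquet.compression'='SNAPPY')
as select * from inventory;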

In addition to serialization formats, it’s important to consider how your tables are partitioned and how many files you have per partition. All S3 API calls are RESTful, which means they are heavier weight than HDFS RPC calls. Having fewer, larger partitions, with fewer files per partition, will definitely translate into higher throughput and reduced query latency. If you already have tables with loads of partitions and many files per partition, it might be worthwhile to consolidate them into larger partitions with fewer files each as you move them into object storage.

With data serialized and partitioned intelligently, queries can be much more efficient, but there is a third way you can help the query planner of your execution engine do its job better – table and column statistics. Table statistics can be collected with ANALYZE TABLE table COMPUTE STATISTICS statements, which count the number of rows for a particular table and its partitions. The row counts are stored in the Metastore and can be used by other engines that interrogate the Metastore during query planning.
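
Using the inventory table from earlier as an example, collecting both table and column statistics looks like this:

analyze table inventory compute statistics;
analyze table inventory compute statistics for columns;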

To the cloud!

Getting cloudy

Many modern enterprises have initiatives underway to modernize their IT infrastructures, and today that means moving workloads to cloud environments, whether they be public or private. On the surface, moving data platforms to a cloud environment shouldn’t be a difficult undertaking: Leverage cloud APIs to provision instances, and use those instances like their bare-metal brethren. For popular analytics workloads, this means running storage services in those instances that are specific to analytics and continuing a siloed approach to storage. This is the equivalent of lift and shift for data-intensive apps, a shortcut approach undertaken by some organizations when migrating an enterprise app to a cloud when they don’t have the luxury of adopting a more contemporary application architecture.

The following data platform principles pertain to moving legacy data platforms to either a public cloud or a private cloud. The private cloud storage platform discussed is Ceph, of course, a popular open-source storage platform used in building private clouds for a variety of data-intensive workloads, including MySQL DBaaS and Spark/Hadoop analytics-as-a-service.

Elasticity

Elasticity is one of the key benefits of cloud infrastructure, and running storage services inside your instances definitely cramps your ability to take advantage of it. For example, let’s say you have an analytics cluster consisting of 100 instances and the resident HDFS cluster has a utilization of 80 percent. Even though you could terminate 10 of those instances and still have sufficient storage space, you would need to rebalance the data, which is often undesirable. You’ll also be out of luck if, months later, you realize that you’re only using half the compute resources of that cluster. If the infrastructure teams make a new instance flavor available, say with fancy GPUs for your hungry machine-learning applications, it’ll be much harder to start consuming them if it entails the migration of storage services.

This is why companies like Netflix decided to use object storage as the source of truth for the analytics applications, as detailed in my previous post What about locality? It enables them to expand, and contract, workload-specific clusters as dictated by their resource requirements. Need a quick cluster with lots of nodes to chew through a one-time ETL? No problem. Need transient data labs for data scientists by day only to relinquish those resources for use for reporting after hours? Easy peasy.

Data infrastructure

Before departing on the journey to cloudify an organization’s data platform architecture, an important first step is assessing the capabilities of the cloud infrastructure, public or private, you intend to consume to make sure it provides the features that are most important to data-intensive applications. World-class data infrastructure provides its tenants with a number of fundamental building blocks that lend power and flexibility to the applications that will sit atop them.

Persistent block storage

Not all data is big, and it’s important to provide persistent block storage for data sets that are well served by database workhorses like MySQL and Postgres. An obvious example is the database used by the Hive metastore. With all these workload clusters being provisioned and deprovisioned, it’s often desirable to have them interact with a common metadata service. For more details about how persistent block storage fits into the dizzying array of architectural decisions facing database administrators, I suggest a read of our MySQL reference architecture.

I also suggest infrastructure teams learn how to collapse persistent block storage performance and spatial capacity into a single dimension, all while providing deterministic performance. For this, I recommend watching the session I gave with several of my colleagues at Red Hat Summit last year.

Local SSD

Sometimes we need to access data fast, really fast, and the best way to realize that is with locally attached SSDs. Most clouds make this possible with special instance flavors, modeled after the i3 instances provided by Amazon EC2. In OpenStack, the equivalent would be instances where the hypervisor uses PCIe passthrough for NVMe devices. These devices are best leveraged by applications that handle their own replication and fault tolerance, good examples being Cassandra and Clustered Elasticsearch. Fast local devices are also useful for scratch space for intermediate shuffle data that doesn’t fit in memory, or even S3A buffer files.

GPUs

Machine learning frameworks like TensorFlow, Torch, and Caffe can all benefit from GPU acceleration. With the burgeoning popularity of these frameworks, it’s important that infrastructure cater to them by providing instance flavors infused with GPU goodness. In OpenStack, this can be accomplished by passing through entire GPU devices in a similar fashion detailed in the Local SSD section, or by using GPU virtualization technologies like Intel GVT-g or NVIDIA GRID vGPU. OpenStack developers have been diligently integrating these technologies, and I’d recommend operations folks understand how to deploy them once these features mature.

Object storage

In both public and private clouds, deploying multiple analytics clusters backed by object storage is becoming increasingly popular. In the private cloud, a number of things are important to prepare a Ceph object store for data-intensive applications.

Bucket sharding

Bucket sharding was enabled by default with the advent of Red Hat Ceph Storage 3.0. This feature spreads a bucket’s indexes across multiple shards with corresponding RADOS objects. This is great for increasing the write throughput of a particular bucket, but it comes at the expense of LIST operations, because index entries are interleaved across shards and must be gathered from each shard before replying to a LIST request. Today, the S3A filesystem client performs many LIST requests, and as such it is advantageous to disable bucket sharding by setting rgw_override_bucket_index_max_shards to 1.
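
A sketch of that setting in ceph.conf on the gateway hosts (the client section name depends on how your gateways are named):

[client.rgw.gateway-node1]
rgw_override_bucket_index_max_shards = 1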

Bucket indexes on SSD

The Ceph object gateway uses distinct pools for objects and indexes, and as such those pools can be mapped to different device classes. Due to the S3A filesystem client’s heavy usage of LIST operations, it’s highly recommended that index pools be mapped to OSDs sitting on SSDs for fast access. In many cases, this can be achieved even on existing hardware by using the remaining space on devices that house OSD journals.
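
One way to express that mapping with CRUSH device classes, assuming the default zone's pool names and that your SSDs report the ssd device class (a sketch, not a drop-in recipe):

# ceph osd crush rule create-replicated rgw-index-ssd default host ssd
# ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd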

Erasure coding

Due to the immense storage requirements of data platforms, erasure coding for the data section of objects is a no-brainer. Compared to the 3x replication that’s common with HDFS, erasure coding reduces the required storage by 50%. When tens of petabytes are involved, that amounts to big savings! Most folks will probably end up using either 4+2 or 8+3 and spreading chunks across hosts using ruleset-failure-domain=host.
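
A sketch of creating a 4+2 profile with a host failure domain and applying it to the RGW data pool (the pool name assumes the default zone, and the placement-group count must be sized for your cluster):

# ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
# ceph osd pool create default.rgw.buckets.data 1024 1024 erasure ec-4-2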

Anatomy of the S3A filesystem client

Amazon introduced their Simple Storage Service (S3) in March 2006, which proved to be a watershed moment that ushered in the era of cloud computing services. It wasn’t long before folks started trying to use Amazon S3 in conjunction with Apache Hadoop; in fact, the first attempt was the S3 block filesystem, which was completed before the end of the year. This early integration stored data in a way that facilitated fast renames and deletes, but came at the expense of not being able to access data that had been written to S3 directly. Accessing data written by other applications was highly desirable, and by 2008 the S3N, or S3 Native Filesystem, was merged into Apache Hadoop. The following year, Amazon introduced Elastic MapReduce, which included Amazon’s own closed-source S3 filesystem client.

Netflix was an early adopter of Elastic MapReduce, which was used to analyze data to improve streaming quality. This workload was one of three detailed in Amazon’s press release announcing Netflix’s intent to migrate a variety of applications to Amazon. Fast forward to 2013 and “any data set worth retaining” was stored in S3. To the best of my knowledge, this was the first public reference of a multi-cluster data platform that used S3 for shared storage. With all the chips on the table, Netflix and other cloud heavyweights started seriously thinking about how to better leverage the S3 API and improve S3 client performance. This led to the development of the successor to S3N, named S3A. The development of the S3A filesystem client has manifested as a series of phases and is still seeing loads of active development from the likes of Netflix, Cloudera, Hortonworks, and a whole host of others. Each phase of development is tracked in the Hadoop JIRA.

Downstream Hadoop distributions have done a terrific job of ensuring that juicy features are expeditiously backported, so even if your vendor’s distribution is based on an older Hadoop version, it’s likely that it is ready to consume data stored in S3.

Working with Ceph

Back in the early days of Ceph, we quickly came to the realization that it would be useful to allow folks to use the S3 API to interact with their Ceph storage infrastructure. By doing so, we’d be able to leverage the might of the Amazon ecosystem, including all the SDKs and tools written to interact with Amazon S3. The Ceph object gateway was conceived for this application and is an essential ingredient in providing private infrastructure operators a means to extend cloud object storage modalities into their data center. The S3 API has an ever-expanding set of calls, and keeping up requires a lot of hard work. To keep us honest, we actively develop a functional test suite affectionately named s3-tests. The test suite is so useful that it’s even been adopted by other storage vendors to ensure their products can have the same high level of fidelity with the S3 API—How flattering!

So Ceph and S3 are like two high school buddies, and that’s great. But what’s required to ensure maximum fidelity with Amazon S3?

Endpoints

The Amazon S3 API provides a high-level container for objects which, in S3 parlance, is called a bucket. All objects are stored in exactly one bucket, and each bucket is mapped to a single Amazon S3 region. If you send a GET request for an object that lives in a bucket in another region, you’ll get a 302 redirect to the proper region. PUT requests sent to the wrong region will fail and be provided with the endpoint of the region the bucket calls home. If you have multiple Ceph clusters and want to emulate this behavior, you can do so by having each cluster be a distinct zone, in a distinct zonegroup, but with a common realm. For more details, refer to Ceph multi-site documentation.

If data engineers have a level of understanding about different endpoints available to them, then they can set fs.s3a.endpoint in a properties file such that it directs requests to the correct Ceph cluster or Amazon S3 region.

If you want to be helpful and know that an analytics cluster will only be talking to a single Ceph cluster, and not Amazon S3, then you might set fs.s3a.endpoint in the core-site.xml. If you plan on having multiple Ceph clusters, or communicating with a Ceph cluster and the public cloud, then you have a few options. One is to set a default fs.s3a.endpoint in the analytics clusters’ core-site.xml to either an Amazon S3 regional endpoint or a Ceph endpoint. Application owners can still override this default with a properties file.

With S3A from Apache Hadoop 2.8.1, users can define different S3A properties for different buckets. This is handy for applications that operate on data sets spread across multiple Ceph clusters or Amazon S3 regions.
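
For illustration, a core-site.xml fragment might look like the following; the endpoint hostnames and the bucket name ("landsat") are placeholders for your own environment:

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.example.com</value>
</property>
<!-- Per-bucket override (Hadoop 2.8.1+): requests for the "landsat" bucket go to Amazon S3 -->
<property>
  <name>fs.s3a.bucket.landsat.endpoint</name>
  <value>s3.us-east-1.amazonaws.com</value>
</property>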

Access control

Now that there is a level of understanding around endpoints, we can move on to authentication and access control. Amazon S3 has evolved over the years to provide more flexible control over who has access to what. Coarsely, the evolution can be broken down into multiple approaches.

  1. S3 ACLs
  2. Bucket policy with S3 users
  3. STS issued temporary credentials

The first approach is table stakes for any storage system that wants to allow applications to use the S3 API to interact with it. Each user has an access key and a secret key, which they use to sign requests. Buckets always belong to a single user, and buckets can be specified as either public or private. Public buckets are the only means of sharing data between users, which is pretty inflexible. Ceph has supported basic S3 ACLs since the Ceph object gateway was introduced, and it’s not uncommon to see this be the only means of providing access control with other systems that tout an S3-compatible API. These access and secret keys can be mapped to the fs.s3a.access.key and fs.s3a.secret.key parameters in core-site.xml, in a properties file submitted with an application, or the most secure option: storing them in encrypted files using the Hadoop Credentials API.
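
As a sketch of the credential-provider route, the commands below stash the keys in a local JCEKS keystore; the paths and key values are placeholders:

# Store the keys in an encrypted JCEKS keystore instead of plain-text XML
hadoop credential create fs.s3a.access.key -value AKIAEXAMPLEKEY \
  -provider jceks://file/home/hadoop/.s3a.jceks
hadoop credential create fs.s3a.secret.key -value exampleSecretKey \
  -provider jceks://file/home/hadoop/.s3a.jceks
# Then point Hadoop at the keystore, e.g. in core-site.xml:
#   hadoop.security.credential.provider.path = jceks://file/home/hadoop/.s3a.jceks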

Ceph doesn’t stop there. With Red Hat Ceph Storage 3.0, based on upstream Luminous, we added support for bucket policy, which allows cross-user bucket sharing. This means one user’s private bucket can be shared with another user through a bucket policy.
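
For example, a minimal bucket policy granting a second user read access might look like the following (the bucket name "shared-data" and the user "analyst" are hypothetical), applied with s3cmd setpolicy policy.json s3://shared-data:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/analyst"]},
    "Action": ["s3:ListBucket", "s3:GetObject"],
    "Resource": ["arn:aws:s3:::shared-data", "arn:aws:s3:::shared-data/*"]
  }]
}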

This is all good and well, but sometimes you want to manage access control for groups of users, instead of having to update a bunch of bucket policies whenever a user needs to be removed from a group. Amazon S3 doesn’t yet have its own notion of groups. Instead, Amazon has IAM, which is a means of enforcing role-based access control. IAM supports users, groups, and roles. This means you can create policies for a group, instead of having policy entries for each individual user. Unfortunately, IAM groups cannot be principals in bucket policies. IAM roles are similar to a user, but they do not have credentials. IAM roles delegate to some other authentication provider, and that authentication provider decides if the request is allowed to assume a particular role.

This is where Amazon STS comes in. STS issues temporary authentication tokens, which are mapped to IAM roles, and that mapping is attested by an external identity provider. As such, the external credentials provider can attest to whether a particular user can receive a token that allows that user to assume an IAM role. The S3A filesystem client supports the use of these tokens by changing the S3A credentials provider and setting the fs.s3a.session.token parameter in addition to fs.s3a.access.key and fs.s3a.secret.key. The upstream community is currently working on STS and IAM support in the Ceph object gateway, which will bring all this goodness to folks interacting with Ceph object stores. If this is important to your organization, we’re interested in getting feedback that helps us prioritize STS actions like AssumeRoleWithSAML and AssumeRoleWithWebIdentity.
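
A hedged sketch of the client-side configuration for temporary credentials (the values are placeholders for whatever STS hands back):

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SESSION_TOKEN</value>
</property>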

Bucket prefix vs. path

The AWS Java SDK for S3 defaults to using bucket prefix notation when sending requests and, by extension, so does the S3A filesystem client. If your Ceph object gateway endpoint is s3.example.com, then requests are sent to bucket.s3.example.com. To ensure your Ceph storage infrastructure behaves the same way as Amazon S3, you’ll need to configure a few things on the infrastructure side. The first is including the rgw_dns_name parameter in the [rgw] or [global] block of your ceph.conf configuration file. The value of this parameter in this scenario would be s3.example.com. Now, in order for the client to resolve the bucket subdomain, you’ll also need a wildcard DNS record in the form of *.s3.example.com that resolves to your gateway virtual IP address. To support SSL with bucket prefix notation, you’ll need to use a certificate with a wildcard subject alternative name (SAN) wherever it is being used to terminate TLS.

If you’re not on the infrastructure side of things, and you want to consume Ceph object infrastructure where this machinery hasn’t been configured, then you can opt for path style access by setting fs.s3a.path.style.access to true. In this configuration, requests will be sent to s3.example.com/bucket instead of bucket.s3.example.com.
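
In core-site.xml or a properties file, that looks like:

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>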

There is a third trick, which might be helpful in unusual scenarios, and that’s to create a bucket with uppercase characters. Because bucket names containing uppercase characters aren’t valid DNS-style hostnames, the SDK automatically switches over to path style access.

Encryption

We talked a bit about SSL/TLS in the previous section, but only insofar as how to configure things on the infrastructure side. The S3A filesystem client defaults to using SSL/TLS. Changing this behavior is done by switching the fs.s3a.connection.ssl.enabled parameter to false. This is useful in scenarios like testing, or when SSL isn’t yet configured for your Ceph storage infrastructure. In a production environment, the adage “dance like nobody’s watching, and encrypt like everyone is” still applies. Check out this blog post for a thorough walkthrough on configuring SSL/TLS for your Ceph object storage system.

In addition to transport encryption, objects can be encrypted. Amazon S3 provides two categories of object encryption: (1) encryption at the client and (2) server-side encryption. The S3A filesystem client does not support using client-side encryption, but it does support all three varieties of server-side encryption. The three server-side encryption options are:

  • SSE-S3: Keys managed internally by Amazon S3
  • SSE-C: Keys managed by the client and passed to Amazon S3 to encrypt/decrypt requests
  • SSE-KMS: Keys managed through Amazon’s Key Management Service (KMS)

With the advent of  Red Hat Ceph Storage 3.0, we support the SSE-C flavor of server-side encryption. Configuring the S3A filesystem client to use it is a simple affair, involving only two parameters: (1) fs.s3a.server-side-encryption-algorithm should be set to SSE-C and (2) the value of fs.s3a.server-side-encryption.key should be set to your secret key. I suspect most folks will want to do this by way of a properties file.
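
For example, in core-site.xml or a per-job properties file (the key value here is a placeholder for your base64-encoded 256-bit key):

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-C</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>BASE64_ENCODED_256_BIT_KEY</value>
</property>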

Upstream Ceph also has support for SSE-KMS, which is intended to be integrated with OpenStack Barbican for key management. I imagine it’s only a matter of time before this is also supported in Red Hat Ceph Storage.

Fast upload

There are two main ways of performing writes with the S3A filesystem client. The default mode buffers uploads to disk before sending a request to Amazon S3; it’s important to make sure that the directory you are buffering to is large and fast. I prefer to specify a directory backed by a local SSD. Even with fast media, this can be expensive from an IO perspective, so an alternative is to use in-memory buffers. The S3A filesystem client offers two in-memory buffer options: (1) one using arrays and (2) one using byte buffers. You have to be careful when using memory buffers, because it’s easy to run out if you don’t properly size both your JVM and YARN containers. If the available memory permits you to use in-memory buffers, they can be much faster than their disk-based brethren.
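
As a sketch, the relevant properties look like this; the buffer directory path is a placeholder, and you’d swap "disk" for "array" or "bytebuffer" to buffer in memory:

<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<!-- "disk" buffers to local storage; "array" and "bytebuffer" buffer in memory -->
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/mnt/nvme0/s3a</value>
</property>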

Giving it a try

There’s an easy way to learn how to use the S3A filesystem client with Ceph, and this section will walk you through setting up a development environment using Minishift. If you’ve never used Minishift before, it’s relatively painless to set up. The guides for installation on Mac OSX, Windows, and Linux can be found here.

Once you’ve installed Minishift on your local system, drop into a terminal and fire it up.

minishift start

Now that we have a Minishift environment running, we’re going to get Ceph Nano running so we can use it as an S3A endpoint. To get started, you’ll need to download a couple of YAML files: ceph-nano.yml and ceph-rgw-keys.yml. Once those files are downloaded to your working directory, we can use the oc command line utility to deploy Ceph Nano:

oc --as system:admin adm policy add-scc-to-user anyuid \
  system:serviceaccount:myproject:default
oc create -f ceph-rgw-keys.yml
oc create -f ceph-nano.yml
oc expose pod ceph-nano-0 --type=NodePort

You should have a Ceph Nano service running at this point and may be wondering how you’re going to interact with it. Jupyter Notebooks have become wildly popular in the analytics and data-science communities because of their ability to create a single artifact, the notebook, which contains both code and documentation. As if that wasn’t enough, you can interact with them from the comfort of your web browser. Here at Red Hat, we’ve been fostering an awesome community called radanalytics.io. They’re hard at work empowering intelligent application development on OpenShift. They’ve provided a base notebook application that I used as a starting point for playing with S3A from Jupyter, because its container image is neatly bundled with Spark and PySpark libraries. You can get Jupyter up and running in your Minishift environment with just a few commands:

oc new-app https://github.com/mmgaggle/ceph-notebook \
  -e JUPYTER_NOTEBOOK_PASSWORD=developer \
  -e RGW_API_ENDPOINT=$(minishift openshift service ceph-nano-0 --url) \
  -e JUPYTER_NOTEBOOK_X_INCLUDE=http://mmgaggle-bd.s3.amazonaws.com/ceph-notebook.ipynb

oc env --from=secret/ceph-rgw-keys dc/ceph-notebook
oc expose svc/ceph-notebook
oc status

The “oc status” command will provide a URL of the form http://ceph-notebook-myproject.$(IP).nip.io. Load it into your browser and follow along from there!

HTTPS-ization of Ceph object storage public endpoint

Hypertext Transfer Protocol Secure (HTTPS) is the secure version of HTTP; it encrypts communication between the user and the server. HTTPS prevents man-in-the-middle attacks by relying on the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols to establish an encrypted connection that shuttles data securely between a client and a server.

This blog post takes you step by step through the process of adding SSL/TLS security to Ceph object storage endpoints. Ceph is a scalable, open-source object storage solution that provides Amazon S3 and SWIFT compatible APIs you can use to build your own public or private cloud object storage solution. Learn more about Ceph at Red Hat Ceph Storage.

Setting up domain record sets

Let’s begin by setting up a domain name and configuring its record sets. If you already have a domain name you’d like to use for the Ceph endpoint, great. If not, you can purchase one from any domain registrar. In this example, we’ll use the domain name ceph-s3.com.

First, we need to add record sets to the domain such that the domain name can resolve into the IP address of the Ceph RADOS Gateway or the load balancer (LB) that is fronting your Ceph RADOS Gateways. To do this, you need to log in to your domain registrar dashboard and create two record sets:

  1. An A-type record set mapping the domain name to the IPv4 address of your Internet-accessible LB or Ceph RGW host. (If you’re using IPv6, choose an AAAA-type record.)
  2. A CNAME-type record set mapping the wildcard subdomain name (e.g., *.ceph-s3.com) to the base domain, so that all subdomains resolve to the same Internet-accessible LB or Ceph RGW host.

Note: The wildcard subdomain record is important: we want all subdomains (e.g., bucket.ceph-s3.com) to resolve to the same IP address, because S3 treats the subdomain prefix as the bucket name.

Domain record set additions are highlighted in the following screenshot:

DNS changes usually take a few tens of minutes to propagate. Once they have, we should be able to ping the domain name, as well as any subdomain, from anywhere on the Internet, provided ICMP ingress traffic is allowed on the host.

karasing-OSX:~$ ping ceph-s3.com
PING ceph-s3.com (54.209.4.196): 56 data bytes
64 bytes from 54.209.4.196: icmp_seq=0 ttl=40 time=1095.231 ms
64 bytes from 54.209.4.196: icmp_seq=1 ttl=40 time=1099.570 ms
64 bytes from 54.209.4.196: icmp_seq=2 ttl=40 time=1199.266 ms
^C
--- ceph-s3.com ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1095.231/1131.356/1199.266/48.053 ms
karasing-OSX:~$
karasing-OSX:~$ ping anynameintheuniverse.ceph-s3.com
PING 54.209.4.196 (54.209.4.196): 56 data bytes
64 bytes from 54.209.4.196: icmp_seq=0 ttl=40 time=1491.105 ms
64 bytes from 54.209.4.196: icmp_seq=1 ttl=40 time=1262.021 ms
64 bytes from 54.209.4.196: icmp_seq=2 ttl=40 time=1205.943 ms
^C
--- 54.209.4.196 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1205.943/1319.690/1491.105/123.352 ms
karasing-OSX:~$

SSL certificate: Installation and setup

An SSL certificate is a set of files that binds an organization’s identity and domain name to its cryptographic keys. Once the SSL certificate is installed on the host server, a secure connection can be established between the host server and the client machine. In a client’s Internet browser, a green padlock will appear next to the URL as a visual cue to users that traffic is protected.

Some organizations might desire a more advanced certificate that requires additional validation; these SSL certificates must be purchased from a trusted Certificate Authority (CA). For the sake of demonstration in this example, we’ll use Let’s Encrypt (https://en.wikipedia.org/wiki/Let’s_Encrypt), a certificate authority that provides free X.509 certificates for Transport Layer Security (TLS) encryption. Thanks to Let’s Encrypt for the free service!

Next, we’ll install epel-release and certbot, a CLI tool for requesting SSL certificates from the Let’s Encrypt CA.

yum install -y http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y certbot

Request an SSL certificate using the certbot CLI client:

certbot certonly --manual -d ceph-s3.com -d *.ceph-s3.com --agree-tos \
  --manual-public-ip-logging-ok --preferred-challenges dns-01 \
  --server https://acme-v02.api.letsencrypt.org/directory

Note: The first -d option in the certbot CLI represents the base domain; the subsequent -d options represent subdomains.

Important: Make sure you use a wildcard (*) for the subdomain, because we are requesting a wildcard subdomain SSL certificate from Let’s Encrypt. If a certificate is issued only for the base domain, it will not work with subdomain (bucket prefix) notation.

The following snippet shows the output of the certbot CLI command. The DNS challenge method generates two DNS TXT records, which must be added as a TXT record set for your domain (from the domain registrar dashboard):

[root@ceph-admin ~]# certbot certonly --manual -d ceph-s3.com -d *.ceph-s3.com --agree-tos
--manual-public-ip-logging-ok --preferred-challenges dns-01 --server
https://acme-v02.api.letsencrypt.org/directory
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for ceph-s3.com
dns-01 challenge for ceph-s3.com

-------------------------------------------------------------------------------
Please deploy a DNS TXT record under the name
_acme-challenge.ceph-s3.com with the following value:

BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4   <== This Value

Before continuing, verify the record is deployed.

-------------------------------------------------------------------------------
Press Enter to continue

-------------------------------------------------------------------------------
Please deploy a DNS TXT record under the name
_acme-challenge.ceph-s3.com with the following value:

O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ    <== This Value

Before continuing, verify the record is deployed.

-------------------------------------------------------------------------------

Keep the certbot CLI command running, and open your domain registrar dashboard. Deploy a new DNS TXT record under the name _acme-challenge.yourdomain.com, and enter both of these values (marked with <== in the preceding output) on two separate lines, each inside double quotes. The following screenshot shows how:

Keep the certbot CLI command running and, once you’ve successfully added the DNS TXT records to your domain record set, open another terminal. Now we’ll verify that the DNS TXT records have been applied to your domain by running the command host -t txt _acme-challenge.yourdomain.com. You should see the same TXT values shown by the certbot CLI. If you don’t, wait a few minutes for DNS synchronization to occur.

karasing-OSX:~$ host -t txt _acme-challenge.ceph-s3.com
_acme-challenge.ceph-s3.com descriptive text
"BzL-LXXkDWwdde8RFUnbQ3fdYt5N6ZXELu4T26KIXa4"
_acme-challenge.ceph-s3.com descriptive text
"O-_g-eeu4cSI0xXSdrw3OBrWVgzZXJC59Xjkhyk39MQ"
karasing-OSX:~$

Once you’ve verified DNS TXT records are applied to your domain, return to certbot CLI and press Enter to continue. You will be notified that the system is waiting for verification and cleaning up challenges. You should then see the following message:

Congratulations! Your certificate and chain have been saved at:
/etc/letsencrypt/live/ceph-s3.com/fullchain.pem
Your key file has been saved at:
/etc/letsencrypt/live/ceph-s3.com/privkey.pem
Your cert will expire on 2018-10-07. To obtain a new or tweaked
version of this certificate in the future, simply run certbot
again. To non-interactively renew *all* of your certificates, run
"certbot renew"

If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
Donating to EFF:                    https://eff.org/donate-le

[root@ceph-admin ~]#

At this point, SSL certificates have been issued to your domain/host, so now we must verify them.

[root@ceph-admin ~]# ls -l /etc/letsencrypt/live/ceph-s3.com/
total 4
lrwxrwxrwx 1 root root  35 Jul 9 19:06 cert.pem -> ../../archive/ceph-s3.com/cert1.pem
lrwxrwxrwx 1 root root  36 Jul 9 19:06 chain.pem -> ../../archive/ceph-s3.com/chain1.pem
lrwxrwxrwx 1 root root  40 Jul 9 19:06 fullchain.pem -> ../../archive/ceph-s3.com/fullchain1.pem
lrwxrwxrwx 1 root root  38 Jul 9 19:06 privkey.pem -> ../../archive/ceph-s3.com/privkey1.pem
-rw-r--r-- 1 root root 682 Jul  9 19:06 README
[root@ceph-admin ~]#

Installing LB for object storage service

Many people like HAProxy because it’s easy to use and does its job well. In this example, we’ll use HAProxy to perform SSL termination for our domain name (the Ceph object storage endpoint).

Note: Starting with Red Hat Ceph Storage 2, the Ceph RADOS Gateway natively supports TLS by relying on the OpenSSL library. You can get more information on native SSL/TLS configuration here. In this example, we specifically chose to terminate SSL at the HAProxy level. The advantage is that when we run multiple Ceph RGW instances, we don’t need separate domain names or SSL certificates for each of them; one domain name with SSL termination at the LB does the job.

Next, we’ll install HAProxy on the same host whose public IP has been bound to your domain name, in this case, ceph-s3.com.

yum install -y haproxy

Next, we’ll create a certs directory.

mkdir -p /etc/haproxy/certs

Next, combine the certificate files fullchain.pem and privkey.pem into a single file for our domain.

DOMAIN='ceph-s3.com' sudo -E bash -c 'cat /etc/letsencrypt/live/$DOMAIN/fullchain.pem
/etc/letsencrypt/live/$DOMAIN/privkey.pem > /etc/haproxy/certs/$DOMAIN.pem'

The next step is to change the permission of the certs directory.

chmod -R go-rwx /etc/haproxy/certs

Optionally, you can move the original haproxy.cfg file aside and create a new config file with the following configuration settings:

mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.orig
vim /etc/haproxy/haproxy.cfg
global
    log        127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    tune.ssl.default-dh-param 2048     
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                    global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    option httpchk HEAD /
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
frontend www-http
   bind 10.100.0.10:80
   reqidel                      ^X-Forwarded-For:.*
   reqadd X-Forwarded-Proto:\ http
   default_backend www-backend
   option  forwardfor
frontend www-https
   bind 10.100.0.10:443 ssl crt /etc/haproxy/certs/ceph-s3.com.pem
   reqadd X-Forwarded-Proto:\ https
   acl letsencrypt-acl path_beg /.well-known/acme-challenge/
   use_backend letsencrypt-backend if letsencrypt-acl
   default_backend www-backend
backend www-backend
   redirect scheme https if !{ ssl_fc }
   server ceph-admin 10.100.0.10:8081 check inter 2000 rise 2 fall 5
backend letsencrypt-backend
   server letsencrypt 127.0.0.1:54321

As you may have noted, we’re using several non-default parameters in the HAProxy config file:

  • tune.ssl.default-dh-param is required to provide OpenSSL the necessary parameters for the SSL/TLS handshake.
  • frontend www-http binds HAProxy to port 80 of the local machine and sends traffic to the default backend www-backend, which redirects plain HTTP requests to HTTPS.
  • frontend www-https binds HAProxy to port 443 of the local machine. It also sends traffic to the default backend www-backend and uses the SSL certificate to encrypt/terminate the traffic.
  • frontend www-https also uses letsencrypt-backend if you want to auto-renew the SSL certificate from Let’s Encrypt CA.
  • backend www-backend simply redirects all the SSL terminated traffic to Ceph RGW node 10.100.0.10:8081. For HA and performance, you must have multiple Ceph RGW instances whose IPs should be added in the same backend section so that HAProxy can load balance among Ceph RGW instances.

Finally, we start HAProxy and verify it’s listening on ports 80 and 443:

systemctl start haproxy
systemctl status haproxy ; netstat -plunt | grep -i haproxy

Configuring Ceph RGW

Up to this point, we’ve configured the domain name, set the required record sets, generated the SSL certificate, configured HAProxy to encrypt/terminate SSL, and redirected traffic to the Ceph RGW instance. Now we need to configure Ceph RGW to listen on port 8081, as configured in HAProxy. To do so, on the Ceph RGW node, edit /etc/ceph/ceph.conf and update the client.rgw section as follows:

[client.rgw.ceph-admin]
host = ceph-admin
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-admin/keyring
log file = /var/log/ceph/ceph-rgw-ceph-admin.log
rgw frontends = civetweb port=10.100.0.10:8081 num_threads=512
rgw resolve cname = true
rgw dns name = ceph-s3.com

Note: rgw resolve cname = true forces rgw to use the DNS CNAME record of the request hostname field (if the hostname is not equal to rgw dns name).

Note: rgw dns name = ceph-s3.com is the DNS name of the served domain.

Now, we’ll restart the Ceph RGW instance and verify it’s listening on port 8081.

systemctl restart ceph-radosgw@rgw.ceph-admin.service
netstat -plunt | grep -i rados

Accessing Ceph object storage secure endpoint

To test the HTTPS-enabled Ceph object storage URL, execute the following curl command or type https://ceph-s3.com in any web browser:

curl https://ceph-s3.com

It should yield output like the following:

[student@ceph-admin ~]$ curl https://ceph-s3.com
anonymous
[student@ceph-admin ~]$
[student@ceph-admin ~]$

Let’s try accessing Ceph object storage using S3cmd:

yum install -y s3cmd

Configure the s3cmd CLI by providing configuration options such as the access/secret keys and the Ceph S3 secure endpoint in the host/host-bucket parameters.

s3cmd --access_key=S3user1 --secret_key=S3user1key --host=ceph-s3.com
--host-bucket="%(bucket)s.ceph-s3.com" --dump-config > /home/student/.s3cfg

Note: By default, s3cmd uses an HTTPS connection, so there is no need to specify that explicitly.

Next, we’ll interact with Ceph object storage using the s3cmd ls and s3cmd mb commands:

[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
[student@ceph-admin ~]$
[student@ceph-admin ~]$ s3cmd mb s3://secure_bucket
Bucket 's3://secure_bucket/' created
[student@ceph-admin ~]$ s3cmd ls
2018-07-09 19:13  s3://container-1
2018-07-09 19:13  s3://public_bucket
2018-07-09 19:55  s3://secure_bucket
[student@ceph-admin ~]$

Congratulations! You’ve successfully secured your Ceph object storage endpoint using the domain name of your choice and SSL certificates.

Closing

As you can see, acquiring and setting up SSL certificates involves some careful configuration, and how easy and fast it is depends on your chosen CA. With initiatives like “HTTPS Everywhere,” it’s no longer just web sites hosting deliverable content that must have SSL; API and service endpoints should also offer encrypted transport.

Note: HTTPS is designed to prevent eavesdropping and man-in-the-middle attacks. Always practice defense in depth: multiple layers of security are needed to more fully secure your web site and service endpoints.

Why are customers choosing Red Hat’s Container-Native Storage in the public cloud with OpenShift?

By Sayandeb Saha, Director, Product Management, Storage Business Unit

In our last blog post in this series, we talked about how the Container-Native Storage (CNS) offering for OpenShift Container Platform from Red Hat has seen increased customer adoption in on-premise environments by offering a peaceful coexistence approach with classic storage arrays that are not deeply integrated with OpenShift. In this post, we’ll explore why many customers are deploying our CNS offering in the three big public clouds—AWS, Microsoft Azure, and the Google Cloud Platform—on top of native public cloud offerings from the public clouds—despite good integration of Kubernetes with native storage offerings in the cloud. Let’s examine some of these problems and constraints in a bit more detail and describe how CNS addresses them.

Slow attach/detach, poor availability

The first issue stems from the fact that the native block storage offerings (EBS in AWS, Data Disk in Azure, Persistent Disk in Google Cloud) in the public cloud were designed and engineered to support virtual machine (VM) workloads. In such workloads, attaching and consequently detaching a block device to a machine image/instance is an infrequent occurrence at best, as these workloads are less dynamic compared to Platform-as-a-Service (PaaS) and DevOps workloads, which frequently run on OpenShift powering dynamic build and deploy CI/CD pipelines and other similar workloads and workflows.

Some of our customers found that attach and detach times for these block devices, when accessed directly from OpenShift workloads using the native Kubernetes storage provisioners, are unacceptable, because they lead to poor startup times for pods (slow attach) and limited or no high availability on a failover, which usually triggers a sequence that includes a detach operation, an attach operation, and a subsequent mount operation.

Each of these operations usually triggers a variety of API calls specific to the public cloud provider. Any or all of these intermediate steps can fail, causing users to lose access to their persistent volumes (PVs) for an extended period. Overlaying Red Hat’s CNS offering as a storage management fabric, which aggregates, pools, and serves out PVs expediently without worrying about the status of individual cloud-native block storage devices (e.g., EBS or Azure Data Disk), can provide major relief: it isolates the lifecycle of the cloud-native block devices from that of the application pods that dynamically allocate and deallocate PVs as application teams work on OpenShift. This isolation effectively addresses the issue.

Block device limits per compute instance

The second issue some of our customers run into is the fact that there is a limit to the number of block devices that one can attach to the machine images or instances in various public cloud environments.

OpenShift supports a maximum of 250 containers per host. The maximum number of block devices that can be attached to a machine instance is far lower (for example, a maximum of 40 EBS devices per EC2 instance). Even though it is unusual to have a 1:1 mapping between containers and storage devices, this low maximum can lead to a lot of unintended behavior, notwithstanding the fact that it leads to a higher total cost of ownership (you need more hosts than necessary).

For example, in a failover scenario during the detach, attach, and mount sequence, the attach API call might fail because the maximum number of devices is already attached to the EC2 instance where the attempt is being made, causing a glitch or outage. Overlaying Red Hat’s CNS offering as a storage management fabric on top of cloud-based block devices mitigates the impact of this per-instance device limit, because storage is served out of a pool that is unencumbered by any individual host’s maximum. Storage can continue to be served out until the entire pool is exhausted, at which point the pool can be expanded by adding new hosts and devices.

Cross-AZ storage availability

The third issue arises from the fact that cloud block storage devices are usually accessible within a specific Availability Zone (AZ) in AWS or Availability Sets in Azure. AZs are like failure domains in public clouds.

Most customers who deploy OpenShift in the public cloud span more than one AZ for high availability, so that when one AZ goes offline, the OpenShift cluster remains operational. Using block devices constrained to an AZ to provide storage services to OpenShift workloads can defeat that purpose, because containers must then be scheduled on hosts in the same AZ, and customers cannot leverage the full power of Kubernetes orchestration. This configuration could also lead to an outage when an AZ goes offline.

Our customers use CNS to mitigate this problem so that even when there is an AZ failure, a three-way replicated cross-AZ storage service (CNS) is available for containerized applications to avoid downtimes. This also enables Kubernetes to schedule pods across AZs (instead of within an AZ), thereby preserving the spirit of the original fault-tolerant OpenShift deployment architecture that spans multiple AZs.

Cost-effective storage consolidation

Storage provided by CNS is efficiently allocated and offers performance with the first gigabyte provisioned, thereby enabling storage consolidation. For example, consider six MySQL database instances, each in need of 25 GiB of storage capacity and up to 1500 IOPS at peak load. With EBS in AWS, one would create six EBS volumes, each with at least 500 GiB capacity out of the gp2 (General Purpose SSD) EBS tier, in order to get 1500 IOPS. The level of performance is tied to provisioned capacity with EBS.

With CNS, one can achieve the same level of service using only three 500 GiB gp2 EBS volumes backing GlusterFS. On top of that pool, one would create six 25 GiB volumes and provide storage to many databases with high IOPS performance, provided they don’t all peak at the same time. Doing so, one would halve the EBS cost and still have capacity to spare for other services. Read IOPS performance is likely even higher, because with CNS three-way replication, reads are distributed across three 1,500 IOPS gp2 EBS volumes.
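
A back-of-the-envelope comparison, using gp2’s 3 IOPS per provisioned GiB and ignoring burst credits, looks roughly like this:

Native EBS:  6 x 500 GiB gp2 = 3,000 GiB provisioned; each volume gets 3 x 500 = 1,500 IOPS
CNS on EBS:  3 x 500 GiB gp2 = 1,500 GiB provisioned (half the EBS spend)
             Six 25 GiB three-way replicated volumes consume 6 x 25 x 3 = 450 GiB,
             leaving roughly 1,050 GiB of raw capacity for other services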

Check us out for more

As you can see, there’s a good case to be made for using CNS in various public clouds for a multitude of technical reasons our customers care about, besides the fact that Red Hat CNS provides a consistent storage consumption and management experience across hybrid and multi clouds (see the following figure).

 

Red Hat CNS runs anywhere and everywhere Red Hat OpenShift Container Platform runs.

In addition to the application portability that OpenShift already provides across hybrid and multi clouds, we’re working on multi cloud replication features that would enable CNS to effectively become the data fabric that enables data portability—another good reason to select and stay with CNS. Stay tuned for more information on that!

For hands-on experience now combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both.

What about locality?

This is the first post of a multi-part series of technical blog posts on Spark on Ceph:

  1. What about locality?
  2. Anatomy of the S3A filesystem client
  3. To the cloud!
  4. Storing tables in Ceph object storage
  5. Comparing with HDFS—TestDFSIO
  6. Comparing with remote HDFS—Hive Testbench (SparkSQL)
  7. Comparing with local HDFS—Hive Testbench (SparkSQL)
  8. Comparing with remote HDFS—Hive Testbench (Impala)
  9. Interactive speedup
  10. AI and machine learning workloads
  11. The write firehose

Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:

“Will performance suck because the benefits of locality are lost?”

It’s not surprising; we’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze its advantages and disadvantages.

Historical context

Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.

Cluster of 1800 machines, [each with]:

  • 4GB of memory
  • Dual-processor 2 GHz Xeons with hyperthreading
  • Dual 160GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth of 100 Gb/s

If we draw up a wireframe with the speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of a data transfer rate of 50 MB/s. To determine the available bisection bandwidth per host, we’ll divide the cluster-wide bisection bandwidth by the number of hosts.

The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained by contemporary Hadoop nodes with 12 SATA disks and a 10GbE network interface. After we leave the host and arrive at the network bisection, the challenge facing Google engineers becomes immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone led to the development of the MapReduce locality optimization.
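
Using the published specifications and the generous 50 MB/s per-disk assumption, the back-of-the-envelope numbers work out roughly as follows:

Per-host disks:      2 x 50 MB/s = 100 MB/s (~0.8 Gb/s, roughly one GigE link)
Aggregate host NICs: 1,800 hosts x 1 Gb/s = 1,800 Gb/s
Bisection bandwidth: 100 Gb/s  ->  1,800 / 100 = 18:1 oversubscription
Per-host share:      100 Gb/s / 1,800 hosts ~= 55 Mb/s (~7 MB/s)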

Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.

While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.

Advantages

When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.

Disadvantages

Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.

One key example is large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that, even with these abilities, it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.

Each cluster typically brings vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components for which an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload-dedicated clusters with isolated compute and storage resources.

In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.

In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.

Closing

Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these modalities has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.