How to back up and recover Red Hat OpenShift Container Storage

By Annette Clewett and Luis Rico

The snapshot capability in Kubernetes is in tech preview at present and, as such, backup/recovery solution providers have not yet developed an end-to-end Kubernetes volume backup solution. Fortunately, GlusterFS, an underlying technology behind Red Hat OpenShift Container Storage (RHOCS), does have a mature snapshot capability. When combined with enterprise-grade backup and recovery software, a robust solution can be provided.

This blog post details how backup and restore can be done when using RHOCS via GlusterFS. As of the Red Hat OpenShift Container Platform (OCP) 3.11 release, only a limited number of storage technologies (EBS, GCE Persistent Disk, and hostPath) support creating and restoring application data snapshots via Kubernetes snapshots. This Kubernetes snapshot feature is in tech preview, and the implementation is expected to change in concert with upcoming Container Storage Interface (CSI) changes. CSI, a universal storage interface (effectively an API) between container orchestrators and storage providers, is ultimately where backup and restore for OCS will be integrated in the future using the volume snapshot capability.

Traditionally, backup and restore operations involve two different layers. One is the application layer. For example, databases like PostgreSQL have their own procedures to do an application consistent backup. The other is the storage layer. Most storage platforms provide a way for backup software like Commvault or Veritas NetBackup to integrate, obtain storage level snapshots, and perform backups and restores accordingly. An application layer backup is driven by application developers and is application specific. This study will focus on traditional storage layer backup and restore using Commvault Complete™ Backup and Recovery Software for this purpose. Other backup software tools can be used in a similar manner if they supply the same capabilities as used with Commvault.

RHOCS can be deployed in either converged mode or independent mode, and both are supported by the process described in this article. Converged mode, formerly known as Container Native Storage (CNS), means that Red Hat Gluster Storage is deployed in containers and uses the OCP host storage and networking. Independent mode, formerly known as Container Ready Storage (CRS), is deployed as a stand-alone Red Hat Gluster Storage cluster that provides persistent storage to OCP containers. Both modes of RHOCS deployment use heketi in a container on OCP for provisioning and managing GlusterFS volumes.

Storage-level backup and restore for RHOCS

If a backup is performed at the Persistent Volume (PV) level, then it will not capture the OCP Persistent Volume Claim (PVC) information. OCP PVC to PV mapping is required to identify which backups belong to which application. This leaves a gap: Which GlusterFS PV goes with which OCP PVC?

In a traditional environment, this is solved by naming physical volumes such that the administrator has a way of identifying which volumes belong to which application. This naming method can now be used in OCP (as of OCP 3.9) by using custom volume naming in the StorageClass resource. Before OCP 3.9, the names of the dynamically provisioned GlusterFS volumes were auto-generated with random vol_UUID naming. Now, by adding a custom volume name prefix in the StorageClass, the GlusterFS volume name will include the OCP namespace or project as well as the PVC name, thereby making it possible to map the volume to a particular workload.

OCS custom volume naming

Custom volume naming requires a change to the StorageClass definition. Any new RHOCS persistent volumes claimed using this StorageClass will be created with a custom volume name. The custom volume name is composed of the prefix, the project or namespace, the PVC name, and a UUID (<myPrefix>_<namespace>_<claimname>_UUID).

The following glusterfs-storage StorageClass has custom volume naming enabled by adding the volumenameprefix parameter.

# oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
parameters:
  resturl: http://heketi-storage-app-storage.apps.ocpgluster.com
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: app-storage
  volumenameprefix: gf ❶
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

❶ Custom volume name support: <volumenameprefixstring>_<namespace>_<claimname>_UUID

As an example, using this StorageClass for a namespace of mysql1 and PVC name of mysql the volume name would be gf_mysql1_mysql_043e08fc-f728-11e8-8cfd-028a65460540 (the UUID portion of the name will be unique for each volume name).

Note: If custom volume naming cannot be used, then it is important to collect information about all workloads using PVCs, their associated OCP PVs, and the GlusterFS volume names (contained in the Path field of the OCP PV description).
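One way to gather that mapping, assuming a logged-in oc session with permission to read PersistentVolumes, is a sketch like the following:

# List each GlusterFS-backed PV with its claiming namespace/PVC and the
# underlying GlusterFS volume name (the Path field of the PV spec)
oc get pv -o custom-columns='PV:.metadata.name,NAMESPACE:.spec.claimRef.namespace,PVC:.spec.claimRef.name,GLUSTER_VOLUME:.spec.glusterfs.path'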

RHOCS backup process

The goal of this blog post is to provide a generic method to back up and restore OCS persistent volumes used with OCP workloads. The example scripts and .ini files have no specific dependency on a particular backup and restore product. As such, they can be used with a product such as Commvault, where the scripts can be embedded in the backup configuration. Or, they can be used standalone, assuming that basic backup/recovery of the mounted gluster snapshots will be done via standard RHEL commands.

Note: The methods described here apply only for gluster-file volumes and currently will not work for gluster-block volumes.

For this approach, a “bastion host” is needed for executing the scripts, mounting GlusterFS snapshot volumes, and providing a place to install the agent if using backup and restore software. The bastion host should be a standalone RHEL7 machine separate from the OCP nodes and storage nodes in your deployment.

Requirements for the bastion host

The bastion host must have network connectivity to both the backup and restore server (if used), as well as the OCP nodes with the gluster pods (RHOCS converged mode) or the storage nodes (RHOCS independent mode). The following must be installed or downloaded to the bastion host (an example install command follows the list):

  • backup and restore agent, if used
  • heketi-client package
  • glusterfs-fuse client package
  • atomic-openshift-clients package
  • rhocs-backup scripts and .ini files
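For reference, a possible way to install the client packages on a RHEL 7 bastion host is shown below; it assumes the appropriate Red Hat Gluster Storage client and OCP client repositories have already been enabled with subscription-manager.

# Install the client tooling needed by the backup scripts (the repositories
# providing these packages must already be enabled on the bastion host)
yum install -y heketi-client glusterfs-fuse atomic-openshift-clients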

RHOCS backup scripts

The github repository rhocs-backup contains unsupported example code that can be used with backup and restore software products. The two scripts, rhocs-pre-backup.sh and rhocs-post-backup.sh, have been tested with Commvault Complete™ Backup and Recovery Software. The rhocs-pre-backup.sh script will do the following:

  • Find all gluster-file volumes using heketi-client
  • Create a gluster snapshot for each volume
  • Mount the snapshot volumes on the bastion host that has the backup agent installed
  • Protect the heketi configuration database by creating a json file for the database in the backup directory where all gluster snapshots are going to be mounted

Once the mounted snapshot volumes have been backed up, the rhocs-post-backup.sh script will do the following:

  • Unmount the snapshot volumes
  • Delete the gluster snapshot volumes

The two .ini files, independent_vars.ini and converged_vars.ini, are used to specify parameters specific to your RHOCS mode of deployment. Following are example parameters for converged_vars.ini.

## Environment variables for RHOCS Backup:
## Deployment mode for RHOCS cluster: converged (CNS) or independent (CRS)
export RHOCSMODE="converged"

## Authentication variables for accessing OpenShift cluster or
## Gluster nodes depending on deployment mode
export OCADDRESS="https://master.refarch311.ocpgluster.com:443"
export OCUSER="openshift"
export OCPASS="redhat"
export OCPROJECT="app-storage" ## OpenShift project where gluster cluster lives

## Any of the Gluster servers from RHOCS converged cluster
## used for mounting gluster snapshots
export GLUSTERSERVER=172.16.31.173

## Directory for temporary files to put the list of
## Gluster volumes /snaps to backup
export VOLDIR=/root
export SNAPDIR=/root

## Destination directory for mounting snapshots of Gluster volumes:
export PARENTDIR=/mnt/source

## Heketi Route and Credentials
export USERHEKETI=admin ## User with admin permissions to dialog with Heketi
export SECRETHEKETI="xzAqO62qTPlacNjk3oIX53n2+Z0Z6R1Gfr0wC+z+sGk=" ## Heketi user key
export HEKETI_CLI_SERVER=http://heketi-storage-app-storage.apps.refarch311.ocpgluster.com ## Route where Heketi pod is listening

## Provides Logging of this script in the dir specified below:
export LOGDIR="/root"

The pre-backup script, when executed, uses the heketi-client to obtain the list of current gluster-file volumes. Because of this, for the script to work properly, the heketi container must be online and reachable from the bastion host. Additionally, all GlusterFS nodes or peers of the RHOCS cluster must be online, because the GlusterFS snapshot operation requires that all bricks of a GlusterFS volume be available.
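To make the flow concrete, here is a minimal sketch of the kind of commands the pre-backup script runs in converged mode. It is not the supported rhocs-pre-backup.sh script itself: the GLUSTERPOD variable (any running gluster pod) is a placeholder, the other variables are assumed to be exported from converged_vars.ini, and the real script additionally filters out gluster-block hosting volumes and adds error handling and logging.

# Sketch only: snapshot, activate, and mount every gluster-file volume
TIMESTAMP=$(date +%Y%m%d-%H%M)
BACKUPDIR="${PARENTDIR}/backup-${TIMESTAMP}"
mkdir -p "${BACKUPDIR}"

# Protect the heketi database alongside the mounted snapshots
heketi-cli --user "${USERHEKETI}" --secret "${SECRETHEKETI}" db dump > "${BACKUPDIR}/heketi_db.json"

for VOL in $(heketi-cli --user "${USERHEKETI}" --secret "${SECRETHEKETI}" volume list | awk -F'Name:' '{print $2}' | awk '{print $1}'); do
  SNAP="${VOL}-snap-${TIMESTAMP}"
  oc rsh "${GLUSTERPOD}" gluster snapshot create "${SNAP}" "${VOL}" no-timestamp
  oc rsh "${GLUSTERPOD}" gluster snapshot activate "${SNAP}"
  mkdir -p "${BACKUPDIR}/${SNAP}"
  mount -t glusterfs "${GLUSTERSERVER}:/snaps/${SNAP}/${VOL}" "${BACKUPDIR}/${SNAP}"
done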

Manual execution of pre- and post-backup scripts

This section assumes that the bastion host has been created and that the necessary packages, scripts, and .ini files are installed on it. Currently, the pre- and post-backup scripts run as the root user. Because of this, backing up volumes for RHOCS independent mode requires that the bastion host can SSH as the root user, with passwordless access, to one of the GlusterFS storage nodes. This access should be verified before attempting to run the following scripts.
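A quick, non-interactive check of that access might look like the following (the node address is the example value used in the .ini file above):

# Fails immediately instead of prompting if passwordless root SSH is not set up
ssh -o BatchMode=yes root@172.16.31.173 'gluster peer status'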

The scripts can be manually executed in the following manner for RHOCS converged mode:

sudo ./rhocs-pre-backup.sh /<path_to_file>/converged_vars.ini

Followed by this script to unmount the snapshot volumes and to remove the snapshot volumes from the RHOCS Heketi database and GlusterFS converged cluster:

sudo ./rhocs-post-backup.sh /<path_to_file>/converged_vars.ini

A variation of these scripts for RHOCS independent mode can be manually executed in the following manner:

sudo ./rhocs-pre-backup.sh /<path_to_file>/independent_vars.ini

Followed by this script to unmount the snapshot volumes and to remove the snapshot volumes from the RHOCS Heketi database and GlusterFS independent cluster:

sudo ./rhocs-post-backup.sh /<path_to_file>/independent_vars.ini

For each execution of the pre- or post-backup script a log file will be generated and placed in the directory specified in the .ini file (default is /root).

Note: Pre-backup scripts can be modified as needed for specific scenarios to achieve application-level consistency, like quiescing a database before taking a backup. Also, if special features are used with RHOCS, like SSL encryption or geo-replication, scripts will have to be customized and adjusted to be compatible with those features.

Commvault backup process

Note that this blog post does not cover the tasks to install and configure Commvault to back up and restore data. In addition to having the Commvault Console and Agent in working order, this section also assumes that the bastion host has been created and has the necessary packages, scripts, and .ini files installed.

The use of these scripts to back up OCP PVs is compatible with any backup frequency or retention configured in the Commvault backup policy. However, because the gluster snapshots are mounted in newly created folders that include date and time information, the backup application will always consider the contents new; even if the backup policy is incremental, it will effectively perform a full backup. Also, the backup content will consist of dozens or even hundreds of relatively small filesystems (1-10 GB), which could run faster under an “always do full backup” strategy.

Detailed process for backup using Commvault

Once the scripts and .ini files are on the bastion host and a Commvault Agent is installed, a backup can be done using the Commvault Commcell Console or the Commvault Admin Console. The following views show how to do the backup using the Commcell Console and validate the backup using the Admin Console.

A Subclient must be created before a backup can be done and a unique name must be specified.

When creating a Subclient, you must specify the directory on the bastion host (where the Commvault Agent is installed) from which the backup will be taken (e.g., /mnt/source).

Choose what schedule you want the backup done on (or Do Not Schedule; start backup manually using Console instead).

And last, for the Subclient configuration, add the path to the pre- and post-backup scripts, as well as the path to the appropriate .ini file, converged_vars.ini or independent_vars.ini. Once this is done and the Subclient has been saved, you are ready to take a backup of the gluster-file snapshot volumes.

The backup can then be done by selecting the desired subclient and issuing an immediate backup or letting the selected schedule do the backups when configured (e.g., daily).

For an immediate backup, you can choose full or incremental. As already stated, a full backup will be done every time, because the pre-backup script always creates a new directory to mount the gluster-file snapshot volume.

You can track backup progress using the Job Controller tab.

Once the backup shows as complete in the Job Controller tab of the Commvault Commcell Console, it can be verified by logging in to the Commvault Admin Console, selecting the correct subclient (ocsbackups), and viewing the backup content for the GlusterFS volumes and the heketi database.

RHOCS restore and recovery process

Now that there are backups for RHOCS snapshot volumes, it is very important to have a process for restoring a snapshot from any particular date and time. Once data is restored, it can be copied back into the OCP PV for a target workload in a way that avoids conflicts (e.g., copying files into a running workload while updates are being made to the same files).

To do this, we will use CLI command “oc rsync” and an OCP “sleeper” deployment. You can use the command “oc rsync” to copy local files to or from a remote directory in a container. The basic syntax is “oc rsync <source> <destination>”. The source can be a local directory on the bastion host or it can be a directory in a running container, and similar is true for the destination.
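For example (the pod name and directories here are illustrative):

# Copy a local directory's contents into a running pod
oc rsync /tmp/restore/ sleeper-1-cxncv:/mnt

# Copy a directory out of a pod to the bastion host
oc rsync sleeper-1-cxncv:/mnt /tmp/from-pod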

In the case where the data in a PV must be completely replaced, it is useful to use a “sleeper” deployment as a place to restore the data so that the workload can essentially be turned off while the data is being restored to its volume, thereby avoiding any conflicts. The sleeper deploymentconfig will create a CentOS container and mount the workload PVC at a configured mount point in the CentOS container. This allows the backup gluster snapshot for the PV to then be copied from the directory where the snapshot was restored into the sleeper container (e.g., oc rsync <path to directory with restored data>/ sleeper-1-cxncv:/mnt --delete=true).

Simple restore or file-level recovery

This section details how to restore files or folders from a backup of a particular volume to a local working directory on the bastion host.

  1. Identify the desired backup of the gluster snapshot volume by date and time and volume name (see the following in Commvault subclient Restore view). Find the folder or files you want to restore, and check the appropriate boxes.

2. Restore the files to the same directory where the backup was taken or to any other directory on the bastion host.

3. Verify that the files are in the specified directory for the Commvault Restore. They can now be copied into the destination pod using “oc rsync” or the method described in the next section using a “sleeper” deployment.

$ pwd
/home/ec2-user/annette
$ ls -ltr test*
-rw-r-----. 1 1000470000 2002  8590 Dec 3 17:19 test.frm
-rw-r-----. 1 1000470000 2002 98304 Dec  3 17:19 test.ibd

Complete restore and recovery

This section details how to restore an entire volume. The process tested here will work for operational recovery (volumes with corrupted data), instances where volumes were inadvertently deleted, or to recover from infrastructure failures. The following example is for a MySQL database deployed in OCP.

  1. Identify the desired backup by date, time, and volume name (see the backup directory that is checked in Commvault Subclient Restore view).

2. Restore the backup to the original directory where the backup of the mounted gluster snapshot volume was taken, or to any other directory on the bastion host.

# ls /mnt/source/backup-20181203-1732/ocscon_mysql3_mysql_f3610b3f-f122-11e8-b862-028a65460540-snap-20181203-1732
auto.cnf    ca-key.pem   ca.pem       client-cert.pem     client-key.pem   ib_buffer_pool
ibdata1     ib_logfile0  ib_logfile1  ibtmp1              mysql            mysql-1-gs92m.pid
mysql_upgrade_info       performance_schema  private_key.pem  public_key.pem
sampledb    server-cert.pem  server-key.pem   sys

3. Change to the correct namespace or project (oc project mysql3).

4. Scale the mysql deploymentconfig to zero to temporarily stop the database service by deleting the mysql pod (oc scale --replicas=0 dc mysql).

5. Create a sleeper deployment/pod (oc create -f dc-sleeper.yaml). The YAML file to create the “sleeper deployment/pod” can be found in the next section.

6. Copy the backup to the PVC mounted in the sleeper pod (oc rsync <path to directory with restored data>/ sleeper-1-cxncv:/mnt --delete=true).

# oc rsync
/mnt/source//backup-20181203-1732/ocscon_mysql3_mysql_f3610b3f-f122-11e8-b862-
028a65460540-snap-20181203-1732/ sleeper-1-f9lgh:/mnt --delete=true

Note: Disregard this message: WARNING: cannot use rsync: rsync not available in container.

7. Delete the sleeper deploymentconfig (oc delete dc/sleeper).

8. Scale up the mysql deploymentconfig to recreate the mysql pod and start the service again. This will mount the mysql volume with the restored data (oc scale --replicas=1 dc mysql).

9. Log in to the mysql pod and confirm the correct operation of the database with the restored data (a quick check is sketched below).
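For instance, one possible spot check, assuming the pods carry the name=mysql label used by the standard MySQL template (adjust the label and database names for your environment):

# Open a shell in the restored mysql pod and confirm the databases are present
MYSQLPOD=$(oc get pods -l name=mysql -o name | head -1)
oc rsh "${MYSQLPOD}" bash -c 'mysql -u root -e "SHOW DATABASES;"'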

Creating the Sleeper DeploymentConfig

Following is the YAML file to create the sleeper deployment (oc create -f dc-sleeper.yaml). This deployment must be created in the same namespace or project as the workload you are trying to restore data to (e.g., the mysql deployment).

Note: The only modification needed for this YAML file is to specify the correct <pvc_name> below (e.g., mysql).

$ cat dc-sleeper.yaml
---
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  annotations:
    template.alpha.openshift.io/wait-for-ready: "true"
  name: sleeper
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    name: sleeper
  strategy:
    activeDeadlineSeconds: 21600
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      labels:
        name: sleeper
    spec:
      containers:
        - image: centos:7
          imagePullPolicy: IfNotPresent
          name: sleeper
          command: ["/bin/bash", "-c"]
          args: ["sleep infinity"]
          volumeMounts:
            - mountPath: /mnt
              name: data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 5
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: <pvc_name>
  test: false
  triggers:
    - type: ConfigChange

MySQL database point-in-time recovery example

To make sure there are no updates during the gluster-file snapshot, the following must be done for a mysql volume so that the backup is consistent.

  1. Log in to the mysql pod.
  2. Log in to mysql (mysql -u root).
  3. mysql> USE SAMPLEDB;
  4. mysql> FLUSH TABLES WITH READ LOCK;
  5. Take a gluster snapshot of the mysql volume by executing the rhocs-pre-backup.sh script manually for the correct RHOCS mode.
  6. mysql> UNLOCK TABLES;
  7. Remove the pre- and post-backup scripts in the Advanced tab for the Commvault subclient.
  8. Take a backup of the snapshot volume.
  9. Execute the rhocs-post-backup.sh script manually for the correct RHOCS mode (unmount and delete the gluster snapshot volumes).
  10. Continue with step 2 in the “Complete restore and recovery” section.

Backup scripts on Github

Scripts and .ini files can be found here. You are more than welcome to participate in this effort and improve the scripts and process.

Want to learn more about Red Hat OpenShift Container Storage?

Get a more intimate understanding of how Red Hat OpenShift and OCS work together with a hands-on test drive, and see for yourself.

Still want to learn more? Check out the Red Hat OpenShift Container Storage datasheet.

KubeCon Seattle, here we come!

Our top 3 storage-for-containers things to look forward to at KubeCon

By Steve Bohac, OpenShift Storage Product Marketing

Season's greetings!

As always, much going on with Red Hat OpenShift Container Storage!

Of course, Kubernetes 1.13 was released this week, Container Journal recently published an article I authored, and KubeCon Seattle is coming up next week… By the way, did you see the latest Forrester Wave Enterprise Container Platform Software Suites where Red Hat OpenShift was named a Leader? Good stuff!

Red Hat OpenShift Container Storage helps organizations standardize storage across multiple environments and easily integrates with Red Hat OpenShift to deliver a persistent storage layer for containerized applications that require long-term, stateful storage. Enterprises can benefit from a simple, integrated solution including the container platform, registry, application development environment, and storage, all in one, supported by a single vendor.

December is always a busy month with industry conferences (not to mention holiday planning!), so as I finalized my own KubeCon plans, I wanted to pause and take a quick breath and outline my top 3 things I’m looking forward to at KubeCon Seattle 2018 next week:

  1. Assorted Kubernetes announcements (whatever they are!). Yes, who knows what kind of interesting things will be announced next week… but they’ll likely be exciting! The Kubernetes ecosystem has gotten so large now, there is always a plethora of interesting products and technologies announced at KubeCon. It’s always interesting to see how these new announcements dictate where things are going with Kubernetes and cloud native technologies in general. (By the way, for a great overview of the “third era” of Kubernetes, check out PodCTL #54 with our own Brian Gracely and Tyler Britten.)
  2. For the first time ever, there will be a Cloud Native Storage Day as one of the co-located events at KubeCon. Like the other co-located events, it takes place next Monday before the KubeCon show officially kicks off. The day’s agenda includes customers and industry leaders like Red Hat (I’ll be there with a few colleagues presenting) discussing current implementations and future directions of container storage. This should be very educational and interactive for everyone! And…. the sessions will be recorded (look back here for a post-KubeCon blog after the show for links to the recordings!).
  3. Catching up on the status of the Rook project. What is Rook? Rook is a persistent storage orchestrator that is designed to run as a native Kubernetes service. Consider it the glue between storage and the container, the thing that makes automation work. This is an interesting development around storage for containers, and I’m looking forward to meeting up with colleagues and “fellow travelers” to understand more.

Anyway, it should be a good one at KubeCon next week (did I mention it is sold out!?). In between sessions, make sure to visit us in Booth D1 in the Expo Hall for product demonstrations, to speak with Red Hat OpenShift Container Storage experts and other community leaders about upstream projects, and to snag some of our giveaways (while supplies last!).

We hope to see you there! If we don’t catch you in person, we’ll be tweeting (and re-tweeting) all week! If you don’t already, make sure to follow us on Twitter at @RedHatStorage.

Not attending KubeCon? No sweat! You can still learn more and get hands on with a more intimate understanding of how Red Hat OpenShift and OpenShift Container Storage work together with a test drive.

Still want to learn more? Check out the Red Hat OpenShift Container Storage datasheet.

Red Hat Hyperconverged Infrastructure for Virtualization delivers increased efficiencies for storage and compute at the edge

Customers can realize more value and greater simplicity with cost-effective, open source, integrated compute and storage delivered in a compact footprint

By Daniel Gilfix, Red Hat Cloud Storage and Hyperconverged Infrastructure

Hyperconverged Infrastructure (HCI) emerged as an infrastructure category about a decade ago aimed at a few specific use cases and has been dominated by proprietary software vendors offering appliances built on their hardware, or rigid configurations delivered with OEM hardware partners.

What’s new?

Today we announced the next iteration of our enterprise-grade, open source approach in this space: Red Hat Hyperconverged Infrastructure for Virtualization 1.5, which benefits from the combined strength of Red Hat Enterprise Linux, Red Hat Virtualization, Red Hat Gluster Storage, and Red Hat Ansible Automation.

Where’s the beef?

Red Hat Hyperconverged Infrastructure for Virtualization (RHHI-V) is an optimized, hyperconverged infrastructure (HCI) offering that has helped organizations across industries like energy, retail, banking, telco, and the public sector make the most of business-critical applications that must be deployed with limited space, budget, and IT staff, including departmental and line-of-business operations, remote sites, and development and test environments. Integration with Red Hat Ansible Automation helps reduce the manual errors normally associated with downtime while enabling a more streamlined and speedy deployment. Simplified administration via a single user interface means you can consolidate your infrastructure and adopt a software-defined datacenter more efficiently. Such adoption includes using RHHI-V in lieu of a more expensive VMware “lock-in” environment or transitioning from it under professional guidance with the Red Hat infrastructure migration solution.

What’s inside?

Red Hat Hyperconverged Infrastructure for Virtualization 1.5 now features advanced data reduction capabilities for even greater efficiencies, as well as a series of validated server configurations for optimized workloads that take the guesswork out of infrastructure deployment. Details follow:

  • Data reduction via deduplication and compression. Made possible through embedded Virtual Data Optimizer (VDO) code in Red Hat Enterprise Linux, you can now efficiently eliminate duplicate instances of repeating data and compress the reduced data set. This results in improved storage utilization and enables more affordable high-performance storage options.
  • Virtual graphics processing unit (vGPU). With the vGPU capability, you can assign GPU slices to VMs to accelerate 3D graphics and to offload computationally heavy jobs, including applications in computational science, workloads in oil and gas and manufacturing, and emerging AI and machine learning workloads.
  • Open Virtual Network support. Support for software-defined networking via Open Virtual Network (OVN) helps improve scalability while enabling live migration of virtual networking components in a hyperconverged Linux environment.
  • Deep Ansible integration. Red Hat Ansible Automation enables true “ops value” at deploy and runtime, thereby paving the way toward your broader automation goals. We also deliver Ansible playbooks to enable remote replication and recovery of RHHI-V environments.
  • Validated hardware configurations. To help ensure RHHI-V users deploy sound infrastructure configurations, Red Hat has tested a number of use cases with our hardware partners and documents configuration guidelines for optimized workloads. These configurations, along with our new RHHI-V sizing tool, can help you anticipate platform requirements based on your usage patterns, taking the guesswork out of deploying a software-defined HCI platform and reducing time to value. You can choose among industry-standard hardware and enjoy more predictable performance for your desired deployment patterns.

Who benefits?

While RHHI-V was initially targeted at remote office/branch office deployment, we’ve experienced steadily increasing demand to support more mission-critical applications, such as remote tactical operations for public sector, field analysis and oil rig operations in the energy sector, and managing data from a myriad of sensors in factories across both process and discrete manufacturing. Now integrated even more broadly across the Red Hat software stack, RHHI-V is a powerful, general purpose platform for anyone seeking to jumpstart edge computing or modernize their existing data center to accommodate new workloads with greater degrees of efficiency. 

How can you learn more?

For more information on Red Hat Hyperconverged Infrastructure for Virtualization, check out this article by Storage Switzerland. Feel free to also attend our upcoming webinar on December 11. You can always simply access us on the web.

 

Five reasons you need to change your data storage—now

By Terry L. Smith, Senior Director, Penguin Computing’s Advanced Solutions Group

Transformation of the data storage industry in recent years has been dramatic. We’ve seen the development of new component technologies, yielding higher capacities and performance. But more profound is the general acceptance that the old, proprietary, monolithic approach to storage simply cannot keep up with business needs. Open, software-defined storage delivers a flexible, cost-efficient alternative to traditional storage appliances while being better able to handle the demands of modern workloads.

Penguin Computing and Red Hat together deliver comprehensive, open, software-defined storage solutions, expertly architected and configured to meet your business requirements.

But why should you consider complementing your existing monolithic appliance storage with a software-defined approach?  I see five key reasons:

  1. Your data storage requirements keep growing, but traditional storage appliances are not built to handle them.
    There is only so much scaling up you can do with a traditional storage appliance. To keep up with your growing data storage needs, you find yourself in a cycle of “upgrades by replacement.” This is a huge capital burden, exacerbated by additional costs to license and support both the old and new systems during the upgrade migration. Worse still, you may even need to “upgrade” before the appliance’s expected end-of-life, when it would be fully amortized. With an open, scale-out, software-defined storage solution, you can take control. Built with industry-standard server technologies, you can scale out your open storage in manageable units and replace hardware only when needed. You can control your storage growth in a way that cost-effectively meets your needs, not the needs of the vendor.
  2. Your data storage solution should be feature-rich and flexible.
    With traditional storage appliances, your options for capacity and performance may be severely limited. And, other features, like advanced data protection and access protocol support, may be unavailable or require additional licensing. You may even be required to purchase a completely new appliance. But open, software-defined storage solutions empower you with features and flexibility out-of-the-box, often with all-inclusive software pricing. Hardware, software, and support can be decoupled, giving you the ability to work with vendors of your choice and sculpt the cost-effective solution that fits your business needs.
  3. You should have control of your storage support costs.
    Most traditional storage vendors have a business model based on volume of units sold. A “next-generation” box comes out on a regular schedule, and customers are expected to purchase the “upgrade.” To encourage this, traditional storage vendors often keep raising the cost of support for the older appliance. And, if you stop paying for support, the appliance may even stop working. So, you end up buying the new box, even if the older appliance is still capable of meeting your needs. Open storage solutions let you decide how and when to handle hardware and software upgrades. In fact, you could use a rolling upgrade, where old industry standard server equipment is replaced by new equipment as needed and the software subscription is rolled over from the old equipment. This helps eliminate the traditional “migration” concept and its  associated costs. And enterprise-level software support is typically at a flat, predictable rate, which is often lower than the average support cost for proprietary, traditional storage appliances.
  4. You can avoid vendor lock-in and keep your options open.
    Most traditional storage vendors count on locking you into their ecosystems, limiting your upgrade and support options. You can even be legally restricted from making any changes to the storage appliance, such as buying disks directly from disk vendors, to better meet your business requirements or keep using it past the expiration of the support contract. Open technology solutions free you from these limitations and restrictions. If you’re comfortable working directly with the open source and can support it, you may even replace or modify the software layer to meet your own, specific requirements without getting permission from anyone. The message here is clear: You are in control of your options.
  5. You can be ready for the next industry shift, like hybrid-cloud computing with open, software-defined storage.
    Most traditional storage vendors rely on costly and often small development teams who lack the scale to keep up with changing business needs. Open technology solutions, however, generally are created by some of the largest development communities in the world with guidance, vetting, and end-customer support delivered by world-class solution providers who understand business. The result is that open technologies can deliver reliable, feature-rich solutions capable of meeting your business needs now and in the future.

Penguin Computing and Red Hat have been bringing open technology solutions to enterprises for over two decades. With Penguin Computing’s FrostByte family of software-defined storage solutions, featuring Red Hat Ceph Storage and Red Hat Gluster Storage, businesses can break free of the traditional storage appliance without giving up enterprise-quality hardware, software, and services.

You can learn more about Penguin FrostByte with Red Hat Gluster Storage here and Penguin FrostByte with Red Hat Ceph Storage here.

About Terry
Terry L. Smith is senior director of Penguin Computing’s Advanced Solutions Group (ASG). Terry came to Penguin Computing in 2014 with a history of entrepreneurship and deep technical expertise. Launched in 2017, Terry’s group has opened new markets with solutions featuring advanced technologies and designed with world-class partnerships. This includes the FrostByte family of software-defined storage solutions featuring Red Hat Storage. One of ASG’s successes features FrostByte with Red Hat Gluster Storage delivered as an ongoing service for a Fortune 500 financial services institution.

Running OpenShift Container Storage 3.10 with Red Hat OpenShift Container Platform 3.10

By Annette Clewett and Jose A. Rivera

With the release of Red Hat OpenShift Container Platform 3.10, we’ve officially rebranded what used to be referred to as Red Hat Container-Native Storage (CNS) as Red Hat OpenShift Container Storage (OCS). Versioning remains sequential (i.e., OCS version 3.10 is the follow-on to CNS 3.9). You’ll continue to have the convenience of deploying OCS 3.10 as part of the normal OpenShift deployment process in a single step, and an OpenShift Container Platform (OCP) evaluation subscription now includes access to OCS evaluation binaries and subscriptions.

OCS 3.10 introduces an important feature for container-based storage with OpenShift. Arbiter volume support allows for there to be only two replica copies of the data, while still providing split-brain protection and ~30% savings in storage infrastructure versus a replica-3 volume. This release also hardens block support for backing OpenShift infrastructure services. Detailed information on the value and use of OCS 3.10 features can be found here.

OCS 3.10 installation with OCP 3.10 Advanced Installer

Let’s now take a look at the installation of OCS with the OCP Advanced Installer. OCS can provide persistent storage for both OCP’s infrastructure applications (e.g., integrated registry, logging, and metrics), as well as  general application data consumption. Typically, both options are used in parallel, resulting in two separate OCS clusters being deployed in a single OCP environment. It’s also possible to use a single OCS cluster for both purposes.

Following is an example of a partial inventory file with selected options concerning deployment of OCS for applications and an additional OCS cluster for infrastructure workloads like registry, logging, and metrics storage. When using these options for your deployment, values with specific sizes (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors  (e.g., node-role.kubernetes.io/infra=true) should be adjusted for your particular deployment needs.

If you’re planning to use gluster-block volumes for logging and metrics, they can now be installed when OCP is installed. (Of course, they can also be installed later.)

[OSEv3:children]
...
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
...      
# registry
openshift_hosted_registry_storage_kind=glusterfs       
openshift_hosted_registry_storage_volume_size=10Gi   
openshift_hosted_registry_selector="node-role.kubernetes.io/infra=true"

# logging
openshift_logging_install_logging=true
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_cluster_size=3
openshift_logging_es_pvc_storage_class_name='glusterfs-registry-block'
openshift_logging_kibana_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_curator_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}

# metrics
openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=dynamic
openshift_metrics_storage_volume_size=20Gi
openshift_metrics_cassandra_pvc_storage_class_name='glusterfs-registry-block'
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/infra": "true"}

# Container image to use for glusterfs pods
openshift_storage_glusterfs_image="registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.10"

# Container image to use for gluster-block-provisioner pod
openshift_storage_glusterfs_block_image="registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.10"

# Container image to use for heketi pods
openshift_storage_glusterfs_heketi_image="registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.10"
 
# OCS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=false   

# OCS storage cluster for OpenShift infrastructure
openshift_storage_glusterfs_registry_namespace=infra-storage  
openshift_storage_glusterfs_registry_storageclass=false       
openshift_storage_glusterfs_registry_block_deploy=true   
openshift_storage_glusterfs_registry_block_host_vol_create=true    
openshift_storage_glusterfs_registry_block_host_vol_size=200   
openshift_storage_glusterfs_registry_block_storageclass=true
openshift_storage_glusterfs_registry_block_storageclass_default=false

...
[nodes]
ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-infra-node01.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node02.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node03.ocpgluster.com openshift_node_group_name="node-config-infra"

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
ose-infra-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'

Inventory file options explained

The first section of the inventory file defines the host groups the installation will be using. We’ve defined two new groups: (1) glusterfs and (2) glusterfs_registry. The settings for the glusterfs group all start with openshift_storage_glusterfs_, and the settings for the glusterfs_registry group all start with openshift_storage_glusterfs_registry_. In each group, the nodes that will make up the OCS cluster are listed, and the devices ready for exclusive use by OCS are specified (glusterfs_devices=).

The first group of hosts in glusterfs specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes for the general-purpose application cluster, glusterfs.

The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for use exclusively by a hosted registry that can scale. This cluster will not offer a StorageClass for file-based PersistentVolumes with the options and values as they are currently configured (openshift_storage_glusterfs_registry_storageclass=false). This cluster will also support gluster-block (openshift_storage_glusterfs_registry_block_deploy=true). PersistentVolume creation can be done via StorageClass glusterfs-registry-block (openshift_storage_glusterfs_registry_block_storageclass=true). Special attention should be given to choosing the size for openshift_storage_glusterfs_registry_block_host_vol_size. This is the hosting volume for gluster-block devices that will be created for logging and metrics. Make sure that the size can accommodate all these block volumes and that you have sufficient storage if another hosting volume must be created.
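As a quick sanity check with the example sizes used in this inventory (three 50Gi Elasticsearch PVs for logging plus one 20Gi Cassandra PV for metrics against a 200GB hosting volume):

# 3 x 50Gi (logging) + 1 x 20Gi (metrics) must fit in the 200GB block-hosting volume
echo $(( 3 * 50 + 20 ))   # prints 170, which fits within 200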

If you want to tune the installation, more options are available in the Advanced Installation documentation. To automate the generation of the required inventory file options shown previously, check out the newly available red-hat-storage tool called “CNS Inventory file Creator,” or CIC (alpha version at this time). The CIC tool creates CNS or OCS inventory file options for OCP 3.9 and OCP 3.10, respectively. CIC asks a series of questions about the OpenShift hosts, the storage devices, and the sizes of PersistentVolumes for registry, logging, and metrics, and it has baked-in checks to help make sure the OCP installation will be successful. This tool is currently in an alpha state, and we’re looking for feedback. Download it from the github repository openshift-cic.

Single OCS cluster installation

Again, it is possible to support both general-application storage and infrastructure storage in a single OCS cluster. To do this, the inventory file options will change slightly for logging and metrics. This is because when there is only one cluster, the gluster-block StorageClass would be glusterfs-storage-block. The registry PV will be created on this single cluster if the second cluster, [glusterfs_registry], does not exist. For high availability, it’s very important to have four nodes for this cluster.  Also, special attention should be given to choosing the size for openshift_storage_glusterfs_block_host_vol_size. This is the hosting volume for gluster-block devices that will be created for logging and metrics. Make sure that the size can accommodate all these block volumes and that you have sufficient storage if another hosting volume must be created.

[OSEv3:children]
...
nodes
glusterfs

[OSEv3:vars]
...      
# registry
...

# logging
openshift_logging_install_logging=true
...
openshift_logging_es_pvc_storage_class_name='glusterfs-storage-block'
... 

# metrics
openshift_metrics_install_metrics=true
...
openshift_metrics_cassandra_pvc_storage_class_name='glusterfs-storage-block'

...

# OCS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=true
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_block_storageclass=true
openshift_storage_glusterfs_block_storageclass_default=false
...

[nodes]

ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"   
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute" 
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute" 
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute" 

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

OCS 3.10 uninstall

With the OCS 3.10 release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing an OCS installation that is currently being used by any applications, you should remove those applications before removing OCS, because they will lose access to storage. This includes infrastructure applications like registry, logging, and metrics that have PV claims created using the glusterfs-storage and glusterfs-storage-block Storage Class resources.

You can remove logging and metrics resources by re-running the deployment playbooks like this:

ansible-playbook -i <path_to_inventory_file> -e
"openshift_logging_install_logging=false"
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

ansible-playbook -i <path_to_inventory_file> -e
"openshift_metrics_install_metrics=false"
/usr/share/ansible/openshift-ansible/playbooks/openshift-metrics/config.yml

Make sure to manually remove any logging or metrics PersistentVolumeClaims. The associated PersistentVolumes will be deleted automatically.
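For example, to find and remove the relevant claims (names and namespaces vary by deployment, so check before deleting):

# List candidate claims, then delete the ones created for logging and metrics
oc get pvc --all-namespaces | grep -Ei 'logging|metrics'
oc delete pvc <claim_name> -n <namespace>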

If you have the registry using a glusterfs PersistentVolume, remove it with the following command:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If running the uninstall.yml because a deployment failed, run the uninstall.yml playbook with the following variables to wipe the storage devices for both glusterfs and glusterfs_registry before trying the OCS installation again.

ansible-playbook -i <path_to_inventory file> -e
"openshift_storage_glusterfs_wipe=True" -e
"openshift_storage_glusterfs_registry_wipe=true"
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

OCS 3.10 post installation for applications, registry, logging and metrics

You can add OCS clusters and resources to an existing OCP install using the following command. This same process can be used if OCS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml

After the new cluster(s) is created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-hosted/config.yml

You can now deploy logging and metrics resources by re-running these deployment playbooks:

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

ansible-playbook -i <path_to_inventory_file>
/usr/share/ansible/openshift-ansible/playbooks/openshift-metrics/config.yml

Want to learn more?

For hands-on experience combining OpenShift and OCS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, watch this short video explaining why to use OCS with OCP. Detailed information on the value and use of OCS 3.10 features can be found here.

Improved volume management for Red Hat OpenShift Container Storage 3.10

By Annette Clewett and Husnain Bustam

Hopefully by now you’ve seen that with the release of Red Hat OpenShift Container Platform 3.10 we’ve rebranded our container-native storage (CNS) offering to be called Red Hat OpenShift Container Storage (OCS). Versioning remains sequential (i.e., OCS 3.10 is the follow-on to CNS 3.9).

OCS 3.10 introduces important features for container-based storage with OpenShift. Arbiter volume support allows for there to be only two replica copies of the data, while still providing split-brain protection and ~30% savings in storage infrastructure versus a replica-3 volume. This release also hardens block support for backing OpenShift infrastructure services. In addition to supporting arbiter volumes, major improvements to ease operations are available to give you the ability to monitor provisioned storage consumption, expand persistent volume (PV) capacity without downtime to the application, and use a more intuitive naming convention for PVs.

For easy evaluation of these features, an OpenShift Container Platform evaluation subscription now includes access to OCS evaluation binaries and subscriptions.

New features

Now let’s dive deeper into the new features of the OCS 3.10 release:

  • Prometheus OCS volume metrics: Volume consumption metrics data (e.g., volume capacity, available space, number of inodes in use, number of inodes free) is now available in Prometheus for OCS. These metrics can be used to monitor storage capacity and consumption trends and to take timely action so that applications are not impacted.
  • Heketi topology and configuration metrics: Available from the Heketi HTTP metrics service endpoint, these metrics can be viewed using Prometheus or curl http://<heketi_service_route>/metrics. These metrics can be used to query heketi health, number of nodes, number of devices, device usage, and cluster count.
  • Online expansion of provisioned storage: You can now expand the OCS-backed PVs within OpenShift by editing the corresponding claim (oc edit pvc <claim_name>) with the new desired capacity (spec→ requests → storage: new value).
  • Custom volume naming: Before this release, the names of the dynamically provisioned GlusterFS volumes were auto-generated with a random UUID. Now, by adding a custom volume name prefix, the GlusterFS volume name will include the namespace or project as well as the claim name, thereby making it much easier to map to a particular workload.
  • Arbiter volumes: Arbiter volumes allow for reduced storage consumption and better performance across the cluster while still providing the redundancy and reliability expected of GlusterFS.

Volume and Heketi metrics

As of OCP 3.10 and OCS 3.10, the following metrics are available in Prometheus (and by executing curl http://<heketi_service_route>/metrics):

kubelet_volume_stats_available_bytes: Number of available bytes in the volume
kubelet_volume_stats_capacity_bytes: Capacity in bytes of the volume
kubelet_volume_stats_inodes: Maximum number of inodes in the volume
kubelet_volume_stats_inodes_free: Number of free inodes in the volume
kubelet_volume_stats_inodes_used: Number of used inodes in the volume
kubelet_volume_stats_used_bytes: Number of used bytes in the volume
heketi_cluster_count: Number of clusters
heketi_device_brick_count: Number of bricks on device
heketi_device_count: Number of devices on host
heketi_device_free: Amount of free space available on the device
heketi_device_size: Total size of the device
heketi_device_used: Amount of space used on the device
heketi_nodes_count: Number of nodes on the cluster
heketi_up: Verifies if heketi is running
heketi_volumes_count: Number of volumes on cluster
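These metrics can also be spot-checked without Prometheus; for example, against the heketi route (the hostname below is the example route used earlier in this post):

# Query the heketi metrics endpoint directly and show only the heketi_* series
curl -s http://heketi-storage-app-storage.apps.ocpgluster.com/metrics | grep '^heketi_'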

Populating Heketi metrics in Prometheus requires additional configuration of the Heketi service. You must add the prometheus.io/scheme and prometheus.io/scrape annotations using the following commands:

# oc annotate svc heketi-storage prometheus.io/scheme=http
# oc annotate svc heketi-storage prometheus.io/scrape=true
# oc describe svc heketi-storage
Name:           heketi-storage
Namespace:      app-storage
Labels:         glusterfs=heketi-storage-service
                heketi=storage-service
Annotations:    description=Exposes Heketi service
                prometheus.io/scheme=http
                prometheus.io/scrape=true
Selector:       glusterfs=heketi-storage-pod
Type:           ClusterIP
IP:             172.30.90.87
Port:           heketi  8080/TCP
TargetPort:     8080/TCP

Populating Heketi metrics in Prometheus also requires additional configuration of the Prometheus configmap. As shown in the following, you must modify the Prometheus configmap with the namespace of the Heketi service and restart the prometheus-0 pod:

# oc get svc --all-namespaces | grep heketi
app-storage      heketi-storage       ClusterIP 172.30.90.87  <none>  8080/TCP
# oc get cm prometheus -o yaml -n openshift-metrics
....
- job_name: 'kubernetes-service-endpoints'
   ...
   relabel_configs:
     # only scrape infrastructure components
     - source_labels: [__meta_kubernetes_namespace]
       action: keep
       regex: 'default|logging|metrics|kube-.+|openshift|openshift-.+|app-storage'
# oc scale --replicas=0 statefulset.apps/prometheus
# oc scale --replicas=1 statefulset.apps/prometheus

Online expansion of GlusterFS volumes and custom naming

First, let’s discuss what’s needed to allow expansion of GlusterFS volumes. This opt-in feature is enabled by configuring the StorageClass for OCS with the parameter allowVolumeExpansion set to “true” and by enabling the ExpandPersistentVolumes feature gate. You can now dynamically resize storage volumes attached to containerized applications without needing to first detach and then attach a storage volume with increased capacity, which enhances application availability and uptime.

Enable the ExpandPersistentVolumes feature gate on all master nodes:

# vim /etc/origin/master/master-config.yaml
kubernetesMasterConfig:
  apiServerArguments:
    feature-gates:
    - ExpandPersistentVolumes=true
# /usr/local/bin/master-restart api
# /usr/local/bin/master-restart controllers

This release also supports adding a custom volume name prefix created with the volume name prefix, project name/namespace, claim name, and UUID (<myPrefix>_<namespace>_<claimname>_UUID). Parameterizing the StorageClass ( `volumenameprefix: myPrefix`) allows easier identification of volumes in the GlusterFS backend.

The new OCS PVs will be created with the volume name prefix, project name/namespace, claim name, and UUID (<myPrefix>_<namespace>_<claimname>_UUID), making it easier for you to automate day-2 admin tasks like backup and recovery, applying policies based on pre-ordained volume nomenclature, and other day-2 housekeeping tasks.

In this StorageClass, support for both online expansion of OCS/GlusterFS PVs and custom volume naming has been added.

# oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
parameters:
  resturl: http://heketi-storage-storage.apps.ose-master.example.com
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: storage
  volumenameprefix: gf ❶
allowVolumeExpansion: true ❷
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

❶ Custom volume name support: <volumenameprefixstring>_<namespace>_<claimname>_UUID
❷ Parameter needed for online expansion or resize of GlusterFS PVs

Be aware that PV expansion is not supported for block volumes, only for file volumes.

Expanding a volume starts with editing the PVC field “requests:storage” with the new desired size for the PersistentVolume. For example, to grow a 1GiB PV to 2GiB, edit the PVC’s “requests:storage” value to 2Gi; the bound PV is automatically resized, and the new 2GiB size is reflected in OCP, heketi-cli, and gluster commands. The expansion process adds another replica set and converts the 3-way replicated volume into a distributed-replicated volume (2×3 bricks instead of 1×3).
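
As a minimal sketch (the claim name db-claim is hypothetical), the same resize can be triggered by patching the claim rather than editing it interactively, and then verified on the OCP and heketi sides:

# oc patch pvc db-claim -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
# oc get pvc db-claim
# heketi-cli volume list | grep db-claim

With the custom volume name prefix configured above, the expanded volume is easy to spot in the heketi-cli output because the claim name is part of the GlusterFS volume name.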

GlusterFS arbiter volumes

Arbiter volume support is new to OCS 3.10 and has the following advantages:

  • An arbiter volume is still a 3-way replicated volume for highly available storage.
  • Arbiter bricks do not store file data; they only store file names, structure, and metadata.
  • Arbiter uses client quorum to compare this metadata with metadata of other nodes to ensure consistency of the volume and prevent split brain conditions.
  • Using Heketi commands, it is possible to control arbiter brick placement using tagging so that all arbiter bricks are on the same node.
  • With control of arbiter brick placement, the ‘arbiter’ node can have limited storage compared to other nodes in the cluster.

As an example, two GlusterFS volumes can be configured across five nodes as 3-way arbitrated replicated volumes, with the arbiter bricks placed on a dedicated arbiter node.

In order to use arbiter volumes with OCP workloads, an additional parameter must be added to the GlusterFS StorageClass: volumeoptions: user.heketi.arbiter true. In the following StorageClass, support for online expansion of GlusterFS PVs, custom volume naming, and arbiter volumes has been added.

# oc get sc glusterfs-storage -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage
parameters:
  resturl: http://heketi-storage-storage.apps.ose-master.example.com
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: storage
  volumenameprefix: gf ❶
  volumeoptions: user.heketi.arbiter true ❸
allowVolumeExpansion: true ❷
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

❶ Custom volume name support: <volumenameprefixstring>_<namespace>_<claimname>_UUID
❷ Parameter needed for online expansion or resize of GlusterFS volumes
❸ Enables arbiter volume support in the StorageClass. All PVs created from this StorageClass will be 3-way arbitrated replicated volumes.
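
As noted in the arbiter bullet list above, brick placement can be steered with Heketi node tags. A hedged sketch, assuming the node ID of the dedicated arbiter node has already been looked up with heketi-cli topology info:

# heketi-cli node settags <arbiter_node_id> arbiter:required
# heketi-cli node info <arbiter_node_id>

Tagging the dedicated node arbiter:required directs heketi to place only arbiter bricks on it, which is what allows that node to be provisioned with less storage than the data nodes.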

Want to learn more?

For hands-on experience combining OpenShift and OCS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, check out this short video explaining why using OCS with OpenShift is the right choice for the container storage infrastructure. For details on running OCS 3.10 with OCP 3.10, click here.

Breaking down data silos with Red Hat infrastructure

By Brent Compton, Senior Director, Technical Marketing, Red Hat Cloud Storage and HCI

Breaking down barriers to innovation.
Breaking down data silos.

These are arguably two of the top items on many enterprises’ wish lists. In the world of analytics infrastructure, people have described a solution to these needs as “multi-tenant workload isolation with shared storage.” Several public-cloud-based analytics solutions exist to provide this. However, many large Red Hat customers are doing large-scale analytics in their own data centers and were unable to solve these problems with their on-premises analytic infrastructure solutions. They turned to Red Hat private cloud platforms as their analytics infrastructure and achieved just this: multi-tenant workload isolation with shared storage. To be clear, Red Hat is not providing these customers with analytics tools. Instead, it is welcoming these analytics tools onto the same Red Hat infrastructure platforms running much of the rest of their other enterprise workloads.

Traditional on-premises analytics infrastructures do not provide on-demand provisioning for short-running analytics workloads, frequently needed by data scientists. In addition, traditional HDFS-based infrastructures do not share storage between analytics clusters. As such, traditional analytics infrastructures often don’t meet the competing needs of multiple teams needing different types of clusters, all with access to common data sets. Individual teams can end up competing for the same set of cluster resources, causing congestion in busy analytics clusters, leading to frustration and delays in getting insights from their data.

As a result, a team may demand their own separate analytics cluster so their jobs aren’t competing for resources with other teams, and so they can tailor their cluster to their own workload needs. Without a shared storage repository, this can lead to multiple analytic cluster silos, each with its own copy of data. Net result? Cost duplication and the burden of maintaining and tracking multiple data set copies.

An answer to these challenges? Bring your analytics workloads onto a common, scalable infrastructure.

Red Hat has seen customers solve these challenges by breaking down traditional Hadoop silos and bringing analytics workloads onto a common, private cloud infrastructure running in today’s enterprise datacenters. At its core is Red Hat Ceph Storage, our massively scalable, software-defined object storage platform, which enables organizations to more easily share large-scale data sets between analytics clusters. The on-demand provisioning of virtualized analytics clusters is enabled through Red Hat OpenStack Platform. Additionally, early adopters are deploying Apache Spark in kubernetes-orchestrated, container-based clusters via Red Hat OpenShift Container Platform. Delivery and support are provided by the IT experts at Red Hat Consulting based on documented leading practices to help establish an optimal architecture for our clients’ unique requirements.

Key benefits to customers

Agility

  • Get answers faster. By enabling teams to elastically provision their own dedicated analytics compute resources via Red Hat OpenStack Platform, teams have avoided cluster resource competition in order to better meet service-level agreements (SLAs). And teams can spin up these new analytics clusters without lengthy data-hydration delays (made possible by accessing shared data sets on Red Hat Ceph Storage).
  • Remove roadblocks. Empower teams of data scientists to use the analytics tools/versions they need through dynamically provisioned data labs and workload clusters (while still accessing shared data sets).
  • Hybrid cloud versatility. Enable your query authors to use the same S3 syntax in their queries, whether running on a private cloud or public cloud. Spark and other popular analytics tools can use the Hadoop S3A client to access data in S3-compatible object storage, in place of native HDFS. Ceph is the most popular S3-compatible open-source object storage backend for OpenStack.

Cost/risk reduction

  • Cut costs associated with data set duplication. In traditional Hadoop/Spark HDFS clusters, data is not shared. If a data scientist wants to analyze data sets that exist in two different clusters, they may need to copy data sets from one cluster to the other. This can result in duplicate costs for multi-PB data sets that must be copied among many analytics clusters.
  • Reduce risks of maintaining duplicate data sets. Duplicate data-set maintenance can be time-consuming and prone to error, but it can also result in incomplete or inaccurate insights being derived from stale data.
  • Scale costs based on requirements. In traditional Hadoop/Spark HDFS clusters, capacity is added by procuring more HDFS nodes with a fixed ratio of CPU and storage capacity. With Red Hat data analytics infrastructure, customers can provision compute servers separately from a common storage pool and thus can scale each resource according to need. By freeing storage capacity from compute cores previously locked together, companies can scale storage capacity costs independently of compute costs according to need.

Innovation for today’s data needs

As data continues to grow, organizations should have a supporting infrastructure that can break down data silos and enable teams to access and use information in more agile ways. Red Hat platforms can foster greater agility, efficiency, and savings–a nice combination for today’s data-driven organizations looking to build analytics applications across the open hybrid cloud.

You can also find our blog post that covers other news from the Strata conference and upstream community projects here. For more details on empirical test results, see here. For a video whiteboard of these topics, see here. Finally, to learn more, visit www.redhat.com/bigdata.

 

Introducing Red Hat Gluster Storage 3.4: Feature overview

By Anand Paladugu, Principal Product Manager

We’re pleased to announce that Red Hat Gluster Storage 3.4 is now Generally Available!

Since this release is a full rebase with the upstream, it consolidates many bug fixes, thus giving you a greater degree of overall stability for both container storage and traditional file serving use cases. Given that Red Hat OpenShift Container Storage is based on Red Hat Gluster Storage, these fixes will also be embedded in the 3.10 release of OpenShift Container Storage. To enable you to refresh your Red Hat Enterprise Linux (RHEL) 6-based Red Hat Gluster Storage installations, this release supports upgrading your Red Hat Gluster Storage servers from RHEL 6 to RHEL 7. Last, you can now deploy Red Hat Gluster Storage Web Administrator with minimal resources, which also offers robust and feature-rich monitoring capabilities.

Here is an overview of the new features delivered in Red Hat Gluster Storage 3.4:

Support for upgrading Red Hat Gluster Storage from RHEL 6 to RHEL 7

Many customers like to ensure they’re on the latest and greatest RHEL in their infrastructures. Two scenarios are now supported for upgrading RHEL servers in a Red Hat Gluster Storage deployment from RHEL 6 to RHEL 7:

  1. Red Hat Gluster Storage version is <= 3.3.x and the underlying RHEL version is <= latest version of 6.x. The upgrade process updates Red Hat Gluster Storage to version 3.4 and the underlying RHEL version to the latest version of RHEL 7.
  2. Red Hat Gluster Storage version is 3.4 and the underlying RHEL version is the latest version of 6.x. The upgrade process keeps the Red Hat Gluster Storage version at 3.4 and upgrades the underlying RHEL version to the latest version of RHEL 7.

MacOS client support

Mac workstations continue to make inroads into corporate infrastructures. Red Hat Gluster Storage 3.4 supports macOS as a Server Message Block (SMB) client and thereby allows customers to map SMB shares backed by Red Hat Gluster Storage in the macOS Finder.

Punch hole support for third-party applications

The “punch hole” feature frees up physical disk space when portions of a file are de-referenced. For example, suppose a virtual machine (VM) image backup occupies 20GB of disk space, and portions of that file are later de-referenced because of data deduplication. Without punch hole support, the full 20GB remains occupied on the underlying physical disk. With it, third-party applications can “punch a hole” corresponding to the de-referenced portions, freeing up the physical disk space and helping reduce the storage costs associated with backing up and archiving those VMs.
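
Under the covers, this maps to standard filesystem hole punching (fallocate with FALLOC_FL_PUNCH_HOLE). A hedged illustration of the effect on an ordinary sparse file (the file name, offset, and length are just examples): the apparent size reported by du --apparent-size stays the same, while the physical usage reported by plain du drops by the punched amount.

# fallocate --punch-hole --offset 2GiB --length 5GiB vm-backup.img
# du -h --apparent-size vm-backup.img
# du -h vm-backup.img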

Subdirectory exports using the Gluster Fuse protocol now fully supported

Beginning with Red Hat Gluster Storage 3.4, subdirectory export using Fuse is now fully supported. This feature provides namespace isolation, where a single Gluster volume can be shared with many clients, each mounting only a subset of the volume (namespace), i.e., a subdirectory. You can also export a subdirectory of an already exported volume to utilize space left in the volume for a different project.
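
A hedged example of how this looks in practice (the volume name projects, subdirectory team-a, host name, and client subnet are illustrative): access to the subdirectory can optionally be restricted with auth.allow on a server, and the client simply appends the subdirectory to the volume name at mount time.

# gluster volume set projects auth.allow "/team-a(192.168.10.*)"
# mount -t glusterfs rhgs-node1:/projects/team-a /mnt/team-a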

Red Hat Gluster Storage web admin enhancements

The Web Administration tool delivers browser-based graphing, trending, monitoring, and alerting for Red Hat Gluster Storage in the enterprise. This latest Red Hat Gluster Storage release optimizes this web admin tool to consume fewer resources and allow greater scaling to monitor larger clusters than in the past.

Faster directory lookups using the Gluster NFS-Ganesha server

In Red Hat Gluster Storage 3.4, the Readdirp API is extended and enhanced to return handles along with directory stats as part of its reply, thereby reducing NFS operations latency.

In internal testing, performance gains were noticed for all directory operations when compared to Red Hat Gluster Storage 3.3.1. For example, directory creation improved by up to 31%, file creates improved by up to 42%, and file reads improved by up to 150%.

Want to learn more?

For hands-on experience with Red Hat Gluster Storage, check out our test drive.

Red Hat OpenShift Container Platform 3.10 with Container-Native Storage 3.9

This post documents how to install Container-Native Storage 3.9 (CNS 3.9) with OpenShift Container Platform 3.10 (OCP 3.10). CNS provides persistent storage for OCP’s general-application consumption and for the registry.

CNS 3.9 installation with OCP 3.10 advanced installer

The deployment of CNS 3.9 can be accomplished using openshift-ansible playbooks and specific inventory file options. The first group of hosts, glusterfs, specifies a cluster for general-purpose application storage and will, by default, come with the StorageClass glusterfs-storage to enable dynamic provisioning. For high availability of storage, it’s very important to have four nodes in the general-purpose application cluster, glusterfs. The second group, glusterfs_registry, specifies a cluster that will host a single, statically deployed PersistentVolume for use exclusively by a hosted registry that can scale. With the options and values as currently configured, this cluster will not offer a StorageClass for file-based PersistentVolumes.

Following is an example of a partial inventory file with selected options for deploying CNS 3.9 for applications and the registry. Options that specify sizes (e.g., openshift_hosted_registry_storage_volume_size=10Gi) or node selectors (e.g., node-role.kubernetes.io/infra=true) should be adjusted for your particular deployment needs.

[OSEv3:children]
...
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
...      
# registry
openshift_hosted_registry_storage_kind=glusterfs       
openshift_hosted_registry_storage_volume_size=10Gi   
openshift_hosted_registry_selector="node-role.kubernetes.io/infra=true"

# Container image to use for glusterfs pods
openshift_storage_glusterfs_image="registry.access.redhat.com/rhgs3/rhgs-server-rhel7:v3.9"
# Container image to use for gluster-block-provisioner pod
openshift_storage_glusterfs_block_image="registry.access.redhat.com/rhgs3/rhgs-gluster-block-prov-rhel7:v3.9"
# Container image to use for heketi pods
openshift_storage_glusterfs_heketi_image="registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7:v3.9"

# CNS storage cluster for applications
openshift_storage_glusterfs_namespace=app-storage
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_storageclass_default=false
openshift_storage_glusterfs_block_deploy=false

# CNS storage cluster for OpenShift infrastructure
openshift_storage_glusterfs_registry_namespace=infra-storage  
openshift_storage_glusterfs_registry_storageclass=false       
openshift_storage_glusterfs_registry_block_deploy=false   
openshift_storage_glusterfs_registry_block_host_vol_create=false    
openshift_storage_glusterfs_registry_block_host_vol_size=100  
openshift_storage_glusterfs_registry_block_storageclass=false
openshift_storage_glusterfs_registry_block_storageclass_default=false   

...
[nodes]

ose-app-node01.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node02.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node03.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-app-node04.ocpgluster.com openshift_node_group_name="node-config-compute"
ose-infra-node01.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node02.ocpgluster.com openshift_node_group_name="node-config-infra"
ose-infra-node03.ocpgluster.com openshift_node_group_name="node-config-infra"

[glusterfs]
ose-app-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'   
ose-app-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
ose-app-node04.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'

[glusterfs_registry]
ose-infra-node01.ocpgluster.com glusterfs_zone=1 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node02.ocpgluster.com glusterfs_zone=2 glusterfs_devices='[ "/dev/xvdf" ]'
ose-infra-node03.ocpgluster.com glusterfs_zone=3 glusterfs_devices='[ "/dev/xvdf" ]'
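
With the inventory complete, CNS is deployed as part of the normal cluster installation. A hedged sketch, assuming the openshift-ansible RPM layout for OCP 3.10:

ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml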

CNS 3.9 uninstall

With this release, the uninstall.yml playbook can be used to remove all gluster and heketi resources. This might come in handy when there are errors in inventory file options that cause the gluster cluster to deploy incorrectly.

If you’re removing a CNS installation that is currently being used by any applications, you should remove those applications before removing CNS, because they will lose access to storage. This includes infrastructure applications like the registry.

If you have the registry using a glusterfs PersistentVolume, remove it with the following commands:

oc delete deploymentconfig docker-registry
oc delete pvc registry-claim
oc delete pv registry-volume
oc delete service glusterfs-registry-endpoints

If you’re running uninstall.yml because a deployment failed, run the playbook with the following variables to wipe the storage devices for both the glusterfs and glusterfs_registry clusters before trying the CNS installation again:

ansible-playbook -i <path_to_inventory_file> \
  -e "openshift_storage_glusterfs_wipe=True" \
  -e "openshift_storage_glusterfs_registry_wipe=true" \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml

CNS 3.9 post installation for applications and registry

You can add CNS clusters and resources to an existing OCP install using the following command. This same process can be used if CNS has been uninstalled due to errors.

ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml

After the new cluster(s) have been created and validated, you can deploy the registry using a newly created glusterfs ReadWriteMany volume. Run this playbook to create the registry resources:

ansible-playbook -i <path_to_inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-hosted/config.yml
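
A quick, hedged check that the hosted registry came up and is bound to the new glusterfs volume (the resource names below are the defaults referenced earlier in this post):

# oc get pods -n default -o wide | grep docker-registry
# oc get pvc registry-claim -n default
# oc get pv registry-volume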

Want to learn more?

For hands-on experience combining OpenShift and CNS, check out our test drive, a free, in-browser lab experience that walks you through using both. Also, watch this short video explaining why using CNS with OpenShift is the right choice for container storage.

Introducing OpenShift Container Storage: Meet the new boss, same as the old boss!

By Steve Bohac, Product Marketing

Today, we’re introducing Red Hat OpenShift Container Storage 3.10.

Is this product new to you? It surely is—that’s because with the announcement today of Red Hat OpenShift Container Platform 3.10, we’ve rebranded our container-native storage (CNS) offering to now be referred to as Red Hat OpenShift Container Storage. This is still the same product with the strong customer momentum we announced a few months ago during Red Hat Summit week.

Why the new name? “Red Hat OpenShift Container Storage” better reflects the product offering and its strong affinity with Red Hat OpenShift Container Platform. Not only does it install with OpenShift (via Red Hat Ansible), it’s developed, qualified, tested, and versioned coincident with OpenShift Container Platform releases. This product name best reflects that strong integration. Again, the product itself didn’t change in any way—all that’s changed is the product name.

Red Hat OpenShift Container Storage enables application portability and a consistent user experience across the hybrid cloud.

This new release, Red Hat OpenShift Container Storage 3.10, is the follow-on to Container-Native Storage 3.9 and introduces three important features for container-based storage with OpenShift: (1) arbiter volume support enabling high availability with efficient storage utilization and better performance, (2) enhanced storage monitoring and configuration visibility using the OpenShift Prometheus framework, and (3) block-backed persistent volumes (PVs) now supported for general application workloads in addition to supporting OCP infrastructure workloads.

If you haven’t already bookmarked our Red Hat Storage blog, now would be a great time! Over the coming weeks, we will be publishing deeper discussions on OpenShift Container Storage. In the meantime, though, for a more thorough understanding of OpenShift Container Storage, check out these recent technical blogs describing in depth the value of our approach to storage for containers:

Want to learn more?

For more information on OpenShift Container Storage, click here. Also, you can find the new Red Hat OpenShift Container Storage datasheet here.

For hands-on experience combining OpenShift and OpenShift Container Storage, check out our test drive, a free, in-browser lab experience that walks you through using both.

For more general information around storage for containers, check out our Container Storage for Dummies book.

Storing tables in Ceph object storage

Introduction

In one of our previous posts, Anatomy of the S3A filesystem client, we showed how Spark can interact with data stored in a Ceph object storage in the same fashion it would interact with Amazon S3. This is all well and good if you plan on exclusively writing applications in PySpark or Scala, but wouldn’t it be great to allow anyone who is familiar with SQL to interact with data stored in Ceph?

That’s what SparkSQL is for, and while Spark has the ability to infer schema, it’s a lot easier if the data is already described in a metadata service like the Hive Metastore. The Hive Metastore stores table schema information, statistics on tables and partitions, and generally aids the query planners of various SQL engines in constructing efficient query plans. So, regardless of whether you’re using good ol’ Hive, SparkSQL, Presto, or Impala, you’ll still be storing and retrieving metadata from a centralized store. Even if your organization has standardized on a single query engine, it still makes sense to have a centralized metadata service, because you’ll likely have distinct workload clusters that will want to share at least some data sets.

Architecture

The Hive Metastore can be housed in a local Apache Derby database for development and experimentation, but a more production-worthy approach would be to use a relational database like MySQL, MariaDB, or Postgres. In the public cloud, a best practice is to store the database tables on a distinct volume to get features like snapshots, and the ability to detach and reattach it to a different instance. In the private cloud, where OpenStack reigns supreme, most folks have turned to Ceph to provide block storage. To learn more about how to leverage Ceph block storage for database workloads, I suggest taking a look at the MySQL reference architecture we authored in conjunction with the open source database experts over at Percona.

While you can configure Hive, Spark, or Presto to interact directly with the MySQL database containing the Metastore, interacting with the Hive Server 2 Thrift service provides better concurrency and an improved security posture. Overall, the general idea is depicted in the following diagram:

Storing tabular data as objects

In a greenfield environment where all data will be stored in the object store, you could simply set hive.metastore.warehouse.dir to an S3A location a la s3a://hive/warehouse. If you haven’t already had a chance to read our Anatomy of the S3A filesystem client post, you should take a look if you’re interested in learning how to configure S3A to interact with a local Ceph cluster instead of Amazon S3. When an S3A location is used as the Metastore warehouse directory, all tables that are created will default to being stored in that particular bucket, under the warehouse pseudo directory. A better approach is to utilize external locations to map databases, tables, or simply partitions to different buckets – perhaps so they can be secured with distinct access controls or other bucket policy features. An example of including an external location specification during table creation might be:

create external table inventory
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
row format delimited fields terminated by '|'
location 's3a://tpc/inventory';

That’s it. When you interact with this inventory table, data will be read directly from the object store by way of the S3A filesystem client. One of the cool aspects of this approach is that the location is abstracted away: you can write queries that scan tables with different locations, or even scan a single table with multiple locations. In this fashion, you might have recent data partitions with a MySQL external location, and data older than the current week in partitions with external locations that point to object storage. Cool stuff!

Serialization, partitions, and statistics

We all want to be able to analyze data sets quickly, and there are a number of tools available to help realize this goal. The first is using different serialization formats. In my discussions with customers, the two most common serialization formats are the columnar formats ORC and Parquet. The gist of these formats is that instead of requiring complete scans of entire files, columns of data are separated into stripes, and metadata describing each column’s stripe offsets is stored in a file header or footer. When a query is planned, requests can read in only the stripes that are relevant to that particular query. For more on different serialization formats and their relative performance, I highly suggest this analysis by our friends over at Silicon Valley Data Science. We have seen great performance with both Parquet and ORC when used in conjunction with a Ceph object store. Parquet tends to be slightly faster, while ORC tends to use slightly less disk space. This small delta might simply be the result of these formats using different compression algorithms by default (snappy vs. ZLIB).

Speaking of compression, it’s really easy to think you’re using it when you are, in fact, not. Make sure to verify that your tables are actually being compressed. I suggest including the compression specification in table creation statements instead of hoping the engine you are using has the defaults configured the way you want.
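
As an illustration of pinning compression at table-creation time (the table name, location, and Hive endpoint below are assumptions; ORC is shown because its orc.compress table property is widely supported):

# beeline -u jdbc:hive2://hive-server:10000 -e "
create external table inventory_orc (
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
stored as orc
location 's3a://tpc/inventory_orc'
tblproperties ('orc.compress'='SNAPPY');"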

In addition to serialization formats, it’s important to consider how your tables are partitioned and how many files you have per partition. All S3 API calls are RESTful, which means they are heavier weight than HDFS RPC calls. Having fewer, larger partitions, with fewer files per partition, will definitely translate into higher throughput and reduced query latency. If you already have tables with loads of partitions, and many files per partition, it might be worthwhile to consolidate them into larger partitions with fewer files each as you move them into object storage.

With data serialized and partitioned intelligently, queries can be much more efficient, but there is a third way you can help the query planner of your execution engine do its job better – table and column statistics. Table statistics can be collected with ANALYZE TABLE table COMPUTE STATISTICS statements, which count the number of rows for a particular table and their partitions. The row counts are stored in the Metastore, and can be used by other engines that interrogate the Metastore during query planning.
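
A short, hedged example of gathering those statistics for the inventory table created earlier (the Hive endpoint is an assumption; the second statement additionally collects column-level statistics):

# beeline -u jdbc:hive2://hive-server:10000 \
    -e "ANALYZE TABLE inventory COMPUTE STATISTICS;" \
    -e "ANALYZE TABLE inventory COMPUTE STATISTICS FOR COLUMNS;"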

To the cloud!

Getting cloudy

Many modern enterprises have initiatives underway to modernize their IT infrastructures, and today that means moving workloads to cloud environments, whether they be public or private. On the surface, moving data platforms to a cloud environment shouldn’t be a difficult undertaking: Leverage cloud APIs to provision instances, and use those instances like their bare-metal brethren. For popular analytics workloads, this means running storage services in those instances that are specific to analytics and continuing a siloed approach to storage. This is the equivalent of lift and shift for data-intensive apps, a shortcut approach undertaken by some organizations when migrating an enterprise app to a cloud when they don’t have the luxury of adopting a more contemporary application architecture.

The following data platform principles pertain to moving legacy data platforms to either a public cloud or a private cloud. The private cloud storage platform discussed is Ceph, of course, a popular open-source storage platform used in building private clouds for a variety of data-intensive workloads, including MySQL DBaaS and Spark/Hadoop analytics-as-a-service.

Elasticity

Elasticity is one of the key benefits of cloud infrastructure, and running storage services inside your instances definitely cramps your ability to take advantage of it. For example, let’s say you have an analytics cluster consisting of 100 instances and the resident HDFS cluster has a utilization of 80 percent. Even though you could terminate 10 of those instances and still have sufficient storage space, you would need to rebalance the data, which is often undesirable. You will also be out of luck if months later you also realize that you’re only using half the compute resources of that cluster. If the infrastructure teams make a new instance flavor available, say with fancy GPUs for your hungry machine-learning applications, it’ll be much harder to start consuming them if it entails the migration of storage services.

This is why companies like Netflix decided to use object storage as the source of truth for the analytics applications, as detailed in my previous post What about locality? It enables them to expand, and contract, workload-specific clusters as dictated by their resource requirements. Need a quick cluster with lots of nodes to chew through a one-time ETL? No problem. Need transient data labs for data scientists by day only to relinquish those resources for use for reporting after hours? Easy peasy.

Data infrastructure

Before departing on the journey to cloudify an organization’s data platform architecture, an important first step is assessing the capabilities of the cloud infrastructure, public or private, you intend to consume to make sure it provides the features that are most important to data-intensive applications. World-class data infrastructure provides their tenants with a number of fundamental building blocks that lend power and flexibility to the applications that will sit atop them.

Persistent block storage

Not all data is big, and it’s important to provide persistent block storage for data sets that are well served by database workhorses like MySQL and Postgres. An obvious example is the database used by the Hive metastore. With all these workload clusters being provisioned and deprovisioned, it’s often desirable to have them interact with a common metadata service. For more details about how persistent block storage fits into the dizzying array of architectural decisions facing database administrators, I suggest a read of our MySQL reference architecture.

I also suggest infrastructure teams learn how to collapse persistent block storage performance and spatial capacity into a single dimension, all while providing deterministic performance. For this, I recommend watching the session I gave with several of my colleagues at Red Hat Summit last year.

Local SSD

Sometimes we need to access data fast, really fast, and the best way to realize that is with locally attached SSDs. Most clouds make this possible with special instance flavors, modeled after the i3 instances provided by Amazon EC2. In OpenStack, the equivalent would be instances where the hypervisor uses PCIe passthrough for NVMe devices. These devices are best leveraged by applications that handle their own replication and fault tolerance, good examples being Cassandra and Clustered Elasticsearch. Fast local devices are also useful for scratch space for intermediate shuffle data that doesn’t fit in memory, or even S3A buffer files.

GPUs

Machine learning frameworks like TensorFlow, Torch, and Caffe can all benefit from GPU acceleration. With the burgeoning popularity of these frameworks, it’s important that infrastructure cater to them by providing instance flavors infused with GPU goodness. In OpenStack, this can be accomplished by passing through entire GPU devices in a similar fashion detailed in the Local SSD section, or by using GPU virtualization technologies like Intel GVT-g or NVIDIA GRID vGPU. OpenStack developers have been diligently integrating these technologies, and I’d recommend operations folks understand how to deploy them once these features mature.

Object storage

In both public and private clouds, deploying multiple analytics clusters backed by object storage is becoming increasingly popular. In the private cloud, a number of things are important to prepare a Ceph object store for data intensive applications.

Bucket sharding

Bucket sharding was enabled by default with the advent of Red Hat Ceph Storage 3.0. This feature spreads a bucket’s indexes across multiple shards with corresponding RADOS objects. This is great for increasing the write throughput of a particular bucket, but it comes at the expense of LIST operations, because index entries are interleaved and must be gathered from the shards before replying to a LIST request. Today, the S3A filesystem client performs many LIST requests, so it is advantageous to disable bucket sharding by setting rgw_override_bucket_index_max_shards to 1.
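
A hedged sketch of what that looks like on the gateway hosts (the RGW client section name is environment specific, and the gateway must be restarted to pick up the change):

# ceph.conf on the RGW host
[client.rgw.gateway-1]
rgw_override_bucket_index_max_shards = 1

# systemctl restart ceph-radosgw@rgw.gateway-1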

Bucket indexes on SSD

The Ceph object gateway uses distinct pools for objects and indexes, and as such those pools can be mapped to different device classes. Due to the S3A filesystem client’s heavy usage of LIST operations, it’s highly recommended that index pools be mapped to OSDs sitting on SSDs for fast access. In many cases, this can be achieved even on existing hardware by using the remaining space on devices that house OSD journals.
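
On a Luminous-based release such as Red Hat Ceph Storage 3, one hedged way to do this is with CRUSH device classes: create an SSD-only replicated rule and assign it to the index pool of the default zone (the rule name is an assumption):

# ceph osd crush rule create-replicated rgw-index-ssd default host ssd
# ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd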

Erasure coding

Due to the immense storage requirements of data platforms, erasure coding the data section of objects is a no-brainer. Compared to the 3x replication that’s common with HDFS, erasure coding reduces the required storage by 50%. When tens of petabytes are involved, that amounts to big savings! Most folks will probably end up using either 4+2 or 8+3 and spreading chunks across hosts using ruleset-failure-domain=host.
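
A hedged example of creating such a profile and a matching RGW data pool (the pool name assumes the default zone, the PG counts must be sized for your cluster, and on Luminous-based releases the failure-domain option is spelled crush-failure-domain):

# ceph osd erasure-code-profile set rgw-ec-4-2 k=4 m=2 crush-failure-domain=host
# ceph osd pool create default.rgw.buckets.data 512 512 erasure rgw-ec-4-2
# ceph osd pool application enable default.rgw.buckets.data rgw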