Storage for RHV and OCP: Two Glusters on one platform

Architecture is an interesting discipline. There are whitepapers and best practices and reference architectures to offer pristine views of what your perfect deployment should look like. And then there are budgets and timelines and business requirements to derail all of that. It’s what makes this job so interesting and challenging—hacking together the best pieces of disparate and often seemingly unrelated systems to meet goals driven by six leaders whose bonuses are met by completely different metrics.

A recent project has involved combining OpenShift Container Platform (OCP), Red Hat Virtualization (RHV), and Red Hat Gluster Storage (Gluster) into a unified system with common lifecycle operations, minimized management points, and the lowest overall footprint in terms of both capital cost and TCO. The primary storage challenge here is in creating a Gluster environment to support both RHV and its VMs as well as OCP container persistent volume requirements.

Our architectural goals include:

  • Purchase a single flexible hardware platform to serve all the storage needs
  • Segregate Gluster for RHV and Gluster for OCP into separate pools for resource allocation and to avoid possible administration snafus (such as we experienced in early testing)
  • Maintain a single-point and single-method of management—one Heketi server to rule them all
  • Containerize as much as possible to keep lifecycle maintenance atomic

Our early version of the architecture had Gluster running as container-native storage (CNS) for OCP on top of RHV while also serving storage to RHV, but this proved to introduce a chicken-and-egg problem where a single failure (such as an etcd crash) could cause a cascading outage. So our redesign involved splitting Gluster off from OCP as a stand-alone system while still being a unified storage provider and leveraging container atomicity.

The approach we wanted involved containerized Gluster running on bare-metal container hosts. Fundamentally, this is actually pretty straightforward today with pre-build Gluster containers available from the Red Hat registry. What complicated this was our desire to run two separate containerized Gluster pools on the same hardware nodes.

Disclaimer

There’s a pretty good chance that this architecture is not explicitly supported by Red Hat. While all the components we use here are definitely supported, this particular combination is untested by our engineering, QE, and performance teams. Don’t consider anything here a recommendation for how you should run your environment, only an academic study of a possible approach to solving an interesting challenge. If you have any questions, please reach out to your Red Hat sales and support teams.

The platform

We initially wanted to build this on top of Red Hat Enterprise Linux Atomic Host, but our lab environment wasn’t setup to provision this build on our systems, so we had to go forward with RHEL plus the docker packages. For a production build, we would return to using Atomic.

Networking

Gluster containers are usually configured with host networking because they need to communicate freely with each other and need to serve storage out to other systems and containers. However, with host networking, the Gluster ports are bound to all interfaces, so it is not possible to run two Gluster containers in this mode due to port conflicts. To solve this, the networks for each Gluster pool had to be segregated.

First, a VLAN sub-interface was created on each Gluster node for the storage network interface and using VLAN ID 199. There are ifcfg files to make these persistent. So each node includes a 192.168.99.0/24 IP on the primary interface and a 192.168.199.0/24 IP on a VLAN sub-interface. The Switch ports for the storage network interfaces have been configured for the tagged VLAN ID 199. The 802.1q kernel module (for VLANs) was set to load at boot time on each node with a /etc/modules-load.d/8021q.conf file.

Containerized Gluster

Networks

Each Gluster container needs to exist on its own interface and subnet. So leveraging the system-level network stuff done above, the two interfaces were each attached to a docker macvlan network on each node.

docker network create -d macvlan --subnet=192.168.99.0/24 \

-o parent=eth1 gluster-rhv-net
docker network create -d macvlan --subnet=192.168.199.0/24 \

-o parent=eth1.199 gluster-ocp-net

Containers

The containers were pulled down from the Red Hat registry.

docker pull registry.access.redhat.com/rhgs3/rhgs-server-rhel7
docker pull registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7

The Gluster containers need to be privileged in order to access the /dev/sdX block devices. They also need a number of local persistent volume stores in order to ensure they start up properly each time.

The container fstab file needs a persistent mount. So first we should touch these files, otherwise the gluster-startup command in the container will fail.

touch /var/lib/heketi-{rhv,ocp}/fstab

Then we can run the containers.

docker run -d --privileged=true --net=gluster-rhv-net \

--ip=192.168.99.28  --name=gluster-rhv-1 -v /run \

-v /home/gluster-rhv-1-root:/root:z \

-v /etc/glusterfs-rhv:/etc/glusterfs:z \

-v /var/lib/glusterd-rhv:/var/lib/glusterd:z \

-v /var/log/glusterfs-rhv:/var/log/glusterfs:z \

-v /var/lib/heketi-rhv:/var/lib/heketi:z \

-v /sys/fs/cgroup:/sys/fs/cgroup:ro \

-v /dev:/dev rhgs3/rhgs-server-rhel7
docker run -d --privileged=true --net=gluster-ocp-net \

--ip=192.168.199.28 --name=gluster-ocp-1 -v /run \

-v /home/gluster-ocp-1-root:/root:z \

-v /etc/glusterfs-ocp:/etc/glusterfs:z \

-v /var/lib/glusterd-ocp:/var/lib/glusterd:z \

-v /var/log/glusterfs-ocp:/var/log/glusterfs:z \

-v /var/lib/heketi-ocp:/var/lib/heketi:z \

-v /sys/fs/cgroup:/sys/fs/cgroup:ro \

-v /dev:/dev rhgs3/rhgs-server-rhel7

Block device assignments

Running the containers in privileged mode allows them to access all system block devices. For our particular architectural needs, we intend to use from each node only one SSD for the gluster-rhv pool and the remaining five SSDs for the gluster-ocp pool.

 Gluster Pool  Block Devices
 gluster-rhv  sdb
 gluster-ocp  sdc, sdd, sde, sdf, sdg

Heketi

Config

The persistent Heketi config is being stored in the /etc/heketi directory on one of the nodes (we’ll call it node1). First, an ssh keypair is created and placed there.

ssh-keygen -f /etc/heketi/heketi_key -t rsa -N ''

Next, the heketi.json file is created. Right now, no auth is being used — obviously don’t do this in production. Note the ssh port is 2222, which is what the Gluster containers are configured to listen on.

{
  "_port_comment": "Heketi Server Port Number",
  "port": "8080",

  "_use_auth": "Enable JWT authorization. Please enable for deployment",
  "use_auth": false,

  "_jwt": "Private keys for access",
  "jwt": {
    "_admin": "Admin has access to all APIs",
    "admin": {
      "key": "My Secret"
    },
    "_user": "User only has access to /volumes endpoint",
    "user": {
      "key": "My Secret"
    }
  },

  "_glusterfs_comment": "GlusterFS Configuration",
  "glusterfs": {
    "_executor_comment": [
      "Execute plugin. Possible choices: mock, ssh",
      "mock: This setting is used for testing and development.",
      "      It will not send commands to any node.",
      "ssh:  This setting will notify Heketi to ssh to the nodes.",
      "      It will need the values in sshexec to be configured.",
      "kubernetes: Communicate with GlusterFS containers over",
      "            Kubernetes exec api."
    ],
    "executor": "ssh",

    "_sshexec_comment": "SSH username and private key file information",
    "sshexec": {
      "keyfile": "/etc/heketi/heketi_key",
      "user": "root",
      "port": "2222"
    },

    "_db_comment": "Database file name",
    "db": "/var/lib/heketi/heketi.db",

    "_loglevel_comment": [
      "Set log level. Choices are:",
      "  none, critical, error, warning, info, debug",
      "Default is warning"
    ],
    "loglevel" : "debug"
  }
}

SSH access

The Heketi server needs passwordless SSH access to all Gluster containers on port 2222. The public key generated above needs to be added to the authorized_keys for all of the Gluster containers. Note that we have a local persistent volume (PV) for each Gluster container’s /root directory, so this authorized_key entry was simply added to each one of those.

cat /etc/heketi/heketi_key.pub >> \

/home/gluster-rhv-1-root/.ssh/authorized_keys

NOTE: This needs to be done for each of the root home directories for each Gluster container

Container

The single Heketi container will run on node1. It needs access to both of the subnets, so the best thing to do is run the container in host networking mode. It also needs a few persistent volumes.

docker run -d --net=host --name=gluster-heketi \

-v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z \

rhgs3/rhgs-volmanager-rhel7

Network

Since we are running heketi-cli on the same node that we are running the Heketi container, there is a security issue we have to work through. By default, the container host cannot directly access the local container via the IP assigned to its macvlan network interface. So on the container host node1 we need to create local macvlan interfaces for each of the subnets. Use this at runtime and the /etc/rc.d/rc.local file:

/usr/sbin/ip link add macvlan0 link eth1 type macvlan mode bridge
/usr/sbin/ip addr add 192.168.99.228/24 dev macvlan0
/usr/sbin/ifconfig macvlan0 up

/usr/sbin/ip link add macvlan1 link eth1.199 type macvlan mode bridge
/usr/sbin/ip addr add 192.168.199.228/24 dev macvlan1
/usr/sbin/ifconfig macvlan1 up

The rc.local file in RHEL is for legacy support, so it has to be made executable and its systemd service has to be enabled.

chmod 755 /etc/rc.d/rc.local
systemctl enable rc-local.service

Heketi CLI

The heketi-cli needs to run $somewhere. For simplicity, the RPM is installed on node1. With the container running with networking in host mode, heketi is listening on localhost port 8080. Export the environment variable in order to be able to run heketi-cli commands.

export HEKETI_CLI_SERVER=http://localhost:8080

Setting up the Heketi clusters

A JSON file is populated at /root/heketi-rhv-plus-ocp-topology.json on node1. This file defines two separate Heketi clusters with their respective Gluster nodes (containers) and block devices.

{
    "clusters": [
        {
            "nodes": [
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.28"
                            ],
                            "storage": [
                                "192.168.99.28"
                            ]
                        },
                        "zone": 1
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.29"
                            ],
                            "storage": [
                                "192.168.99.29"
                            ]
                        },
                        "zone": 2
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.30"
                            ],
                            "storage": [
                                "192.168.99.30"
                            ]
                        },
                        "zone": 3
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                }
            ]
        },

        {
            "nodes": [
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.28"
                            ],
                            "storage": [
                                "192.168.199.28"
                            ]
                        },
                        "zone": 1
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.29"
                            ],
                            "storage": [
                                "192.168.199.29"
                            ]
                        },
                        "zone": 2
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.30"
                            ],
                            "storage": [
                                "192.168.199.30"
                            ]
                        },
                        "zone": 3
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                }
            ]
        }
    ]
}

This file is passed (once) to Heketi to setup the two clusters.

heketi-cli topology load --json=heketi-rhv-plus-ocp-topology.json

It’s important to note the two different clusters. It’s not (AFAIK) possible to “name” the clusters, so we have to reference them by their UUIDs. The Gluster volumes for RHV will be created on one cluster, and those orchestrated for OCP PVs will be created on a different cluster.

RHV Gluster volumes

For the purposes of RHV, two volumes were requested—one for the Hosted Engine and one for the VM storage. These were created via heketi-cli. Note the cluster ID passed to the commands.

heketi-cli volume create --size 100 --name rhv-hosted-engine \

--clusters ae2a309d02781816adfed567693221a9
heketi-cli volume create --size 1024 --name rhv-virtual-machines \

--clusters ae2a309d02781816adfed567693221a9

These can be mounted to the RHV nodes via the 192.168.99.0/24 subnet using the Gluster native client. Example fstab entries:

192.168.99.28:rhv-hosted-engine      /100g   glusterfs       backupvolfile-server=192.168.99.29:192.168.99.30 0 0
192.168.99.28:rhv-virtual-machines      /1t   glusterfs       backupvolfile-server=192.168.99.29:192.168.99.30 0 0

OCP PV Gluster volumes

Our OCP pods are attached to the 192.168.199.0/24 subnet to communicate with the storage. First on node1 the Heketi API port (8080) needs to be opened in the firewall.

firewall-cmd --add-port 8080/tcp
firewall-cmd --add-port 8080/tcp --permanent

Then the storage class for OCP is defined with the below YAML. Note that we aren’t currently doing any authentication (but obviously we should). You see here that we explicitly define the Heketi cluster ID for this class in order to ensure that all volumes for PVCs are created only on the Gluster pool we have identified for OCP use.

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
 name: gluster-dyn
provisioner: kubernetes.io/glusterfs
parameters:
 resturl: "http://192.168.199.128:8080"
 restauthenabled: "false"
 clusterid: "74edade536c80f14486edfbabd204151"

Then the storage class is added to OCP on the master.

oc create -f glusterfs-storageclass.yaml

From this point, PVCs (persistent volume claims) made against this storage class will interface with Heketi to dynamically provision Gluster volumes to match the claim.

Miscellaneous

Auto-start containers

Docker container systemd init scripts are tricky. I’ve found that every example on the internet is either wrong, outdated, or uses an approach I don’t like.

Below is an example systemd service file for the Heketi container, which is simple and works the way we expect it to with the docker run command in the ExecStart (/etc/systemd/system/docker-container-gluster-heketi.service). NOTE: Do not daemonize (-d) the docker run command in the init script. Also, the SuccessExitStatus is important here.

[Unit]
Description=Gluster Heketi Container
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=60
Restart=on-abnormal
SuccessExitStatus=0 137
ExecStartPre=-/usr/bin/docker stop gluster-heketi
ExecStartPre=-/usr/bin/docker rm gluster-heketi
ExecStart=/usr/bin/docker run --net=host --name=gluster-heketi -v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z rhgs3/rhgs-volmanager-rhel7
ExecStop=/usr/bin/docker stop gluster-heketi

[Install]
WantedBy=multi-user.target

Reload the systemd daemon:

systemctl daemon-reload

Enable and start the service

systemctl enable docker-container-gluster-heketi

systemctl start docker-container-gluster-heketi

Known issues and TODOs

  • Security needs to be taken into account. We’ll set up appropriate key-based authentication and JWT for Heketi. We’d also like to use role-based auth. Hopefully we’ll cover this in a future blog post.
  • Likely $other_things I haven’t realized yet, or better ways of approaching this. I’d love to hear your comments.

GlusterFS among the elite!

Score one more for Red Hat Storage! In case you didn’t hear, GlusterFS is the proud recipient of a 2015 Bossie Award, InfoWorld’s top picks in open source datacenter and cloud software. Highly influential worldwide among technology and business decision makers alike, the IDC publication selected GlusterFS as one of its top picks for 2015.

Continue reading “GlusterFS among the elite!”

Red Hat Storage gets a Facelift on Redhat.com

Screenshot

If you haven’t yet noticed by now, Red Hat just relaunched our website. And in particular, we’re super excited about the launch of the new Red Hat Storage web pages. The new Red Hat Storage product pages provides a single reference source to get the latest on storage including: Red Hat Storage Server, Inktank by Red Hat, Gluster, Ceph and Big Data solutions. We hope you will find the site to be much more informative, insightful and easier to navigate. And, since we’re the leader of open source technology, we believe in putting our money where our mouth is…the site will be powered by open source technologies such as Red Hat Enterprise Linux, Red Hat Satellite, Drupal and OpenShift, across many aspects of the site.

Continue reading “Red Hat Storage gets a Facelift on Redhat.com”

What Can a Paper Shredder Teach Us About Big Data?

by Irshad Raihan, Red Hat Storage – Big Data Product Marketing

The trusty paper shredder in my home office died last week. I’m in the market for a new one. Years ago, when I purchased “Shreddy” (of course, it had a name) after a brief conversation with a random store clerk, choices were few and information scarce. In fact, paper shredders weren’t really considered standard personal office equipment as they are today. Most good shredders were built for offices not homes. Back in the market more than a decade later, it’s clear that the search for a new shredder is going to be trickier than I had imagined.

A paper shredder is a lot like big data.

Continue reading “What Can a Paper Shredder Teach Us About Big Data?”

The Data Life Cycle Has Changed. Are You Ready?

by Irshad Raihan, Red Hat Storage – Big Data Product Marketing

Digital data has been around for centuries in one form or the other. Commercial tabulating machines have been available since the late 1800’s when they were used for accounting, inventory and population census. Why then do we label today as the Big Data age? What dramatically changed in the last 10-15 years that has the entire IT industry chomping at the bit?

More data? Certainly. But that’s only the tip of the iceberg. There are two big drivers that have contributed to the classic V’s (Volume, Variety, Velocity) of Big Data. The first is the commoditization of computing hardware – servers, storage, sensors, cell phones – basically anything that runs on silicon. The second is the explosion in the number of data authors – both machines and humans.

Continue reading “The Data Life Cycle Has Changed. Are You Ready?”

Red Hat Summit 2014, day zero recap

Although April 14 was the first day of the full Red Hat Summit, the event really kicked off April 13 with a reception in the partner pavilion (think expo hall) with…

2014-04-14 17.49.01

Continue reading “Red Hat Summit 2014, day zero recap”

Summit Spotlight: Don’t Miss These Storage Tracks and Sessions

Summit Spotlight: Don’t Miss These Storage Tracks and Sessions

Red Hat Summit kicks off this year from April 14-17th in San Francisco, CA. We’ve organized more than 150 breakout sessions, each with a unique solution-focus, but attendees of all experience levels will see a variety of products, demos, customer success stories, and more.

Continue reading “Summit Spotlight: Don’t Miss These Storage Tracks and Sessions”

Red Hat’s Approach with Open, Software-Defined Storage (A Four Part Series)

Today’s Post: Red Hat Storage Server In Action

By Steve Bohac, Red Hat Storage Product Marketing

Several weeks ago, we posted the blog “Open Software Defined Storage – Don’t Get Fooled By The False ‘Open’ and Get Locked-In Again”. Today’s entry is the conclusion of this four part mini-series.

We understand how difficult it is to optimize your storage for innovation and growth, and our goal is to help enterprises on their journey to convert their data centers from cost centers into revenue-generators. Red Hat Storage Server has helped businesses of all varieties achieve their objectives. Here’s how open, software-defined storage has helped a few organizations get to the next level:

Continue reading “Red Hat’s Approach with Open, Software-Defined Storage (A Four Part Series)”

Open Software Defined Storage – Don’t get fooled by the false “Open” and get locked-in once again

By Scott Clinton @ Redhat

Open software-defined storage is transforming the way organizations tackle their data management challenges. More and more of companies are seeing how it can open up the opportunity to significantly reduce costs, efficiently contend with the exploding data landscape, and support today’s increasingly software-defined and hybrid datacenter all while discovering the new roles and value software-based storage platforms can bring, now that open software-defined storage is enabling organization to take data, as is, out of the box.

Continue reading “Open Software Defined Storage – Don’t get fooled by the false “Open” and get locked-in once again”

The Future of GlusterFS – Slides

Yesterday I had a great time discussing the future of the GlusterFS project with a few hundred of our closest friends. I’ve posted the slides here for your viewing pleasure. I’ll post the recording here as well when it’s available. Enjoy!

Update: You can listen to the webcast recording here or download the MP4 file here.

GlusterFS HOWTO: NFS Performance with FUSE Client Redundancy

There’s another great post on the community Q&A sitethis one about NFS performance, excessive load times for PHP-based web sites, and the Fuse client. This was written by Joe Julian, our resident IRC chairman and all-around Gluster soup stirrer:

There’s been a lot of discussion about the latency due to self-heal checking with the fuse client, and how with most php based web sites the sheer volume of files that must be opened for each page load causes page times to be excessive. Common workarounds have been to use web server caches, or php caches to avoid disk reads wherever possible. When this cannot be done, the recommendation has been to use NFS as the kernel caches reduce the disk reads. The problem, of course, with NFS mounts is redundancy failure. ucarp can mitigate the problem, but there are still problems with lost tcp connections causing errors.

Another solution is to move the NFS service to the client. NFS is provided in GlusterFS by using the same client that’s used for fuse, and adding an nfs server translator instead of fuse. This lends itself very well to being moved to the client and having the client provide it’s own NFS service.

To take advantage of this technique, there are two ways of doing it.

Read the rest of Joe’s article, in full. If you’ve had performance issues trying to use GlusterFS for web site file storage, you might find this useful.

Linux Kernel Tuning for GlusterFS

One of our support engineer gurus, Harsha, has published a very detailed post on tuning parameters for the Linux kernel that may impact your GlusterFS performance. Here’s his introduction:

Every now and then, questions come up here internally and with many enthusiasts on what Gluster has to say about kernel tuning, if anything.

The rarity of kernel tuning is on account of the Linux kernel doing a pretty good job on most workloads. But there is a flip side to this design. The Linux kernel historically has eagerly eaten up a lot of RAM, provided there is some, or driven towards caching as the primary way to improve performance.

For most cases, this is fine, but as the amount of workload increases over time and clustered load is thrown upon the servers, this turns out to be troublesome, leading to catastrophic failures of jobs etc.

Having had a fair bit of experience looking at large memory systems with heavily loaded regressions, be it CAD, EDA or similar tools, we’ve sometimes encountered stability problems with Gluster. We had to carefully analyse the memory footprint and amount of disk wait times over days. This gave us a rather remarkable story of disk trashing, huge iowaits, kernel oops, disk hangs etc.

This article is the result of my many experiences with tuning options which were performed on many sites. The tuning not only helped with overall responsiveness, but it dramatically stabilized the cluster overall.

When it comes to memory tuning the journey starts with the ‘VM’ subsystem which has a bizarre number of options, which can cause a lot of confusion.

It’s a great article. Read the whole post.