Introduction

In one of our previous posts, Anatomy of the S3A filesystem client, we showed how Spark can interact with data stored in Ceph object storage in the same fashion it would interact with Amazon S3. This is all well and good if you plan on exclusively writing applications in PySpark or Scala, but wouldn’t it be great to allow anyone who is familiar with SQL to interact with data stored in Ceph?

That’s what SparkSQL is for, and while Spark has the ability to infer schema, it’s a lot easier if the data is already described in a metadata service like the Hive Metastore. The Hive Metastore stores table schema information and statistics on tables and partitions, and generally aids the query planners of various SQL engines in constructing efficient query plans. So, regardless of whether you’re using good ol’ Hive, SparkSQL, Presto, or Impala, you’ll still be storing and retrieving metadata from a centralized store. Even if your organization has standardized on a single query engine, it still makes sense to have a centralized metadata service, because you’ll likely have distinct workload clusters that will want to share at least some data sets.
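
To make that concrete, here is a minimal sketch of how any of these engines might interrogate the shared Metastore. The inventory table referenced here is the hypothetical example created later in this post:

-- list the databases and tables registered in the shared Metastore
show databases;
show tables;

-- inspect the schema, storage location, and statistics recorded for a table
-- (inventory is the example table created later in this post)
describe formatted inventory;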

Architecture

The Hive Metastore can be housed in a local Apache Derby database for development and experimentation, but a more production-worthy approach would be to use a relational database like MySQL, MariaDB, or Postgres. In the public cloud, a best practice is to store the database tables on a distinct volume to get features like snapshots and the ability to detach the volume and reattach it to a different instance. In the private cloud, where OpenStack reigns supreme, most folks have turned to Ceph to provide block storage. To learn more about how to leverage Ceph block storage for database workloads, I suggest taking a look at the MySQL reference architecture we authored in conjunction with the open source database experts over at Percona.
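
As a rough sketch, bootstrapping a MySQL-backed Metastore usually amounts to creating a backing database and a dedicated user for the Hive services to connect with; the database name, user, and password below are placeholders, and the Metastore schema itself is typically initialized afterwards with Hive’s schematool:

-- hypothetical MySQL bootstrap for the Metastore backing database
create database metastore;
create user 'hive'@'%' identified by 'changeme';
grant all privileges on metastore.* to 'hive'@'%';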

While you can configure Hive, Spark, or Presto to interact directly with the MySQL database containing the Metastore, interacting with the HiveServer2 Thrift service provides better concurrency and an improved security posture. The general idea is depicted in the following diagram:

Storing tabular data as objects

In a greenfield environment where all data will be stored in the object store, you could simply set hive.metastore.warehouse.dir to an S3A location, a la s3a://hive/warehouse. If you haven’t already had a chance to read our Anatomy of the S3A filesystem client post, you should take a look if you’re interested in learning how to configure S3A to interact with a local Ceph cluster instead of Amazon S3. When an S3A location is used as the Metastore warehouse directory, all tables that are created will default to being stored in that particular bucket, under the warehouse pseudo-directory. A better approach is to utilize external locations to map databases, tables, or simply partitions to different buckets - perhaps so they can be secured with distinct access controls or other bucket policy features. An example of including an external location specification during table creation might be:

create external table inventory
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
row format delimited fields terminated by '|'
location 's3a://tpc/inventory';

That’s it. When you interact with this inventory table, data will be read directly from the object store by way of the S3A filesystem client. One of the cool aspects of this approach is that the location is abstracted away: you can write queries that scan tables with different locations, or even scan a single table whose partitions span multiple locations. In this fashion, you might have recent data partitions with an HDFS external location, and data older than the current week in partitions with external locations that point to object storage. Cool stuff!
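
As a sketch of what mapping databases and individual partitions to their own buckets might look like, the statements below use hypothetical bucket, table, and partition names:

-- hypothetical database mapped to its own bucket
create database sales location 's3a://sales/warehouse';

-- hypothetical partitioned table whose partitions live in different buckets
create external table events
(
   event_id bigint,
   payload string
)
partitioned by (event_date string)
row format delimited fields terminated by '|'
location 's3a://events/current';

-- an older partition mapped to a separate archive bucket
alter table events add partition (event_date='2017-01-01')
location 's3a://events-archive/2017-01-01';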

Serialization, partitions, and statistics

We all want to be able to analyze data sets quickly, and there are a number of tools available to help realize this goal. The first is using different serialization formats. In my discussions with customers, the two most common serialization formats are the columnar formats ORC and Parquet. The gist of these formats is that instead of requiring complete scans of entire files, columns of data are separated into stripes, and metadata describing each column's stripe offsets is stored in a file header or footer. When a query is planned, requests can read in only the stripes that are relevant to that particular query. For more on different serialization formats, and their relative performance, I highly suggest this analysis by our friends over at Silicon Valley Data Science. We have seen great performance with both Parquet and ORC when used in conjunction with a Ceph object store. Parquet tends to be slightly faster, while ORC tends to use slightly less disk space. This small delta might simply be the result of these formats using different compression algorithms by default (Snappy vs. ZLIB).

Speaking of compression, it’s really easy to think you’re using it when you are in fact not. Make sure to verify that your tables are actually being compressed. I suggest including the compression specification in table creation statements instead of hoping the engine you are using has the defaults configured the way you want.
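
A hedged sketch of making that compression choice explicit at table creation time might look like the following; the table names are hypothetical, and the exact property names can vary between engines and versions:

-- hypothetical ORC copy of the inventory table with explicit ZLIB compression
create external table inventory_orc
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
stored as orc
location 's3a://tpc/inventory_orc'
tblproperties ('orc.compress'='ZLIB');

-- hypothetical Parquet copy with explicit Snappy compression
create external table inventory_parquet
(
   inv_date_sk bigint,
   inv_item_sk bigint,
   inv_warehouse_sk bigint,
   inv_quantity_on_hand int
)
stored as parquet
location 's3a://tpc/inventory_parquet'
tblproperties ('parquet.compression'='SNAPPY');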

In addition to serialization formats, it’s important to consider how your tables are partitioned and how many files you have per partition. All S3 API calls are RESTful, which means they are heavier weight than HDFS RPC calls. Having fewer, larger partitions, with fewer files per partition, will definitely translate into higher throughput and reduced query latency. If you already have tables with loads of partitions, and many files per partition, it might be worthwhile to consolidate them into larger partitions with fewer files each as you move them into object storage.
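
One way to sketch that consolidation in Hive-style SQL, reusing the hypothetical events table from earlier, is to rewrite daily partitions into coarser monthly ones, which reduces the partition count and typically the file count per partition as well:

-- hypothetical target table partitioned by month instead of day
create external table events_by_month
(
   event_id bigint,
   payload string
)
partitioned by (event_month string)
stored as parquet
location 's3a://events/by-month';

-- rewrite the daily data into the coarser monthly partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table events_by_month partition (event_month)
select event_id, payload, substr(event_date, 1, 7) as event_month
from events;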

With data serialized and partitioned intelligently, queries can be much more efficient, but there is a third way you can help the query planner of your execution engine do its job better - table and column statistics. Table statistics can be collected with ANALYZE TABLE table COMPUTE STATISTICS statements, which count the number of rows for a particular table and its partitions. The row counts are stored in the Metastore and can be used by other engines that interrogate the Metastore during query planning.
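
As a brief sketch, here is how both row counts and, where the engine supports it, column-level statistics could be gathered for the inventory table created earlier:

-- table-level row counts, stored in the Metastore
analyze table inventory compute statistics;

-- column-level statistics (min/max, distinct values, nulls) for the planner
analyze table inventory compute statistics for columns;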