This is the first post of a multi-part series of technical blog posts on Spark on Ceph:
- What about locality?
- Anatomy of the S3A filesystem client
- To the cloud!
- Storing tables in Ceph Object Storage
- Comparing with HDFS—TestDFSIO
- Comparing with remote HDFS—Hive Testbench (SparkSQL)
- Comparing with local HDFS—Hive Testbench (SparkSQL)
- Comparing with remote HDFS—Hive Testbench (Impala)
- Interactive speedup
- AI and machine learning workloads
- The write firehose
Without fail, every time I stand in front of a group of people and talk about using an object store to persist analytics data, someone stands up and makes a statement along the lines of:
“Will performance suck because the benefits of locality are lost?”
It’s not surprising—We’ve all been indoctrinated by the gospel of MapReduce for over a decade now. Let’s examine the historical context that gave rise to the locality optimization and analyze the advantages and disadvantages.
Google published the seminal GFS and MapReduce papers in 2003 and 2004 and showed how to build reliable data processing platforms from commodity components. The landscape of hardware components then was vastly different from what we see in contemporary datacenters. The specifications of the test cluster used to test MapReduce, and the efficacy of the locality optimization, were included in the slide material that accompanied the OSDI MapReduce paper.
Cluster of 1800 machines, [each with]:
- 4GB of memory
- Dual-processor 2 GHz Xeons with hyperthreading
- Dual 160GB IDE disks
- Gigabit Ethernet per machine
- Bisection bandwidth of 100 Gb/s
If we draw up a wireframe with speeds and feeds of their distributed system, we can quickly identify systemic bottlenecks. We’ll be generous and assume each IDE disk is capable of data transfer rate of 50 MB/s. To determine the available bisectional bandwidth per host, we’ll divide the cluster wide bisectional bandwidth by the number of hosts.
The aggregate throughput of the disks roughly matches the throughput of the host network interface, a quality that’s maintained with contemporary hadoop nodes from today with 12 SATA disks and a 10GbE network interface. After we leave the host and arrive at the network bisection, the challenge facing Google engineers is immediately obvious: a network oversubscription of 18 to 1. In fact, this constraint alone lead to the development of the MapReduce locality optimization.
Networking equipment in 2004 was only available from a handful of vendors, due largely to the fact that vendors needed to support the capital costs of ASIC research and development. In the subsequent years, this began to change with the rise of merchant silicon and, in particular, the widespread availability of switching ASICs from the likes of Broadcom. Network engineers quickly figured out how to build network fabrics with little to no oversubscription, evidenced by a paper published by researchers from UC San Diego at the Hot Interconnects Symposium in 2009. The concepts of this paper have since seen widespread implementation in datacenters around the world. One implementation, notable for its size and publicity, would be the next-generation data fabric used in Facebook’s Altoona facility.
While networking engineers were furiously experimenting with new hardware and fabric designs, distributed storage and processing engineers were keeping equally busy. Hadoop spun out of the Nutch project in 2006. Hadoop then consisted of a distributed filesystem modeled after GFS, called Hadoop distributed filesystem (HDFS), and a MapReduce implementation. The Hadoop framework included the locality optimization described in the MapReduce paper.
When the aggregate throughput of the storage media on each host is greater than the host’s available network bandwidth, or the host’s portion of bisectional network bandwidth, jobs can be completed faster with the locality optimization. If the data is being read from even faster media, perhaps DRAM by way of the host’s page cache, then locality can be hugely beneficial. Practical examples of this might be iterative queries with MPP engines like Impala or Presto. These engines also have workers ready to process queries immediately, which removes latencies associated with provisioning executors by way of a scheduling system like YARN. In some cases, these scheduling delays can dampen the benefits of locality.
Simply put, the locality optimization is predicated on the ability to move computation to the storage. This means that compute and storage are coupled, which leads to a number of disadvantages.
One key example are large, multi-tenant clusters with shared resources across multiple teams. Yes, YARN has the ability to segment workloads into distinct queues with different resource reservations, but most of the organizations I’ve spoken with have complained that even with these abilities it’s not uncommon to see workloads interfere with each other. The result? Compromised service level objectives and/or agreements. This typically leads to teams requesting multiple dedicated clusters, each with isolated compute and storage resources.
Each cluster typically has vertically integrated software versioning challenges. For example, it’s harder to experiment with the latest and greatest releases of analytics software when storage and analytics software are packaged together. One team’s pipeline might rely on mature components, for whom an upgrade is viewed as disruptive. Another team might want to move fast to get access to the latest and greatest versions of a machine learning library, or improvements in query optimizers. This puts data platform operations staff in a tricky position. Again, the result is typically workload dedicated clusters, with isolated compute and storage resources.
In a large organization, it’s not uncommon for there to be a myriad of these dedicated clusters. The nightmare of capacity planning each of these clusters, duplicating data sets between them, keeping those data sets up to date, and maintaining the lineage of those data sets would make for a great Stephen King novel. At the very least, it might encourage an ecosystem of startups aimed at easing those operational hardships.
In the advantages section, I discussed scheduler latency. The locality optimization is predicated on the scheduler’s ability to resolve constraints—finding hosts that can satisfy the multi-dimensional constraints of a particular task. Sometimes, the scheduler can’t find hosts that satisfy the locality constraint with sufficient compute and memory resources. In the case of the Fair Scheduler, this translates to a scheduling delay that can impact job completion time.
Datacenter network fabrics are vastly different than they were in 2004, when the locality optimization was first detailed in the MapReduce paper. Both public and private clouds are supported by fat tree networks with low or zero oversubscription. Tenants’ distributed applications with heavy east-west traffic patterns demand nothing less. In Amazon, for example, instances that reside in the same placement group of an availability zone have zero oversubscription. The rise of these modalities has made locality much less relevant. More and more companies are choosing the flexibility offered by decoupling compute and storage. Perhaps we’re seeing the notion of locality expand to encompass the entire datacenter, reimagining the datacenter as a computer.