Scaleout Metadata File Systems already store much of your data. What are they?

TL;DR: A new class of hierarchical distributed file system with scaleout metadata has taken over at Google, Facebook, and Microsoft, and provides a single centralized file system that manages the data for an entire data center, scaling to Exabytes in size. The common architectural feature of these systems is scaleout metadata, so we call them scaleout metadata file systems.

A longer version of this article was first published here.

Metadata: The hierarchical file system bottleneck

Image from Unsplash. Metadata is the Brain of a File System.

Hierarchical file systems typically provide well-defined behaviour (a POSIX API) for how a client can securely create, read, write, modify, delete, organize, and find files.

The data in such file systems is stored in files as blocks or extents. Distributed file systems spread and replicate these blocks/extents over many servers for improved performance and high availability. However, the data about what files, directories, blocks, and file system permissions are in the system have historically been stored in a single server called the metaserver or namenode. We call this data about the file system objects metadata. In file systems like HDFS, the namenode stores its metadata in-memory to improve both latency and throughput in the number of metadata operations it can support per second. Example metadata operations are: create a directory, move or rename a file or directory, change file permissions or ownership.

As the size of data under management by distributed file systems increased, it was quickly discovered that metadata servers became a bottleneck. For example, HDFS could scale to, at a push, a Petabyte, but not handle more than 100K reads/sec and only a few thousand writes/sec.

It has long been desired to re-architect distributed file systems to shard their metadata across many servers to enable them to support (1) larger volumes of metadata and (2) more operations/second. But it is a very hard problem.

Now, we discuss the only four scaleout metadata file systems that are publicly known today: Google Colossus, Facebook Tectonic, Microsoft ADLSv2, and Logical Clocks’ HopsFS.

Google Colossus

Even though we first heard about Colossus’ architecture in 2009 and its name in 2012, Google has been surprisingly secretive about the lowest layer of their scalable storage and compute architecture. However, after the release of Tectonic (coincidence?) in early 2021, Google released more details on Colossus in May 2021.

Image by Google: Colossus under the hood: a peek into Google’s scalable storage system

Metadata Storage System

Colossus’ metadata storage service is BigTable, which does not support cross-shard transactions. We assume this means that Colossus lacks atomic rename, a hole that is filled for tabular data (at least) by Spanner, which supports cross-shard transactions.

In Colossus, file system clients connect to curators to perform metadata operations, who, in turn, talk to BigTable. Custodians perform file system maintenance operations, and “D” services provide block storage services, where clients read/write blocks directly from/to “D” servers.

Image by Google: Colossus under the hood: a peek into Google’s scalable storage system

Different clients of Colossus can store their data on different volumes (metadata shards). We do not know if atomic rename is possible, although it is unlikely, as BigTable does not support multi-row transactions within a tablet.

Facebook Tectonic

Tectonic was first announced as a file system at USENIX Fast 2021, and it unifies Facebook’s previous storage services (federated HDFS, Haystack, and others) to provide a data-center scale file system.

Metadata Storage System

Similar to Colossus, Tectonic stores its metadata in a key-value store, but in this case in ZippyDB. As ZippyDB lacks cross-partition transactions, cross-namespace file system operations are not supported. That is, you cannot atomically move a file from one volume (metadata shard) to another. Often, such operations are not needed, as all the data for a given service can fit in a single namespace, and there are no file system operations between different applications. There are separate stateless services to manage the name space, blocks, files, and file system maintenance operations.

Image from USENIX Fast paper: Facebook’s Tectonic Filesystem: Efficiency from Exascale

Microsoft ADLSv2

Azure Data Lake Storage (ADLS) was first announced at Sigmod 2017 and it supports Hadoop distributed file system (HDFS) and Cosmos APIs. It has since been redesigned as Azure Data Lake Gen 2 (ADLSv2) that provides multi-protocol support to the same data using the Hadoop File System API, the Azure Data Lake Storage API and the Azure Blob storage API. Unlike Colossus and Tectonic, it is available for use as a service — but only on Azure.

Metadata Storage System

The most recent information about ADLS’ architecture is the original paper describing ADLS from 2017 — no architecture has been published yet for ADLSv2. However, ADLS used RSL-HK to store metadata and it has a key-value store (ring) with shards using state machine replication (Paxos) and with transactions across shards, al in an in-memory engine (“It implements a novel combination of Paxos and a new transactional in-memory block data management design.”).

Image from Sigmod Paper: Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics


HopsFS was first announced at USENIX Fast 2017 and provides a HDFS API. HopsFS is a rewrite of HDFS and it supports multiple stateless namenode (metadata servers), where the leader performs file system maintenance operations, and a pluggable metadata storage layer.

Metadata Storage System

HopsFS provides a DAL API to support different metadata storage engines. Currently the default engine for HopsFS is RonDB (a fork of NDB Cluster, the storage engine for MySQL Cluster), a scalable key-value store with SQL capabilities. RonDB can scale to handle hundreds of millions of transactional reads per second and 10s of millions of transactional writes per second and it provides both a native key-value API and a SQL API via a MySQL Server. RonDB also provides a CDC (change-data-capture) API to allow us to automatically replicate changes in metadata to Elasticsearch, providing a free-text search API to HopsFS’ metadata (including its extended metadata). Metadata can be queried using any of the 3 APIs: the native key-value API for RoNDB, the SQL API, or using free-text search in Elasticsearch.

Image by Author. RonDB: LATS (low Latency, high Availability, high Throughput, scalable Storage).

HopsFS scales the Namespace Layer with RonDB and Stateless Namenodes, while the block layer is cloud object storage.

Comparing the Scaleout Metadata Architectures

When sharding the state of the metadata server over many servers, you need to make decisions about how to do it. Google used its existing BigTable key-value store to store Colossus’ metadata. Facebook, similarly, chose the ZippyDB key-value store for Tectonic. Microsoft built their own Replicated State Library — Hekaton Ring Service (RSL-HK) to scale-out ADLS’ metadata. The RSL-HK ring architecture combines Paxos-based metadata with Hekaton (in-memory engine from SQL Server). HopsFS used NDBCluster (now RonDB) to scale out its metadata.

The capabilities of these underlying storage engines are reflected in the semantics provided by the higher level file systems. For example, Tectonic and (probably) Colossus do not support atomic move of files from any directory to any other directory. Their key-value stores do not support agreement protocols across shards (only within a shard). So, at the file system level, you introduce an abstraction like a file system volume (Tectonic calls them tenants), and users then know they can perform atomic rename/move within that volume, but not across volumes. Google solves this problem at a higher layer for structured data with Spanner by implementing two-phase commit transactions to ensure consistency across shards. In contrast, RSL-HK Ring by Microsoft and RonDB by Logical Clocks support cross-shard transactions that enable both ADLSv2 and HopsFS to support atomic rename/move between any two paths in the file system.

To put this in database terms, the consistency models provided by the scaleout metadata file systems are tightly coupled to the capabilities provided by the underlying metadata store. If the store does not support cross-partition transactions — consistent operations across multiple shards, you will not get strongly consistent cross-partition file system operations. For example, if the metadata store is a key-value store, where each shard typically maintains strongly consistent key-value data using Paxos. But Paxos do not compose — you cannot run Paxos between two shards that themselves maintain consistency using Paxos. In contrast, RonDB supports 2-phase commit (2PC) across shards, enabling strongly consistent metadata operations both within shards and across shards.

Once a scaleout metadata storage layer is in place, stateless services can be used to provide access control and implement background maintenance tasks like maintaining the durability and availability of data, disk space balancing, and repairing blocks.

Deja-vu all over again

The journey from a stronger POSIX-like file system to a weaker object storage paradigm and back again has parallels in the journey that databases have made in recent years. Databases made the transition from strongly consistent single-host systems (relational databases) to highly available (HA), eventually consistent distributed systems (NoSQL systems) to handle the massive increases in data managed by databases. However, NoSQL is just too hard for developers, and databases are returning to strongly consistent (but now scalable) NewSQL systems, with databases such as Spanner, CockroachDB, SingleSQL, and NDB Cluster.

The scaleout metadata file systems, introduce here, show that distributed hierarchical file systems are completing a similar journey, going from strongly consistent POSIX-compliant file systems to object stores (with their weaker consistency models), and back to distributed hierarchical file systems that are have solved the scalability problem by redesigning the file system around a mutable, scaleout metadata service.




Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Jim Dowling

Jim Dowling

@jim_dowling CEO of Logical Clocks AB. Associate Prof at KTH Royal Institute of Technology Stockholm, and Senior Researcher at RISE SICS.