I recently announced the availability of the NOBACKUP storage service at DPB and DGE, the culmination of several months of research, design, building, and testing.  I personally find the story behind building NOBACKUP to be an interesting one, and figured I’d share how it came to be.

Purpose & Technology

Many times I’ve been asked by our userbase where they can store datasets downloaded from external authoritative sources for analysis, or where to store temporary & intermediate data during active computation.  Since this data can either be re-downloaded from the authoritative source or isn’t part of the results that end up in a final publication, there’s no need to ensure it is backed up.  On Carnegie’s Memex cluster, there’s a dedicated scratch space that’s perfect for this type of data, but we haven’t had a similar in-house solution that could be utilized by all of our internal computation systems.  Given this need, the NOBACKUP project began.

Since the primary focus of NOBACKUP is on data used during active computation, typical considerations like backups and disaster recovery can be put aside and the system can be optimized for maximum throughput.  This is also why the system is named NOBACKUP: to be sure all users of the system are aware that there aren’t any of the usual safeguards against accidental deletion, overwrite, or system failure.  It’s also the same name that NASA’s Advanced Supercomputing (NAS) Division gives to the scratch storage on their systems, and imitation is said to be the sincerest form of flattery.

Typically in HPC environments, high throughput is achieved through parallelization.  This applies to storage as much as it does to computation, through the use of parallel filesystems.  Parallel filesystems are designed to efficiently handle multiple simultaneous I/O requests from users and processes, distributing the I/O load across as many storage systems as possible.  Lustre is one of the most commonly used parallel filesystems in HPC, and is something we have in-house experience with from the Lustre-based scratch space on Carnegie’s Memex cluster.
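
To make the idea concrete, here’s a toy sketch of striping (not Lustre code, just an illustration with made-up stripe size and target names): a file is split into fixed-size chunks that are distributed round-robin across several storage targets and written in parallel, so no single server has to carry the whole load.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 1 << 20                          # 1 MiB stripes (illustrative)
TARGETS = ["ost0", "ost1", "ost2", "ost3"]     # pretend object storage targets

def write_chunk(target, offset, chunk):
    # In a real parallel filesystem this would be an RPC to the storage
    # server that owns this stripe; here we only report what would happen.
    print(f"{target}: write {len(chunk)} bytes at file offset {offset}")

def striped_write(data: bytes):
    # Round-robin each stripe onto the next target and issue the writes
    # concurrently, so throughput scales with the number of targets.
    with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
        for i, offset in enumerate(range(0, len(data), STRIPE_SIZE)):
            pool.submit(write_chunk, TARGETS[i % len(TARGETS)],
                        offset, data[offset:offset + STRIPE_SIZE])

striped_write(b"x" * (8 << 20))   # an 8 MiB "file" lands as 2 MiB per target
```

Lustre exposes the real version of this as per-file or per-directory striping (controlled with lfs setstripe), but the principle is the same: the more targets participating, the more aggregate throughput is available.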

For our in-house storage systems, I’ve heavily utilized the ZFS filesystem as it provides a multitude of modern features, including snapshots and transparent compression.  The snapshotting capability works well for letting users self-restore deleted or overwritten data.  And transparent compression allows us to squeeze a significantly larger amount of data onto our storage infrastructure than would otherwise be possible (typically a ~1.5-1.7x compression ratio).  ZFS compression also has the advantage of accelerating read throughput, since less data is read from disk (compressed) than the total delivered to clients (uncompressed).
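
A quick back-of-the-envelope illustration of both effects, using the typical ratio quoted above (the pool size and raw disk rate are hypothetical placeholders):

```python
compression_ratio = 1.6      # typical ~1.5-1.7x ratio we see in practice
raw_capacity_tb = 100        # hypothetical pool size
raw_read_mb_s = 1000         # hypothetical read rate off the underlying disks

# Compression stretches the same disks further...
effective_capacity_tb = raw_capacity_tb * compression_ratio   # ~160 TB of user data

# ...and accelerates reads: only compressed bytes leave the disks, but clients
# receive the uncompressed stream.
effective_read_mb_s = raw_read_mb_s * compression_ratio       # ~1600 MB/s delivered

print(f"~{effective_capacity_tb:.0f} TB usable, ~{effective_read_mb_s:.0f} MB/s reads")
```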

If only Lustre and ZFS could work together, then we could have a solution that would provide the best of both worlds.  Turns out, you can do just that.

Design

[Figure: NOBACKUP.png]

NOBACKUP is a combination of storage servers and technologies working together to achieve high-throughput I/O and to ensure access regardless of the network technology or physical location of the client.  NOBACKUP is primarily a Lustre-based parallel filesystem, combining multiple storage servers (1 Management, 1 Metadata, 2x Data/Object), each with one or more compressed ZFS filesystems (2x per Data/Object server).  High availability and fault tolerance of the servers are handled by our VMware vSphere infrastructure (vSphere HA), as all of the servers are actually virtual machines within that environment.  All of the ZFS filesystems are backed by an iSCSI storage system.  The storage system has a pair of active/active redundant controllers, each controller with a pair of 10Gb interfaces.  In addition to the usual RAID arrays, the system has both an all-flash SSD array for high-IOPS storage needs and an SSD read cache to accelerate reads of “hot” data.  Connected as a client to the Lustre filesystem, an NFS/SMB server acts as a “bridge” to ensure users can get data in and out of the filesystem without direct Lustre connectivity.

Beta Build Surprise

During initial tests of the beta build, a relatively major and surprising design flaw was revealed.  The hardware being used for NOBACKUP is the same as our other in-house storage solutions, and has provided rock-solid performance for our various storage servers.  Given that the hardware has four 10Gb interfaces (2x controllers, 2x per controller), the storage could theoretically push 40Gbps (~4GB/s) if every interface could be saturated.  However, each storage server backed by this hardware communicates with clients over a single virtual 10Gb interface, capping the potential throughput per storage system at 10Gbps (~1GB/s).  The parallelism introduced by Lustre allows NOBACKUP to overcome this limitation.  So, when I started the initial parallel I/O test, I was excited to see just how quickly we could read and write with this limitation removed.
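
In back-of-the-envelope terms (using the rough 10Gbps ≈ 1GB/s equivalence above), the expectation looked like this:

```python
ifaces = 4                 # 2 controllers x 2x 10Gb interfaces on the array
per_iface_gb_s = 1.0       # roughly what a saturated 10Gb link delivers

array_ceiling_gb_s = ifaces * per_iface_gb_s   # ~4 GB/s if every interface is busy
per_server_gb_s = 1 * per_iface_gb_s           # ~1 GB/s through one virtual 10Gb NIC

# Lustre's trick: spread the I/O across multiple servers, each with its own
# virtual interface, so the aggregate can climb back toward the array's ceiling
# instead of being pinned at a single server's ~1 GB/s.
print(f"array ceiling ~{array_ceiling_gb_s:.0f} GB/s, single server ~{per_server_gb_s:.0f} GB/s")
```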

[Figure: IOR - Beta Stats]

So… write speeds were able to achieve ~2.6GB/s, much better than the former maximum of 1GB/s.  However, read speeds were bottlenecked at 1GB/s.  This was baffling at first.  Shouldn’t the SSD cache accelerate reads and allow them to match or exceed writes?  Was something misconfigured and breaking during the test?  The problem turned out to be the SSD cache itself.  The cache initially consisted of a pair of SATA SSDs, each with a maximum throughput of 550MB/s, capping reads from the SSD cache at 1.1GB/s.  We’d never noticed this bottleneck before: the access time and IOPS of the SSD cache made the unit feel blazing fast, and since ~1GB/s was the maximum expected throughput prior to our parallel tests, everything had checked out in earlier testing.  Needless to say, the pair of “slow” SSDs were repurposed and replaced with four significantly faster drives to overcome this limitation.
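
In the same back-of-the-envelope terms, the bottleneck and its fix (the per-drive speed of the replacement SSDs below is a placeholder, not the actual spec):

```python
# The original read cache: a pair of SATA SSDs at ~550 MB/s each.
old_cache_gb_s = 2 * 0.55          # ~1.1 GB/s, right where reads flatlined

# After the swap: four faster drives.  The per-drive figure here is an
# assumed placeholder just to illustrate the headroom gained.
assumed_drive_gb_s = 1.0
new_cache_gb_s = 4 * assumed_drive_gb_s   # well clear of the old ~1.1 GB/s ceiling

print(f"old cache ceiling ~{old_cache_gb_s:.1f} GB/s, new ~{new_cache_gb_s:.1f} GB/s")
```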

Final Build Tests

To test the final build, multiple IOR benchmarks were run on NOBACKUP and our other storage systems to make sure NOBACKUP was working, to see how well it scaled with parallel I/O, and to compare its performance with already deployed storage systems.  Testing started by getting a baseline for I/O throughput from a single process.
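
For reference, each data point below came from IOR runs driven under MPI, along these lines (a sketch only; the paths, sizes, and exact flags used for the real runs may have differed):

```python
import subprocess

# Sweep the task counts used below.  Each task writes and then reads its own
# file (-F), with a 1 MiB transfer size and a per-task block size chosen to be
# far larger than any cache in the path (see the P.S. at the end of this post).
for ntasks in (1, 4, 32, 96):
    subprocess.run(
        [
            "mpirun", "-np", str(ntasks),
            "ior",
            "-w", "-r",     # write phase, then read phase
            "-F",           # file per process
            "-t", "1m",     # transfer size
            "-b", "16g",    # data moved per task
            "-e",           # fsync on write close, so we measure storage, not RAM
            "-o", "/nobackup/benchmarks/ior.testfile",  # hypothetical test path
        ],
        check=True,
    )
```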

[Figure: IOR - 1 task]

With just one process reading and writing from NOBACKUP, there’s no real gain compared to our existing storage systems.  And while reads from the NFS/SMB gateway are on par with other systems, writes were approximately 50% slower than our existing NFS storage system, Data.  This slowdown is likely overhead from the bridge receiving incoming data as an NFS/SMB server while simultaneously writing that same data back out as a Lustre client.

So… performance with a single read/write isn’t great.  How about scaling things up to 4 simultaneous reads/writes to see parallelization in action?

[Figure: IOR - 4 task]

Now NOBACKUP starts to shine.  Lustre-connected NOBACKUP clients achieve 2-5x the I/O compared to Data, and are even beating out clients connected to the significantly larger (but older) Lustre system on Memex.  NFS/SMB gateway writes are still relatively slow, but aggregate throughput increased as the number of clients increased.

Let’s keep scaling up.  32 simultaneous reads/writes this time.

[Figure: IOR - 32 task]

NOBACKUP is still doing very well, but total throughput has dropped slightly with increased contention for NOBACKUP’s four storage arrays.  However, Memex’s Lustre system has continued to scale, likely due to its enormous size relative to NOBACKUP: Memex’s Lustre Data/Object servers have 6x the CPU, 16x the memory, and 6x the storage arrays of NOBACKUP.

So, let’s go even bigger.  A head-to-head comparison between NOBACKUP and the Lustre system on Memex, under a crushing 96 simultaneous reads/writes.

[Figure: IOR - 96 task]

Under this much load, NOBACKUP starts to show some stress.  Surprisingly, it’s reads that suffer rather than writes; I suspect this may be caused by too little available memory on the NOBACKUP Data/Object servers to efficiently handle decompressing the compressed data being read back from ZFS.  Memex’s Lustre is still holding steady thanks to the brute force behind its raw size.
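
One way to sanity-check that hypothesis on the Data/Object servers would be to watch ZFS ARC size and hit rate while the load is applied; a minimal sketch against ZFS on Linux’s /proc interface:

```python
# Peek at ZFS ARC statistics on a Linux server (provided by ZFS on Linux).
def arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:       # the first two lines are headers
            name, _kind, value = line.split()
            stats[name] = int(value)
    return stats

s = arcstats()
hit_rate = s["hits"] / (s["hits"] + s["misses"])
print(f"ARC: {s['size'] / 2**30:.1f} GiB used of {s['c_max'] / 2**30:.1f} GiB max, "
      f"hit rate {hit_rate:.1%}")
```

An ARC stuck well below its maximum size, or a hit rate that collapses under load, would point toward memory as the limiting factor.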

Given this final result, a memory upgrade for the system may be in the works if we start to see significant utilization of NOBACKUP.

P.S. How *NOT* to Benchmark

One fun note to wrap up.  When benchmarking a storage solution, it’s critical to know what parts of the storage stack your benchmark is actually touching.  All tested storage systems have read caches that are intended to speed up reads of “hot” data.  In fact, each storage system has multiple layers of caches, each one trying to alleviate the load on the deeper layers.

Caches (that I know of… there could be more!)

| Data | NOBACKUP | Memex Lustre |
| --- | --- | --- |
| NFS Client: NFS Cache (Memory) | Lustre Client: Kernel Page Cache (Memory) | Lustre Client: Kernel Page Cache (Memory) |
| NFS Server: Kernel Page Cache (Memory) | Lustre OSS Server: Kernel Page Cache (Memory) | Lustre OSS Server: Kernel Page Cache (Memory) |
| NFS Server: ZFS ARC (Memory) | Lustre OSS Server: ZFS ARC (Memory) | Storage Array: Controller Cache (Memory) |
| Storage Array: Controller Cache (Memory) | Storage Array: Controller Cache (Memory) | |
| NFS Server: ZFS L2ARC (SSD) | Storage Array: SSD Read Cache (SSD) | |
| Storage Array: SSD Read Cache (SSD) | | |

If you don’t set your test dataset size to be large enough, everything may fit into a cache, and you’ll end up only benchmarking a cache.  For example, the graph below is from my very first run of IOR after compiling it.  I wanted a “quick” proof of concept, just to make sure the code ran and worked as expected, so I ran a 4-process job with each process reading/writing 1GB of data (4GB total).

[Figure: IOR - lolcache]

Well… it runs, yes.  It also flew at super speed!  13-17GB/s (104-136Gbps) reads are not aggregate disk speeds.  They’re not even SSD speeds.  In fact, they’re faster than the theoretical maximum aggregate network bandwidth between the storage servers and clients.  The data read in this benchmark came directly out of the clients’ caches, from their own memory.  While this is cool to see, it doesn’t provide a good measurement of how the different storage systems compare to each other.  Naturally, all final tests utilized datasets larger than the caches to ensure a fair comparison between systems that would be relevant to scientific workloads.
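
The practical fix is to size the dataset so it can’t fit in any cache along the path; a minimal sketch of that sanity check (the cache sizes are placeholders, plug in your own client memory, server memory, and SSD cache sizes):

```python
# Placeholder sizes, in GiB, for every layer a "read" might really be served from.
cache_sizes_gib = {
    "client page cache (RAM)": 64,
    "server page cache / ZFS ARC (RAM)": 128,
    "array controller cache (RAM)": 16,
    "ZFS L2ARC / array SSD read cache": 800,
}

ntasks, per_task_gib = 4, 1                  # the "quick" 4 GB proof-of-concept above
dataset_gib = ntasks * per_task_gib

largest = max(cache_sizes_gib.values())
if dataset_gib <= largest:
    print(f"{dataset_gib} GiB fits inside a {largest} GiB cache layer: "
          "you're benchmarking a cache, not the storage.")
else:
    print(f"{dataset_gib} GiB exceeds every cache layer: results should reflect the storage itself.")
```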