Disk Pooling in Linux with mergerFS

Remember the old days when we used to marvel at disk drives that measured in the hundreds of megabytes? In retrospect it seems funny now, but it wasn’t uncommon to hear someone mutter, “Man, we’ll never fill that thing up.”

If you don’t remember that, you probably don’t recall life before iPhones either, but we old timers can assure you that, once upon a time, hard drives were a mere fraction of the size they are today. Oddly enough, though, you’ll still hear the same old tripe, even from fellow IT folks. The difference now, however, is that they’re holding a helium-filled 10TB drive in their hands. But just like yesteryear, we’ll fill it up, and it’ll happen faster than you think.

In the quest to build a more scalable home IT lab, we grew tired of the old paradigm of building bigger and bigger file servers and migrating hordes of data as we outgrew the drive arrays. Rather than labor with year-over-year upgrades, we ultimately arrived at a storage solution that we feel is the best compromise of scalability, performance, and cost, while delivering SAN-NAS hybrid flexibility.

We’ll detail the full storage build in another article, but for now we want to focus specifically on the Linux filesystem that we chose for the NAS component of our storage server: mergerFS.

What is mergerFS?

Before we get into the specifics regarding mergerFS, let’s provide some relevant definitions:

Filesystem – Simply put, a filesystem is a system of organization used to store and retrieve data in a computer system. The filesystem manages the space used, filenames, directories, and file metadata such as allocated blocks, timestamps, ACLs, etc.

Union Mount – A union mount is a method for joining multiple directories into a single directory. To a user interfacing with the union, the directory would appear to contain the aggregate contents of the directories that have been combined.

FUSE – Filesystem in Userspace. FUSE is software that allows non-privileged users to create and mount their own filesystems that run as an unprivileged process or daemon. Userspace filesystems have a number of advantages. Unlike kernel-space filesystems, which are rigorously tested and reviewed prior to acceptance into the Linux kernel, userspace filesystems are abstracted away from the kernel and can be developed much more nimbly. It’s also unlikely that any filesystem that did not address mainstream requirements would be accepted into the kernel, so userspace filesystems are able to address more niche needs. We could go on about FUSE, but suffice it to say that userspace filesystems provide unique opportunities for developers to do some cool things with filesystems that were not easily attainable before.

So what is mergerFS? MergerFS is a union filesystem, similar to mhddfs, UnionFS, and aufs. MergerFS joins multiple directories so that they appear to the user as a single directory. This merged directory will contain all of the files and directories present in each of the joined directories. Furthermore, the merged directory will be mounted on a single point in the filesystem, greatly simplifying access and management of files and subdirectories. And, when each of the merged directories is itself a mount point representing an individual disk or disk array, mergerFS effectively serves as a disk pooling utility, allowing us to group disparate hard disks, arrays, or any combination of the two. At Teknophiles, we’ve used several union filesystems in the past, and mergerFS stands out to us for two reasons: 1) It’s extremely easy to install, configure, and use, and 2) It just plain works.

An example of two merged directories, /mnt/mergerFS01 and /mnt/mergerFS02, and the resultant union.

mergerFS vs. LVM

You might be asking yourself, “why not just use LVM, instead of mergerFS?” Though the Logical Volume Manager can address some of the same challenges that mergerFS does, such as creating a single large logical volume from numerous physical devices (disks and/or arrays), they’re quite different animals. LVM sits just above the physical disks and partitions, while mergerFS sits on top of the filesystems of those partitions. For our application, however, LVM has one significant drawback: When multiple physical volumes make up a single volume group and a logical volume is carved out from that volume group, it’s possible, and even probable, that a file will be comprised of physical extents from multiple devices. Consider the diagram below.

The practical concern here is that should an underlying physical volume in LVM fail catastrophically, be it a single disk or even a whole RAID array, any data on, or that spans, the failed volume will be impacted. Of course, the same is true of mergerFS, but since the data does not span physical volumes, it’s much easier to determine which files are impacted. There’s unfortunately no easy way that we’ve found to determine which files are located on which physical volumes in LVM.
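Because each mergerFS branch is an ordinary directory, checking where a file physically lives takes only a few lines of shell. The sketch below is our own illustration rather than anything shipped with mergerFS: the helper name which_branch and the sample file name are hypothetical, and the branch paths are the ones used in the examples in this article.

```shell
# which_branch REL_PATH BRANCH...
# Print every branch directory that physically contains REL_PATH.
which_branch() {
    rel="$1"
    shift
    for branch in "$@"; do
        # -e follows symlinks; -L also catches dangling symlinks
        if [ -e "$branch/$rel" ] || [ -L "$branch/$rel" ]; then
            printf '%s\n' "$branch"
        fi
    done
}

# Example: which underlying volume holds pics/photo1.jpg?
which_branch "pics/photo1.jpg" /mnt/mergerFS01 /mnt/mergerFS02
```

If a volume fails, running the same check against a list of important paths gives a quick picture of what was on it.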

Flexibility & Scalability

As we’ve alluded to a few times, mergerFS doesn’t care whether the underlying physical volumes are single disks, RAID arrays, or LVM volumes. Since the two technologies operate at different levels with respect to the underlying disks, nothing prevents the use of LVM with mergerFS. In fact, we commonly use the following formula to create large mergerFS disk pools: multiple mdadm 4-Disk RAID10 arrays > LVM > mergerFS. In our case, LVM is usually just used for management; we typically do not span multiple physical volumes with any volume group, though you easily could.

This gives you incredible flexibility and scalability – want to add an individual 5400 RPM 4TB disk to an existing 12TB RAID6 array comprised of 6+2 7200 RPM 2TB drives, for 16TB total? No problem. Want to later add an LVM logical volume that spans two 8TB 2+2 RAID10 arrays for another 16TB? MergerFS is fine with that, too. In fact, mergerFS is completely agnostic to disk controller, drive size, speed, form factor, etc. With mergerFS, you can grow your storage server as you see fit.

mergerFS & Your Data

One interesting note about mergerFS is that since it is just a proxy for your data, it does not manipulate the data in any way. Prior to being part of a mergerFS union, each of your component disks, arrays, and logical volumes will already have a filesystem. This makes data recovery quite simple – should a mergerFS pool completely crash (unlikely though that is), just remove the component storage devices, drop them into a compatible system, mount as usual, and access your data.

What’s more, you can just as easily add a disk to mergerFS that already has data on it. This allows you to decide at some later point if you wish to add an in-use disk to the mergerFS pool (try that with LVM). The existing data will simply show up in the mergerFS filesystem, along with whatever data is on the other volumes. It just doesn’t get any more straightforward!

mergerFS & Samba

As we stated earlier, we selected mergerFS for the NAS component of our Teknophiles Ultimate Home IT Lab storage solution. Since this is a NAS that is expected to serve files to users in a Windows Domain, we also run Samba to facilitate the Windows file sharing. There are reportedly rumblings about issues with mergerFS and Samba; according to the author of mergerFS, however, these are likely due to improper Samba configuration.

Here at Teknophiles, we can unequivocally say that in server configurations based on our “Linux File Servers in a Windows Domain” article, Samba is perfectly stable with mergerFS. In fact, in one mergerFS pool, we’re serving up over 20TB of data spread over multiple mdadm RAID arrays. The server in question is currently sitting at 400 days of uptime, without so much as a hiccup from Samba or mergerFS.

Installing mergerFS

OK, that’s enough background. Now let’s get to the fun part. To install mergerFS, first download the latest release for your platform. We’re installing this on an Ubuntu 14.04 LTS server, so we’ll download the Trusty 64-bit .deb file. Once downloaded, install via the Debian package manager.

Creating mergerFS volumes

Now we’re going to add a couple of volumes to a mergerFS pool. You can see here that we have a virtual machine with two 5GB virtual disks, /dev/sdb and /dev/sdc. You can also see that each disk has an ext4 partition.

Next, create the mount points for these two disks and mount the disks in their respective directories.
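As a rough sketch of this step, the commands might look like the following. The partition names /dev/sdb1 and /dev/sdc1 are assumptions (we only know the disks are /dev/sdb and /dev/sdc), and the run helper is our own addition: it prints each command rather than executing it unless you set DRY_RUN=0.

```shell
# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One mount point per disk
run sudo mkdir -p /mnt/mergerFS01 /mnt/mergerFS02

# Assumed partition names; check yours with lsblk before mounting
run sudo mount /dev/sdb1 /mnt/mergerFS01
run sudo mount /dev/sdc1 /mnt/mergerFS02
```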

Now we need to create a mount point for the union filesystem, which we’ll call ‘virt’ for our virtual directory.

And finally, we can mount the union filesystem. The command follows the syntax below, where <srcmounts> is a colon delimited list of directories you wish to merge.

mergerfs -o<options> <srcmounts> <mountpoint>

Additionally, you can also use globbing for the source paths, but you must escape the wildcard character.
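The exact command isn’t reproduced here, so the following is a sketch of what such a mount might look like, using the four options described next. The source directories and mount point follow the examples in this article; the fsname value of mergerFS is an assumption, as is the small run helper, which only prints each command unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

OPTS="defaults,allow_other,use_ino,fsname=mergerFS"

# Mount point for the union filesystem
run sudo mkdir -p /mnt/virt

# Explicit, colon-delimited list of source directories
run sudo mergerfs -o "$OPTS" /mnt/mergerFS01:/mnt/mergerFS02 /mnt/virt

# Alternative form using a glob (use one form or the other, not both).
# The backslash keeps the shell from expanding the wildcard, so that
# mergerfs receives the literal * and expands it itself:
#   sudo mergerfs -o "$OPTS" /mnt/mergerFS0\* /mnt/virt
```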

There are numerous additional options that are available for mergerFS, but the above command will work well in most scenarios. From the mergerFS man page, here’s what the above options do:

defaults: a shortcut for FUSE’s atomic_o_trunc, auto_cache, big_writes, default_permissions, splice_move, splice_read, and splice_write. These options seem to provide the best performance.

allow_other: a libfuse option which allows users besides the one which ran mergerfs to see the filesystem. This is required for most use-cases.

use_ino: causes mergerfs to supply file/directory inodes rather than libfuse. While not a default it is generally recommended it be enabled so that hard linked files share the same inode value.

fsname=name: sets the name of the filesystem as seen in mount, df, etc. Defaults to a list of the source paths concatenated together with the longest common prefix removed.

You can now see that we have a new volume called “mergerFS,” with an aggregate size of 10GB, mounted on /mnt/virt. This new mount point can be written to, used in local applications, or served up via Samba just as any other mount point.

Other Options

Although getting a little into the weeds, it’s worth touching on an additional option in mergerFS that is both interesting and quite useful. The FUSE function policies determine how a number of different commands behave when acting upon the data in the mergerFS disk pool.

func.<func>=<policy>: sets the specific FUSE function’s policy. See below for the list of value types. Example: func.getattr=newest

Below we can see the FUSE functions and their category classifications, as well as the default policies.

| Category | FUSE Functions |
| --- | --- |
| action | chmod, chown, link, removexattr, rename, rmdir, setxattr, truncate, unlink, utimens |
| create | create, mkdir, mknod, symlink |
| search | access, getattr, getxattr, ioctl, listxattr, open, readlink |
| N/A | fallocate, fgetattr, fsync, ftruncate, ioctl, read, readdir, release, statfs, write |

| Category | Default Policy |
| --- | --- |
| action | all |
| create | epmfs |
| search | ff |

To illustrate this a bit better, let’s look at an example. First, let’s consider file or directory creation. File and directory creation fall under the “create” category. Looking at the default policy for the create category we see that it is called, “epmfs.” From the man pages, the epmfs policy is defined as follows:

epmfs (existing path, most free space)
Of all the drives on which the relative path exists choose the drive with the most free space. For create category functions it will exclude readonly drives and those with free space less than min-freespace. Falls back to mfs.

Breaking this down further, we can see that epmfs is a “path-preserving policy,” meaning that only drives that have the existing path will be considered. This gives you a bit of control over where certain files are placed. Consider, for instance, a mergerFS pool of four drives in which only two of the drives contain a directory called /pictures. When using the epmfs policy, only the two drives with the pictures directory will be considered when you copy new images to that directory in the mergerFS pool.

Additionally, the epmfs policy will also serve to fill the drive with the most free space first. Once drives reach equal or near-equal capacities, epmfs will effectively round-robin the drives, as long as they also meet the existing path requirement.
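To make the selection rule concrete, here is a small shell approximation of it. This is purely our own illustration and not how mergerFS is implemented: the function name epmfs_pick is hypothetical, and the sketch ignores read-only drives, min-freespace, and the fallback to mfs.

```shell
# epmfs_pick REL_DIR BRANCH...
# Among the branches that already contain REL_DIR, print the one whose
# filesystem reports the most available space (ties go to the first listed).
epmfs_pick() {
    rel="$1"
    shift
    best=""
    best_avail=-1
    for branch in "$@"; do
        # Path preservation: skip branches that lack the directory
        [ -d "$branch/$rel" ] || continue
        # POSIX df output: 4th column of the 2nd line is available 1K blocks
        avail=$(df -Pk "$branch" | awk 'NR == 2 { print $4 }')
        if [ "$avail" -gt "$best_avail" ]; then
            best="$branch"
            best_avail="$avail"
        fi
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    fi
}

# Example: which branch would receive a new file under pics/?
epmfs_pick pics /mnt/mergerFS01 /mnt/mergerFS02
```

When no branch contains the directory, the sketch prints nothing, which is the point at which the real policy would fall back to mfs.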

There are a number of equally interesting policies, including ones that do not preserve paths (create directories as needed), fill the drive with the least free space (i.e. fill drive1, then drive2, etc.), or simply use the first drive found. Though the defaults will generally suffice, it’s a good idea to become familiar with these policies to ensure that your mergerFS configuration best suits your needs.

Adding Volumes

Similar to creating a mergerFS pool, adding disks is also quite simple. Let’s say, for instance, you want to also use the space on your existing operating system disk for the mergerFS pool. We can simply create another directory in the root filesystem to use in the union. We created ours in /mnt for demonstration purposes, but your home directory might equally suit.

Next, we unmount our existing mergerFS pool and remount it including the new directory.
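A sketch of those two steps, assuming the new directory is /mnt/mergerFS00 and that the pool was originally mounted with the options discussed earlier (the fsname value is an assumption). As before, the run helper is our own addition and only prints the commands unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# New directory on the root filesystem to add to the union
run sudo mkdir -p /mnt/mergerFS00

# Unmount the existing pool, then mount it again with the extra directory
run sudo umount /mnt/virt
run sudo mergerfs -o defaults,allow_other,use_ino,fsname=mergerFS \
    /mnt/mergerFS00:/mnt/mergerFS01:/mnt/mergerFS02 /mnt/virt
```

Removing a directory from the pool works the same way: unmount, then mount again with that path left out of the list.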

Notice that our total mergerFS volume space is now 19GB – the original 10GB from our two 5GB disks, plus the 9GB from /dev/sda2. And since we now have truly disparate volumes, let’s test our default epmfs policy. Start by creating three files in our mergerFS pool mount:

Based on our expectations of how epmfs works, we should see these files in the /mnt/mergerFS00 folder, since /dev/sda2 has the most free space.

Sure enough, this appears to work as we anticipated. Now let’s create a folder on one disk only.

Replicating our previous experiment, we’ll create a few more files, but this time in the pics directory in our mergerFS pool.

Since the epmfs policy should preserve the path pics/, and that path only exists on /mnt/mergerFS01, this is where we expect to see those files.

Removing Volumes

Removing a volume from the mergerFS pool follows the same procedure as adding a drive. Simply remount the pool without the path you wish to remove.

Notice that file1, file2, and file3 are no longer present, since they were located on /dev/sda2, which has been removed. Additionally, our total space is now back to its previous size.

mergerFS & fstab

Typically, you’re going to want your mergerFS pool to be persistent upon reboot. To do this we can simply leverage fstab as we would for any mount point. Using our example above, the fstab entry should take the following format:
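As a hedged sketch, the snippet below assembles an fstab entry for the two-disk pool and prints it for review rather than writing to /etc/fstab. The filesystem type fuse.mergerfs is the one mergerFS documents for fstab; the fsname value and the variable names are our own assumptions.

```shell
BRANCHES="/mnt/mergerFS01:/mnt/mergerFS02"
MOUNTPOINT="/mnt/virt"
OPTS="defaults,allow_other,use_ino,fsname=mergerFS"

# Fields: <file system> <mount point> <type> <options> <dump> <pass>
FSTAB_LINE="$BRANCHES $MOUNTPOINT fuse.mergerfs $OPTS 0 0"

# Review the line, then append it to /etc/fstab by hand
printf '%s\n' "$FSTAB_LINE"
```

The underlying disks need their own fstab entries as well, so that they are mounted before the pool is.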

Performance Considerations

We would be remiss to end this article without discussing the possible performance implications of mergerFS. As with any disk pooling utility, a possible weakness of this type of configuration is a lack of striping across drives. In RAID or LVM configurations, striping may be used to take advantage of simultaneous I/O and throughput of the available spindles. RAID level, array size, and LVM configuration play dramatically into exactly what this looks like, but even a 6+2 RAID6 array with commodity drives can achieve read speeds that will saturate a 10Gbps network. Use LVM to stripe across multiple arrays, and you can achieve stunning performance. If you’re only using single disks in your mergerFS pool, however, you’ll always be limited to the performance of a single drive. And maybe that’s OK, depending on your storage goals. Of course, careful planning of the disk subsystem and using disk arrays in your mergerFS pool can give you the best of both worlds – excellent performance and the flexibility and scalability of disk pooling.

Lastly, it’s worth noting that mergerFS is yet another layer on top of your filesystems and FUSE filesystems by their very nature add some overhead. This overhead is generally negligible, however, especially in low to moderate I/O environments. You might not want to put your busy MySQL database in userspace, but you’ll likely not notice the difference in storing your family picture albums there.

Reclaim Linux Filesystem Reserved Space

As IT Pros, we have a myriad of tools available to us to configure and tweak and tune the systems we manage. So much so, there are often everyday tools right under our noses that might have applications we may not immediately realize. In a Linux environment, tune2fs is an indispensable tool, used to tune parameters on ext2/ext3/ext4 filesystems. Most Linux sysadmins that have used mdadm software RAID will certainly recognize this utility if they’ve ever had to manipulate the stride size or stripe width in an array.


First, let’s take a look at the disks on an Ubuntu file server so we can see what this tool does.

Now, we can use tune2fs with the -l option to list the existing parameters of the filesystem superblock on /dev/sdb1.

Reserved Blocks?

As you can see, there are a number of parameters from the filesystem that we can view, including a number that can be tuned with tune2fs. In this article however, we’re going to focus on a rather simple and somewhat innocuous parameter – reserved block count. Let’s take a look at that parameter again:
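If you’d rather see the reservation as a percentage than as a raw block count, the two relevant fields can be pulled out of the tune2fs -l listing with a little awk. This is a minimal sketch of our own; the helper name reserved_pct is hypothetical, and it relies on the “Block count” and “Reserved block count” labels that tune2fs prints.

```shell
# reserved_pct: read `tune2fs -l` output on stdin and print the share of
# blocks that is reserved, as a percentage with one decimal place.
reserved_pct() {
    awk -F: '
        /^Block count:/          { total = $2 + 0 }
        /^Reserved block count:/ { reserved = $2 + 0 }
        END { if (total > 0) printf "%.1f\n", reserved * 100 / total }
    '
}

# Usage (needs root):
#   sudo tune2fs -l /dev/sdb1 | reserved_pct
```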

At first glance, it isn’t obvious what this parameter means. In fact, I’ve worked with Linux sysadmins with years of experience that weren’t aware of this little gem. To understand this parameter, we probably have to put its origins in a bit of context. Once upon a time, SSDs didn’t exist, and no one knew what a terabyte was. In fact, I remember shelling out well north of $100 for my first 20GB drive. To date myself even further, I remember the first 486-DX PC I built with my father in the early ’90s, and its drive was measured in megabytes. Crazy, I know. Since drive space wasn’t always so plentiful, and the consequences of running out of disk space on the root partition in a Linux system are numerous, early filesystem developers did something smart – they reserved a percentage of filesystem blocks for privileged processes. This ensured that even if disk space ran precariously low, the root user could still log in, and the system could still execute critical processes.

That magic number? Five percent.

And while five percent of that 20GB drive back in 1998 wasn’t very much space, imagine that new 4-disk RAID1/0 array you just created with 10TB WD Red Pros. That’s five percent of 20TB of usable space, or a full terabyte. You see, though this was likely intended for the root filesystem, by default this setting applies to every filesystem created. Now, I don’t know about you, but at $450 for a 10TB WD Red Pro, that’s not exactly space I’d want to throw away.
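The arithmetic is simple enough to check in the shell. The helper below is our own illustration (the name reserved_gb is hypothetical); it works in whole gigabytes with integer division, so the results are rounded down.

```shell
# reserved_gb SIZE_GB [PERCENT]
# Whole gigabytes set aside at a given reserved-block percentage (default 5).
reserved_gb() {
    echo $(( $1 * ${2:-5} / 100 ))
}

reserved_gb 20000   # 20TB of usable space on the RAID1/0 array
reserved_gb 382     # the 382GB filesystem examined in this article
```

With the default five percent, that works out to 1000GB for the 20TB array and 19GB for the 382GB filesystem.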

We Don’t Need No Stinking Reserved Blocks!

The good news, however, is that space isn’t lost forever. If you forget to initially set this parameter when you create the filesystem, tune2fs allows you to retroactively reclaim that space with the -m option.
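A sketch of the sequence on our example partition, /dev/sdb1. The run helper is our own addition and only prints each command unless DRY_RUN=0 is set, which is worth keeping in place until you’re sure you’re pointing at the right device.

```shell
# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Set the reserved block percentage to zero
run sudo tune2fs -m 0 /dev/sdb1

# Confirm the change in the superblock listing and in the free space report
run sudo tune2fs -l /dev/sdb1
run df -h /dev/sdb1
```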

Here you can see we’ve set the reserved blocks on /dev/sdb1 to 0%. Again, this isn’t something you’d want to do on a root filesystem, but for our “multimedia” drive, this is fine – more on that later. Now, let’s look at our filesystem parameters once again.

Notice now that our reserved block count is set to zero. Finally, let’s have a look at our free disk space to see the real world impact. Initially, we had 50GB of 382GB free. Now we can see that, although neither the size of the disk nor the amount of used space has changed, we have 69GB free, reclaiming 19GB of space.

Defrag Implications

Lastly, I’d be remiss if I didn’t mention that there’s one other function these reserved blocks serve. As always, in life there’s no such thing as a free lunch (or free space, in this case). The filesystem reserved blocks also serve to provide the system with free blocks with which to defragment the filesystem. Clearly, removing the reservation isn’t something you’d want to do on a filesystem that contains a database, or in some other situation in which you have a large number of writes and deletions. However, if, as in our case, you’re dealing with mostly static data in a write once, read many (WORM) type configuration, this shouldn’t have a noticeable impact. In fact, the primary developer for tune2fs, Google’s Theodore Ts’o, can be seen here confirming this supposition.

So there you have it. You may be missing out on some valuable space, especially in those multi-terabyte arrays out there. And though it’s no longer 1998, and terabytes do come fairly cheap these days, it’s still nice to know you’re getting all that you paid for.