ZFS On Linux In Production?


We are a CentOS shop, and have the lucky, fortunate problem of having ever-increasing amounts of data to manage. EXT3/4 becomes tough to manage when you start climbing, especially when you have to upgrade, so we’re contemplating switching to ZFS.

As of last spring, it appears that ZFS On Linux (http://zfsonlinux.org/) calls itself production-ready, despite a version number of 0.6.2 and being acknowledged as unstable on 32-bit systems.

However, given the need to do backups, zfs send sounds like a godsend over rsync, which is running into scaling problems of its own (e.g., nightly backups risk taking over 24 hours each).
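To make that concrete, the send/receive flow we're hoping to adopt would look roughly like this (pool, dataset, and host names below are made up):

```shell
# One-time full replication to the backup host.
zfs snapshot tank/data@base
zfs send tank/data@base | ssh backuphost zfs receive backup/data

# Nightly incremental: only blocks changed since the last snapshot travel.
zfs snapshot tank/data@nightly1
zfs send -i tank/data@base tank/data@nightly1 | ssh backuphost zfs receive backup/data
```

No per-file scanning is involved, which is the appeal over rsync.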

Was wondering if anybody here could weigh in with real-life experience?
Performance/scalability?

-Ben

PS: I joined their mailing list recently, will be watching there as well. We will, of course, be testing for a while before “making the switch”.

36 thoughts on - ZFS On Linux In Production?

  • I’ve only used ZFS on Solaris and FreeBSD. some general observations…

    1) you need a LOT of ram for decent performance on large zpools. 1GB ram above your basic system/application requirements per terabyte of zpool is not unreasonable.

    2) don’t go overboard with snapshots. a few hundred are probably OK, but thousands (*) will really drag down the performance of operations that enumerate file systems.

    3) NEVER let a zpool fill up above about 70% full, or the performance really goes downhill.

    4) I prefer using striped mirrors (aka raid10) over raidz/z2, but my applications are primarily database.

    (*) ran into a guy who had hundreds of zfs ‘file systems’ (mount points), per-user home directories, and was doing nightly snapshots going back several years, and his zfs commands were taking a long long time to do anything, and he couldn’t figure out why. I think he had over 10,000 filesystems * snapshots.
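    a quick way to see how much your commands have to enumerate (‘tank’ is a placeholder pool name):

```shell
# Count datasets and snapshots; thousands of entries slow down
# anything that walks the list.
zfs list -H -t filesystem -o name -r tank | wc -l
zfs list -H -t snapshot -o name -r tank | wc -l
```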

  • That seems quite reasonable to me. Our existing equipment has far more than enough RAM to make this a comfortable experience.

    Our intended use for snapshots is to enable consistent backup points, something we’re simulating now with rsync and its hard-link option. We haven’t figured out the best way to do this, but in our backup clusters we rarely have more than 100 save points at any one time.

    Thanks for the tip!

    Wow. Couldn’t he have gotten the same results by putting all the home directories on a single ZFS partition?

  • XFS is better than ext3/4 for many applications, but it’s still not as powerful as ZFS, which basically combines RAID, filesystem, and LVM into one. It sounds like the OP is really looking to take advantage of the extra features of ZFS.

    I don’t have my own, but I have heard of other shops which have had lots of success with ZFS on OpenSolaris and their variants. I know of some places which are starting to put ZFS on linux into testing or preproduction, but nothing really extensive yet.

    –keith

  • I believe he wanted quotas per user. ZFS quotas were only implemented at the file system level, at least as of whatever version he was running (I don’t know if that’s changed, as I never mess with quotas).

  • Most definitely. There are a few features that I’m looking for:

    1) MOST IMPORTANT: STABLE!

    2) The ability to make the partition bigger by adding drives with very minimal/no downtime.

    3) The ability to remove an older, (smaller) drive or drives in order to replace with larger capacity drives without downtime or having to copy over all the files manually.

    4) The ability to create snapshots with no downtime.

    5) The ability to synchronize snapshots quickly and without having to scan every single file. (backups)

    6) Reasonable failure mode. Things *do* go south sometimes. Simple is better, especially when it’s simpler for the (typically highly stressed)
    administrator.

    7) Big. Basically all filesystems in question can handle our size requirements. We might hit a 100 TB partition in the next 5 years.

    I think ZFS and BTRFS are the only candidates that claim to do all the above. Btrfs seems to have been “stable in a year or so” for as long as I could keep a straight face around the word “Gigabyte”, so it’s a non-starter at this point.

    LVM2/Ext4 can do much of the above. However, horror stories abound, particularly around very large volumes. Also, LVM2 can be terrible in failure situations.

    XFS does snapshots, but don’t you have to freeze the volume first? xfsrestore looks interesting for backups, though I don’t know if there’s a consistent “freeze point”. (what about ongoing writes?) Not sure about removing HDDs in a volume with XFS.
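    For what it’s worth, my understanding of the XFS dance (untested here; device and mount names are examples) is freeze, snapshot the underlying volume, thaw:

```shell
# Quiesce XFS so the LVM snapshot is consistent, then resume writes.
xfs_freeze -f /srv/data
lvcreate --snapshot --size 10G --name data_snap /dev/datavg/datalv
xfs_freeze -u /srv/data
```

    Ongoing writes just block for the second or two the freeze lasts.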

    Not as sure about ZFS’ stability on Linux (those who run direct Unix derivatives seem to rave about it) and failure modes.

  • Am 25.10.2013 um 00:47 schrieb John R Pierce :

    User and group quotas have been possible for some time.

    ZFS is cool. But there are a lot of issues and stuff that needs to be tuned but is difficult to find out if it needs to be tuned.

    Especially, if you run into performance-problems.

    Once you have some experience with it, I recommend reading this blog:
    http://nex7.blogspot.ch

    and of course, the FreeNAS forum, where you can read about stuff like that:

    https://bugs.freenas.org/issues/1531

    On the surface, ZFS is great. But god help you if you run into problems.

  • We tested ZFS on CentOS 6.4 a few months ago using a decent Supermicro server with 16GB RAM and 11 drives in RaidZ3. Same specs as a mid-range storage server that we build mainly using FreeBSD.

    Performance was not bad, but eventually we ran into a situation where we could not import a pool anymore after a kernel / modules update.

    I would not recommend it for production…

  • XFS is quite stable in CentOS 6.4 64-bit. there was a flaky kernel issue circa 6.2.

    XFS+LVM+mdraid does this, but it requires several manual steps…

    I’d take the new drives, add them to a new md mirror, then add that md device to the volume group, then lvextend the logical volume, and finally xfs_growfs the file system. yes, that’s a bunch more steps than the zpool/zfs commands, but in fact zfs is doing much the same thing internally.
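    spelled out, with hypothetical device/VG names, the whole sequence is something like:

```shell
# mirror the two new drives, then grow VG -> LV -> filesystem.
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
pvcreate /dev/md1
vgextend datavg /dev/md1
lvextend -L +1T /dev/datavg/datalv
xfs_growfs /srv/data    # XFS grows online, mounted at /srv/data
```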

    I believe lvm also lets you replace pv’s in the vg with new larger ones. I haven’t had to do this yet.

  • Be careful: you may have been reading some ZFS hype that turns out not as rosy in reality.

    Ideally, ZFS would work like a Drobo with an infinite number of drive bays. Need to add 1 TB of disk space or so? Just whack another 1 TB
    disk into the pool, no problem, right?

    Doesn’t work like that.

    You can add another disk to an existing pool, but it doesn’t instantly make the pool bigger. You can make it a hot spare, but you can’t tell ZFS to expand the pool over the new drive.

    “But,” you say, “didn’t I read that….” Yes, you did. ZFS *can* do what you want, just not in the way you were probably expecting.

    The least complicated *safe* way to add 1 TB to a pool is add *two* 1 TB
    disks to the system, create a ZFS mirror out of them, and add *that*
    vdev to the pool. That gets you 1 TB of redundant space, which is what you actually wanted. Just realize, you now have two separate vdevs here, both providing storage space to a single pool.

    You could instead turn that new single disk into a non-redundant separate vdev and add that to the pool, but then that one disk can take down the entire pool if it dies.

    Another problem is that you have now created a system where ZFS has to guess which vdev to put a given block of data on. Your 2-disk mirror of newer disks probably runs faster than the old 3+ disk raidz vdev, but ZFS isn’t going to figure that out on its own. There are ways to
    “encourage” ZFS to use one vdev over another. There’s even a special case mode where you can tell it about an SSD you’ve added to act purely as an intermediary cache, between the spinning disks and the RAM caches.

    The more expensive way to go — which is simpler in the end — is to replace each individual disk in the existing pool with a larger one, letting ZFS resilver each new disk, one at a time. Once all disks have been replaced, *then* you can grow that whole vdev, and thus the pool.
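    In zpool terms, the rolling replacement is a sketch like this (pool and device names invented):

```shell
# Grow the vdev automatically once every member has been replaced.
zpool set autoexpand=on tank
zpool replace tank /dev/sda /dev/sdc   # resilvers onto the larger disk
# Watch 'zpool status' until the resilver finishes,
# then repeat for each remaining disk in the vdev.
```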

    But, XFS and ext4 can do that, too. ZFS only wins when you want to add space by adding vdevs.

    Some RAID controllers will let you do this. XFS and ext4 have specific support for growing an existing filesystem to fill a larger volume.

    I find it simpler to use ZFS to replace a failed disk than any RAID BIOS
    or RAID management tool I’ve ever used. ZFS’s command line utilities are quite simply slick. It’s an under-hyped feature of the filesystem, if anything.

    A lot of thought clearly went into the command language, so that once you learn a few basics, you can usually guess the right command in any given situation. That sort of good design doesn’t happen by itself.

    All other disk management tools I’ve used seem to have just accreted features until they’re a pile of crazy. The creators of ZFS came along late enough in the game that they were able to look at everything and say, “No no no, *this* is how you do it.”

    I don’t think btrfs’s problem is stability as much as lack of features.
    It only just got parity redundancy (“RAID-5/6”) features recently, for example.

    It’s arguably been *stable* since it appeared in release kernels about four years ago.

    One big thing may push you to btrfs: With ZFS on Linux, you have to patch your local kernels, and you can’t then sell those machines as-is outside the company. Are you willing to keep those kernels patched manually, whenever a new fix comes down from upstream? Do your servers spend their whole life in house?

    It wouldn’t surprise me if ZFS on Linux is less mature than on Solaris and FreeBSD, purely due to the age of the effort.

    Here, we’ve been able to use FreeBSD on the big ZFS storage box, and share it out to the Linux and Windows boxes over NFS and Samba.

  • To be fair, you want to treat XFS the same way.

    And it, too is “unstable” on 32-bit systems with anything but smallish filesystems, due to lack of RAM.

  • I thought it had stack requirements that 32 bit couldn’t meet, and it would simply crash, so it is not built into 32bit versions of EL6.

  • yeah, I guess I should have made that clearer, that’s exactly what you do.

    and, it doesn’t restripe old files til they get rewritten. new stuff will be striped across all the vdevs, old stuff stays where it is.

  • We have redundancy at the server/host level, so even if we have a fileserver go completely offline, our application retains availability. We have an API in our application stack that negotiates with the (typically 2 or 3) file stores.

    Performance isn’t so much an issue – we’d partition our cluster and throw a few more boxes into place if it became a bottleneck.

    Not sure enough of the vernacular, but let’s say you have 4 drives in a RAID 1 configuration: one set of 1 TB drives and another set of 2 TB drives.

    A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
    B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.

    We have 3 TB of available storage. Are you suggesting we add a couple of 4 TB drives:

    A1 <-> A2 = 2x 1TB drives, 1 TB redundant storage.
    B1 <-> B2 = 2x 2TB drives, 2 TB redundant storage.
    C1 <-> C2 = 2x 4TB drives, 4 TB redundant storage.

    Then wait until ZFS moves A1/A2 over to C1/C2 before removing A1/A2? If so, that’s capability I’m looking for.

    The only way I’m aware of ext4 doing this is with resize2fs, which extends a partition on a block device. The only way to do that with multiple disks is to use a virtual block device like LVM/LVM2, which (as I’ve stated before) I’m hesitant to do.

    LVM2 will let you remove a drive without taking it offline. Can XFS do this without some block device virtualization like LVM2? (I didn’t think so)

    I sooo hear your music here! What really sucks about filesystem management is that at the time when you really need to get it right is when everything seems to be the most complex.

    For an example, btrfs didn’t have any sort of fsck, even though it was touted as at least “release candidate”. There was one released a while back that had some severe limitations. This has made me wary.

    Are you sure about that? There are DKMS RPMs on the website. http://zfsonlinux.org/epel.html

    The install instructions:

    $ sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release-1-3.el6.noarch.rpm
    $ sudo yum install zfs

    Much as I’m a Linux Lover, we may end up doing the same and putting up with the differences between *BSD and CentOS.

  • Yes, ZFS is complicated enough to have a specialized vocabulary.

    I used two of these terms in my previous post:

    – vdev, which is a virtual device, something like a software RAID. It is one or more disks, configured together, typically with some form of redundancy.

    – pool, which is one or more vdevs, which has a capacity equal to all of its vdevs added together.

    Well, maybe.

    You would have 3 TB *if* you configured these disks as two separate vdevs.

    If you tossed all four disks into a single vdev, you could have only 2 TB because the smallest disk in a vdev limits the total capacity.

    (This is yet another way ZFS isn’t like a Drobo[*], despite the fact that a lot of people hype it as if it were the same thing.)

    No. ZFS doesn’t let you remove a vdev from a pool once it’s been added, without destroying the pool.

    The supported method is to add disks C1 and C2 to the *A* vdev, then tell ZFS that C1 replaces A1, and C2 replaces A2. The filesystem will then proceed to migrate the blocks in that vdev from the A disks to the C disks. (I don’t remember if ZFS can actually do both in parallel.)

    Hours later, when that replacement operation completes, you can kick disks A1 and A2 out of the vdev, then physically remove them from the machine at your leisure. Finally, you tell ZFS to expand the vdev.

    (There’s an auto-expand flag you can set, so that last step can happen automatically.)
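    Using the disk names from your example, a sketch of the whole procedure (not verbatim output):

```shell
zpool replace tank A1 C1   # resilver A1's data onto C1
zpool replace tank A2 C2   # likewise for the other half of the mirror
# After resilvering, A1/A2 drop out of the vdev. Without the
# auto-expand flag set, expand the vdev manually:
zpool online -e tank C1 C2
```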

    If you’re not seeing the distinction, it is that there never were 3 vdevs at any point during this upgrade. The two C disks are in the A vdev, which never went away.

    Yes, implicit in my comments was that you were using XFS or ext4 with some sort of RAID (Linux md RAID or hardware) and Linux’s LVM2.

    You can use XFS and ext4 without RAID and LVM, but if you’re going to compare to ZFS, you can’t fairly ignore these features just because it makes ZFS look better.

    Neither does ZFS.

    btrfs doesn’t need an fsck for pretty much the same reason ZFS doesn’t. Both filesystems effectively keep themselves fsck’d all the time, and you can do an online scrub if you’re ever feeling paranoid.

    ZFS is nicer in this regard, in that it lets you schedule the scrub operation. You can obviously schedule one for btrfs, but that doesn’t take into account scrub time. If you tell ZFS to scrub every day, there will be 24 hours of gap between scrubs.

    We use 1 week at the office, and each scrub takes about a day, so the scrub date rotates around the calendar by about a day per week.

    ZFS also has better checksumming than btrfs: up to 256 bits, vs 32 in btrfs. (1 in 4 billion odds of irrecoverable data per block is still pretty good, though.)
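    (For the curious, that figure is just the width of the checksum: 2^32 possible values.)

```shell
# 2^32, written as 65536 * 65536 to stay within POSIX shell arithmetic.
echo "1 in $((65536 * 65536)) blocks"
```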

    All of the ZFSes out there are crippled relative to what’s shipping in Solaris now, because Oracle has stopped releasing code. There are nontrivial features in zpool v29+, which simply aren’t in the free forks of older versions of the Sun code.

    Some of the still-active forks are of even older versions. I’m aware of one popular ZFS implementation still based on zpool *v8*.

    If all you’re doing is looking at feature sets, you can find reasons to reject every single option.

    It is *possible* that keeping the CDDL ZFS code in a separate module manages to avoid tainting the GPL kernel code, in the same way that some people talk themselves into allowing proprietary GPU drivers with DRM support into their kernels.

    You’re playing with fire here. Bring good gloves.

    [*] or other hybrid RAID system; I don’t mean to suggest that only Drobo can do this

  • openZFS is doing pretty well on the BSD/etc side of things. some of the original developers of ZFS, who long since bailed on Oracle, are contributing code that’s not in the Oracle branch. they forked in 2010, with the last release from Sun, when OpenSolaris was discontinued. The current version of OpenZFS no longer relies on ‘version numbers’; instead it has ‘feature flags’ for all post-v28 features. The version in my FreeBSD 9.1-STABLE system has feature flags for…

    async_destroy (read-only compatible): Destroy filesystems asynchronously.
    empty_bpobj (read-only compatible): Snapshots use less space.
    lz4_compress: LZ4 compression algorithm support.

  • Thanks for the clarification of terms.

    Two separate vdevs is pretty much what I was after. Drobo: another interesting option :)

    I see the distinction about vdevs vs. block devices. Still, the process you outline is *exactly* the capability that I’m looking for, despite the distinction in semantics.

    I’ve had good results with Linux’ software RAID+Ext[2-4]. For example, I *love* that you can mount a RAID partitioned drive directly in a worst-case scenario. LVM2 complicates administration terribly. The widely touted, simplified administration of ZFS is quite attractive to me.

    I’m just trying to find the best tool for the job. That may well end up being Drobo!

  • huh? it hugely simplifies it for me, when I have lots of drives. I just wish mdraid and lvm were better integrated. to see how it should have been done, see IBM AIX’s version of LVM: you grow a JFS file system, and it automatically grows the underlying LV (logical volume), online, live. mirroring in AIX is done via lvm.

  • FWIW, I manage a small IT shop with a redundant pair of ZFS file servers running the zfsonlinux.org package on 64-bit ScientificLinux-6
    platforms. CentOS-6 would work just as well. Installing it with yum couldn’t be simpler, but configuring it takes a bit of reading and experimentation. I reserved a bit more than 1GByte of RAM for each TByte of disk.

    One machine (20 useable TBytes in raid-z3) is the SMB server for all of the clients, and the other machine (identically configured) sits in the background acting as a hot spare. Users tell me that performance is quite good.

    After about 2 months of testing, there have been no problems whatsoever, although I’ll admit the servers do not operate under much stress. There is a cron job on each machine that does a scrub every Sunday.

    The old ext4 primary file servers have been shut down and the ZFS boxes put into production, although one of the old ext4 servers will remain rsync’d to the new machines for a few more months (just in case).

    The new servers have the zfsonlinux repositories configured for manual updates, but the two machines tend to be left alone unless there are important security updates or new features I need.

    To keep the two servers in sync I use ‘lsyncd’ which is essentially a front-end for rsync that cuts down thrashing and overhead dramatically by excluding the full filesystem scan and using inotify to figure out what to sync. This allows almost-real-time syncing of the backup machine. (BTW, you need to crank the resources for inotify waaaaay up for large filesystems with a couple million files.)
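    For reference, the knobs in question (values are examples; size them to your file count):

```shell
# /etc/sysctl.conf additions: lsyncd needs one inotify watch per
# watched directory, plus queue headroom for bursts of changes.
fs.inotify.max_user_watches = 1048576
fs.inotify.max_queued_events = 65536
```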

    So far, so good. I still have a *lot* to learn about ZFS and its feature set, but for now it’s doing the job very nicely. I don’t miss the long ext4 periodic fsck’s one bit :-)

    YMMV, of course, Chuck

  • This must be the zpool v5000 thing I saw while researching my previous post. Apparently ZFSonLinux is doing the same thing, or is perhaps also based on OpenZFS.

  • Try everything. Seriously.

    You won’t know what you like, and what works *for you* until you have some experience. Buy a Drobo for the home, replace one of your old file servers with a FreeBSD ZFS box, enable LVM on the next Linux workstation.

    Drobos are no panacea, either.

    Years ago, my Drobo FS would disappear from the network occasionally, and have to be rebooted. (This seems to be fixed now.)

    My boss’s first-generation Drobo killed itself in a power outage. It was directly attached to his Windows box, and on restart, chkdsk couldn’t find a filesystem at all. A data recovery program was able to pull files back off the disk, though, so it’s not like the unit was actually dead. It just managed to corrupt the NTFS data structures thoroughly, despite the fact that it’s supposed to be a redundant filesystem. It implies Drobo isn’t using a battery-backed RAM cache, for their low-end units at least.

    Every Drobo I’ve ever used[*] has been much slower than a comparably-priced “dumb” RAID.

    The first Drobos would benchmark at about 20 MByte/sec when populated by disks capable of 100 MByte/sec raw. The two subsequent Drobo generations were touted as faster, but I don’t think I ever hit even 30
    MByte/sec.

    Data migration after replacing a disk is also uncomfortably slow. The fastest I’ve ever seen a disk replacement take is about a day. As disks have gotten bigger, my existing Drobos haven’t gotten faster, so now migration might take a week! It’s for this single reason that I now refuse to use single-disk redundancy with Drobos. The window without protection is just too big now.

    A lot of this is doubtless down to the small embedded processor in these things. ZFS on a “real” computer is simply in a different class.

    [*] I haven’t yet used a Thunderbolt or “B” series professional version.
    It is possible they’re running at native disk speeds. But then, they’re even more expensive.

  • …with cron…

    This is important because a ZFS scrub takes absolute lowest priority.
    (Presumably true for btrfs, too.) Any time the filesystem has to service an I/O request, the scrub stops, then resumes when the I/O
    request has completed, unless another has arrived in the meantime.

    This means that you cannot know how long a scrub will take unless you can exactly predict your future disk I/O. Scheduling a scrub with cron could land you in a situation where the previous scrub is still running due to unusually high I/O when another scrub request comes in.

    I initially set our ZFS file server up so that it would start scrubbing at close of business on Friday, but due to the way ZFS scrub scheduling works, the most recent scrub started late Wednesday and ran into Thursday. This isn’t a problem. The scrub doesn’t run in parallel to normal I/O, I don’t even notice that the array is scrubbing itself unless I go over and watchen das blinkenlights astaunished.
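    For anyone following along, the cron entry itself is trivial (pool name hypothetical); it’s the completion time that floats:

```shell
# /etc/crontab: start a scrub at 17:00 every Friday. It runs at the
# lowest I/O priority, so when it finishes depends on the workload.
0 17 * * 5  root  /sbin/zpool scrub tank
```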

  • lvm can do this with the --resizefs flag for lvextend, one command to grow both the logical volume and the fs, and it can be done live provided the fs supports it.
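    For example, with a hypothetical VG and LV:

```shell
# Grow the logical volume by 1 TB and the filesystem on it, online.
lvextend --resizefs -L +1T /dev/datavg/datalv
```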

    Peter

  • Joining the discussion late, and don’t really have anything to contribute on the ZFSonLinux side of things…

    At $DAYJOB we have been running ZFS via Nexenta (previously via Solaris
    10) for many years. We have about 5PB of this and the primary use case is for backups and handling of imagery.

    For the most part, we really, really like ZFS. My feeling is that ZFS itself (at least in the *Solaris form) is rock solid and stable. Other pieces of the stack, namely SMB/CIFS and some of the management tools provided by the various vendors, are a bit more questionable. We spend a bit more time fighting weirdnesses with things higher up the stack than we do on, say, our NetApp environment. To be expected.

    I’m waiting for Red Hat or someone else to come out and support ZFS.

  • Have run into this one (again, with Nexenta) as well. It can be pretty dramatic. We tend to set quotas to ensure we don’t exceed 75% or so max, but….

    …at least on the Solaris side, there’s a tunable you can set that keeps the metaslab (which gets fragmented and inefficient when pool utilization is high) entirely in memory. This completely resolves our throughput issue, but does require that you have sufficient memory to load the thing…

    echo "metaslab_debug/W 1" | mdb -kw

    There may be a ZOL equivalent.

    Ray

  • We have seen a significant degradation in performance on systems running at 75-80% of their pool capacity. I understand that the nature of COW will increase fragmentation. On large storage systems, though, 70% out of 100TB means that you have to always maintain 30TB free, which is not a small number in terms of cost per TB.


    George Kontostanos

  • Playing with lsyncd now, thanks for the tip!

    One question though: why did you opt to use lsyncd rather than ZFS snapshots/send/receive?

    Thanks,

    Ben

  • To be honest, isn’t it easier to install the server on FreeBSD or Solaris, where ZFS is natively supported? I moved my own server to FreeBSD and I didn’t notice a huge difference between Linux distros and FreeBSD. I have no idea about Solaris, but it might still be a similar environment.

    Sent from my iPhone

  • Greetings,

    And I know of a shop which could not recover a huge ZFS pool on FreeBSD and had to opt for something like Isilon, due to unavailability of controller drivers for FreeBSD.

  • Why is that? It sounds cost-intensive, if not ridiculous. Disk space not to be used, forbidden land… Is the remaining 30% used by some ZFS internals?

  • Probably just simple physics. If ZFS is smart enough to allocate space ‘near’ other parts of the related files/directories/inodes, it will have to do worse when there aren’t any good choices and it has to fragment things into the only remaining spaces and make the disk heads seek all over the place. Might not be a big problem on SSDs, though.
