Is Glusterfs Ready?

Hey,

Since Red Hat took control of GlusterFS, I’ve been looking to convert our old independent RAID storage servers into several non-RAID glustered ones.

The thing is that, here and there, I have heard a few frightening stories from users (even with the latest release). Has anyone used it long enough to say whether it can be blindly trusted, or whether it is almost there but not quite ready yet?

Thx, JD

21 thoughts on - Is Glusterfs Ready?

  • I can’t say anything about the RH Storage Appliance, but for us, gluster up to 3.2.x was most definitely not ready. We went through a lot of pain, and even after optimizing the OS configuration with help from gluster support, we were facing insurmountable problems. One of them was kswapd instances going into overdrive; once the machine reached a certain load, all networking functions just stopped. I’m not saying this is gluster’s fault, but even with support we were unable to configure the machines so that this wouldn’t happen. That was on CentOS 5.6/x86_64.

    Another problem was that due to load and frequent updates (each new version was supposed to fix bugs; some weren’t fixed, and there were plenty of new ones) the filesystems became inconsistent. In theory, each file lives on a single brick. The reality was that in the end, there were many files that existed on all bricks, one copy fully intact, the others with zero size and funny permissions. You can guess what happens if you’re not aware of this and try to copy/rsync data off all bricks to different storage. IIRC there were internal changes that required going through a certain procedure during some upgrades to ensure filesystem consistency, and these procedures were followed.

    We only started out with 3.0.x, and my impression was that development was focusing on new features rather than bug fixes.

  • From: isdtor

    From: David C. Miller

    I read that 3.3 was the first “RH” release. Let’s hope they did/will focus on bug fixing… So I guess I will wait a little bit more.

    Thx to both, JD

  • We use glusterfs in the CentOS build infrastructure … and for the most part it works fairly well.

    It is sometimes very slow on file systems with lots of small files, especially for operations like find or chmod/chown across a large volume.

    BUT, that said, it is very convenient to use commodity hardware and have redundant, large, failover volumes on the local network.

    We started with version 3.2.5 and now use 3.3.0-3, which is faster than 3.2.5 … so it should get better in the future.

    I can recommend glusterfs, as I have not found anything else that does what it does and does it better, but it is challenging and may not be good for all situations, so test it before you use it. A minimal setup sketch follows.
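
    For reference, a replicated two-server volume of the kind described here is typically created along these lines on GlusterFS 3.3.x (the hostnames and brick paths below are placeholders, not from the thread):

        # on one of the servers, with bricks already formatted and mounted locally
        gluster peer probe server2
        gluster volume create gv0 replica 2 transport tcp server1:/export/brick1 server2:/export/brick1
        gluster volume start gv0

        # on a client
        mount -t glusterfs server1:/gv0 /mnt/gv0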

  • From: Johnny Hughes

    I am not too worried about bad performance. I am afraid of getting paged one night because the 50+ TB of the storage cluster are gone following a bug/crash… It would take days/weeks to set it back up from the backups. If we were rich, I guess we would have two (or more) “geo-replicated” glusters and be able to withstand one failing… I would like the same level of trust that I have in RAID.

    JD

  • I have routinely used DRBD for things like this … 2 servers, one a complete failover of the other one. Of course, that requires a 50+ TB file system on each machine.
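
    For reference, a two-node mirror like that is described to DRBD with a resource file roughly like the following (DRBD 8.3-style syntax; the hostnames, devices and addresses are placeholders, not from the thread):

        # /etc/drbd.d/r0.res
        resource r0 {
            on storage1 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.1:7789;
                meta-disk internal;
            }
            on storage2 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.2:7789;
                meta-disk internal;
            }
        }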

  • How well do glusterfs or drbd deal with downtime of one of the members? Do they catch up quickly with incremental updates and what kind of impact does that have on performance as it happens? And is either suitable for running over distances where there is some network latency?

  • The extreme case is when one end fails, you rebuild it, and you have to replicate the whole thing. How long does it take to move 50 TB across your LAN? How fast can your file system write that much?
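
    As a rough back-of-envelope figure, assuming the network rather than the disks is the bottleneck and ignoring protocol overhead:

        50 TB over 1 GbE  (~125 MB/s theoretical)  ≈ 50,000,000 MB / 125 MB/s ≈ 400,000 s ≈ 4.6 days
        50 TB over 10 GbE (~1.25 GB/s theoretical) ≈ 40,000 s ≈ 11 hours

    In practice, disk write speed and healing/rebuild overhead usually stretch this further.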

  • Well, DRBD is a tried and true solution, but it requires dedicated boxes and crossover network connections, etc. I would consider it by far the best method for providing critical failover.

    I would consider glusterfs almost a different thing entirely … it provides the ability to string several partitions on different machines together into one shared network volume.

    Glusterfs does also provide redundancy if you set it up that way … and if you have a fast network and enough volumes, then performance is not badly degraded when a gluster volume comes back online, etc.

    However, I don’t think I would trust extremely critical things on glusterfs at this point.

  • I think the keyword with solutions like glusterfs, ceph, sheepdog, etc. is “elasticity”. DRBD and RAID work well as long as you have a fixed amount of data to deal with, but once you get into consistent data growth you need something that offers redundancy yet can be easily extended incrementally.

    Glusterfs seems to aim to be a solution that works well right now because it uses a simple file replication approach whereas ceph and sheepdog seem to go deeper and provide better architectures but will take longer to mature.

    Regards,
    Dennis

  • AFS isn’t what you expect from a distributed file system. Each machine works with cached copies of whole files and when one of them writes and closes a file the others are notified to update their copy. Last write wins.

  • David C. Miller writes:

    Heya,

    Well I guess I’m one of the frightening stories, or at least a previous employer was. They had a mere 0.1 petabyte store over 6 bricks yet they had incredible performance and reliability difficulties. I’m talking about a mission critical system being unavailable for weeks at a time. At least it wasn’t customer facing (there was another set of servers for that).

    The system was down more than it was up. Reading was generally OK (but very slow) but multiple threads writing caused mayhem – I’m talking lost files and file system accesses going into the multiple minutes.

    In the end I implemented a 1 TB store to be fuse-unioned over the top of the thing to take the impact of multiple threads writing to it. A single thread (overnight) brought the underlying glusterfs up to date.

    That got us more or less running but the darned thing spent most of its time re-indexing and balancing rather than serving files.

    To be fair, some of the problems were undoubtedly of their own making as 2 nodes were CentOS and 4 were fedora-12 – apparently the engineer couldn’t find the installation CD for the 2 new nodes and ‘made do’ with what he had! I recall that a difference in the system ‘sort’ command gave all sorts of grief until it was discovered, never mind different versions of the gluster drivers.

    I’d endorse Johnny’s comments about it not handling large numbers of small files well (i.e. < ~10 MB). I believe it was designed for large multimedia files such as clinical X-rays, i.e. a small number of large files. Another factor is that the available space is the physical space divided by 4, due to the replication across the nodes on top of the nodes being RAID’d themselves. Let’s see now – that was all of 6 months ago – unlike most of my war stories, it’s not ancient history!
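
    The fuse-union workaround described in this comment can be approximated with unionfs-fuse along these lines (the paths are placeholders, and the exact binary name varies by distro):

        # writes land on the fast local store; reads fall through to the gluster mount
        unionfs-fuse -o cow /fast/scratch=RW:/mnt/gluster=RO /srv/union

        # an overnight single-threaded job then brings the gluster volume up to date, e.g.
        rsync -a /fast/scratch/ /mnt/gluster/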

  • That is the problem with most of these stories: the setups tend to be of the “adventurous” kind. Not only was the setup very asymmetrical, but Fedora 12 was long outdated even 6 months ago. This kind of setup should be categorized as “highly experimental” and not something you actually use in production.

    That’s a problem with all distributed filesystems. For a few large files the additional time needed for round-trips is usually dwarfed by the actual I/O requests themselves, so you don’t notice it (much). With a ton of small files you incur lots of metadata-fetching round-trips for every few kilobytes read/written, which slows things down a great deal. So basically, if you want top performance for lots of small files, don’t use distributed filesystems.

    That really depends on your setup. I’m not sure what you mean by the nodes being RAIDed themselves. If you run a four-node cluster and keep two copies of each file, you would probably create two pairs of nodes where one node is replicated to the other, and then create a stripe over these two pairs, which should actually improve performance (see the sketch below). This would mean your available space would be cut in half, not divided by 4.
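
    In GlusterFS terms that layout is a distributed-replicated volume (GlusterFS has a separate “stripe” translator, but distributing files across replica pairs is the usual approach). With four bricks and replica 2 it would look roughly like this; the hostnames and paths are placeholders:

        # bricks are paired in the order given: (node1,node2) and (node3,node4) form the
        # two replica pairs, and files are distributed across the pairs
        gluster volume create gv0 replica 2 transport tcp \
            node1:/export/brick1 node2:/export/brick1 \
            node3:/export/brick1 node4:/export/brick1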

    Regards,
    Dennis

  • From: Dennis Jacobfeuerborn

    I think he meant gluster “RAID 1” replication on top of hardware RAID (RAID 10, I guess, from the x2) instead of standalone disks.

    JD

  • Hello, this comment was posted on a site I administer, where I chronologically publish an archive of some CentOS (and some other distros) lists:

    ===== [comment] =====
    A new comment on the post “Is Glusterfs Ready?”

    Author: Jeff Darcy (http://pl.atyp.us)
    Comment:

    Hi. I’m one of the GlusterFS developers, and I’ll try to offer a slightly different perspective.

    First, sure GlusterFS has bugs. Some of them even make me cringe. If we really wanted to get into a discussion of the things about GlusterFS that suck, I’d probably be able to come up with more things than anybody, but one of the lessons I learned early in my career is that seeing all of the bugs for a piece of software leads to a skewed perspective. Some people have had problems with GlusterFS but some people have been very happy with it, and I guarantee that every alternative has its own horror stories. GlusterFS and XtreemFS were the only two distributed filesystems that passed some *very simple* tests I ran last year. Ceph crashed. MooseFS hung (and also doesn’t honor O_SYNC). OrangeFS corrupted data. HDFS cheats by buffering writes locally, and doesn’t even try to implement half of the required behaviors for a general-purpose filesystem. I can go through any of those codebases and find awful bug after horrible bug after data-destroying bug . . . and yet each of them has their fans too, because most users could never possibly hit the edge conditions where those bugs exist. The lesson is that anecdotes do not equal data. Don’t listen to vendor hype, and don’t listen to anti-vendor bashing either. Find out what the *typical* experience across a large number of users is, and how well the software works in your own testing.

    Second, just as every piece of software has bugs, every truly distributed filesystem (i.e. not NFS) struggles with lots of small files. There has been some progress in this area with projects like Pomegranate and GIGA+, and we have some ideas for how to approach it in GlusterFS (see my talk at SDC next week), but overall I think it’s important to realize that such a workload is likely to be problematic for *any* offering in the category. You’ll have to do a lot of tuning, maybe implement some special workarounds yourself, but if you want to combine this I/O profile with the benefits of scalable storage it can all be worth it.

    Lastly, if anybody is paying a 4x disk-space penalty (at one site) I’d say they’re overdoing things. Once you have replication between servers, RAID-1 on each server is overkill. I’d say even RAID-6 is overkill. How many simultaneous disk failures do you need to survive? If the answer is two, as it usually seems to be, then GlusterFS replication on top of RAID-5 is a fine solution and requires a maximum of 3x (more typically just a bit more than 2x). In the future we’re looking at various forms of compression and deduplication and erasure codes that will all bring the multiple down even further.
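
    To make that arithmetic concrete (the disk counts below are only illustrative examples, not from the thread), raw disk consumed per byte stored is the replica count times raw/usable for the local RAID level:

        replica 2 on 3-disk RAID-5:   2 × 3/2   = 3.0× (the 3x worst case)
        replica 2 on 6-disk RAID-5:   2 × 6/5   = 2.4×
        replica 2 on 12-disk RAID-6:  2 × 12/10 = 2.4×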

    So I can’t say whether it’s ready or whether you can trust it. I’m not objective enough for my opinion on that to count for much. What I’m saying is that distributed filesystems are complex pieces of software, none of the alternatives are where any of us working on them would like to be, and the only way any of these projects get better is if users let us know of problems they encounter. Blog posts or comments describing specific issues, from people whose names appear nowhere on any email or bug report the developers could have seen, don’t help to advance the state of the art.

    ===== [/comment] =====

    Regards,

  • Just speaking for myself here, I’m less interested in ‘advancing the state of the art’ (which usually means running something painfully broken) than in finding something that already works… You didn’t paint a very rosy picture there. Would it be better to just forget filesystem semantics and use one of the distributed NoSQL databases (Riak, MongoDB, Cassandra, etc.)?

  • From: Jose P. Espinal

    By ready I just meant “safe enough to transfer all our production storage onto it and be 99.99% sure that it won’t vanish one night”… Again, the same level of trust that one can have in RAID storage. It can still fail, but nowadays that is quite rare (luckily it has never happened to me).

    I understand that developers need testers and feedback, and I am sure you are doing an excellent job, but we will start with a small test cluster and follow the project’s progress.

    Thx for your input, JD

  • You can’t ask “is Glusterfs ready” without a response such as “for what”. In addition, simply asking whether it is ready “in a production environment” is still too broad. You’ll only find out if it is ready for you if you dip your toe in the water, try it out (on a small subset), and then expand or contract its use based on your experiences.

    My personal experience is that Gluster’s replicate has been great and given me a very cost effective way to reduce my MTTR from catastrophic failures of a RAID1 bottleneck. In other words, RAID1 is fine… until the controller dies, then duplexed RAID1 is the solution… until the CPU or power supply fails. At some point thereafter you need to replicate the entire machine and this is where Gluster comes into play.

    In addition, there is a lot of commodity hardware available that can make performance issues negligible. Specifically, InfiniBand HCAs can be obtained for the price of dinner for two; wow!

    So, I definitely want to thank the Gluster team for producing a great product and filling what seems to be a big void in the marketplace! Gluster rocks ;-)
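
    For what it’s worth, GlusterFS can use InfiniBand either natively through its RDMA transport (when the rdma support is installed on servers and clients) or simply over IPoIB with the ordinary TCP transport. A minimal sketch, with placeholder names:

        # native RDMA transport
        gluster volume create gv0 replica 2 transport rdma server1:/export/brick1 server2:/export/brick1

        # alternatively: configure IPoIB on the HCAs and keep the default tcp transport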
