What FileSystems For Large Stores And Very Very Large Stores?


I was learning about the different filesystems that exist. I used to work on systems where ReiserFS was the star, but since there is no longer support from its creator, some consolidation needs to be done. I want to ask about a couple of FS options. EXT4 is amazing for one node, but for more than one it’s another story. I have heard about GFS2 and GlusterFS, and have read the docs and official materials from RH on them. The RH docs state that the EXT4 limit is 65k files per directory; I had a directory which was pretty loaded with files, and while I am unsure exactly how big it was, I am almost sure it held more than 65k files.
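
For what it’s worth, a quick way to check how many entries such a directory actually holds (the path and file names below are made up for the demo):

```shell
# Count entries in a heavily loaded directory (path is hypothetical).
dir=/tmp/demo_maildir
mkdir -p "$dir"
touch "$dir/msg1" "$dir/msg2" "$dir/msg3"    # stand-ins for real mail files
find "$dir" -maxdepth 1 -type f | wc -l      # counts only regular files
```

Using find rather than ls avoids sorting, which matters once a directory holds tens of thousands of entries.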

I was considering using GlusterFS for a very large storage system with an NFS front end. I am still unsure whether EXT4 should be able to handle more than 16TB, since the Linux kernel ext4 docs at https://www.kernel.org/doc/Documentation/filesystems/ext4.txt state in section 2.1: “* ability to use filesystems > 16TB (e2fsprogs support not available yet)”. So can I use it or not? If there are no tools to handle this size then I cannot trust it.

I want to create a store of more than 16TB based on GlusterFS, since it allows me to use a 2-3 ring setup, putting the storage in the form: 1 client -> HA NFS servers -> GlusterFS cluster.

It seems to me that GlusterFS is a better choice than Swift, since RH does provide support for it.

Every response will be appreciated.

Thanks, Eliezer

13 thoughts on - What FileSystems For Large Stores And Very Very Large Stores?

  • Eliezer Croitoru wrote:

    I would not go over 16TB (actually, I should say that I have not gone over it, and I have several LARGE RAID boxes). The tools aren’t there, or don’t work usefully (days to check a filesystem isn’t “useful”).

    We tried out glusterfs a couple of years ago but, from what I gather from my manager and the user who tried it, there were some issues. I have no idea how one would fsck a glusterfs.

    *shrug* Why not stay under 16TB, and just mount the filesystems where you want? Were you looking at someone creating a single file larger than that?
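
    For illustration, that could be as simple as several sub-16TB ext4 volumes mounted under one tree instead of one huge filesystem (device and mount names here are made up):

```
# /etc/fstab fragment (hypothetical volumes, each under 16TB)
/dev/vg_store/vol1  /srv/store/vol1  ext4  defaults  0 2
/dev/vg_store/vol2  /srv/store/vol2  ext4  defaults  0 2
/dev/vg_store/vol3  /srv/store/vol3  ext4  defaults  0 2
```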

    mark

  • If you really only have one client, you might look at Ceph for distributed block storage with XFS on top, so you don’t need to run through FUSE. Or, if your application can be changed to use the S3 interface, you could get away from the POSIX filesystem bottlenecks completely.

    No experience with this stuff – just sounds like the promising up-and-coming thing…

  • ----- Original Message -----
    | In the RH docs it states the EXT4 limit files per directory is 65k
    | and I had a directory which was pretty loaded with files [...]
    | I am still unsure EXT4 should or shouldn’t be able to handle more
    | then 16TB [...] if there are no tools to handle this size then I
    | cannot trust it.
    |
    | I want to create a storage with more then 16TB based on GlusterFS
    | [...] 1 client -> HA NFS servers -> GlusterFS cluster.

    As someone who has some rather large volumes for research storage, I will say that ALL of the file systems have limitations, *especially* in the case of failures. I have typical volumes that range from 16TB up to 48TB, and the big issue is performing file system checks: a lot of information gets loaded into memory in order to run the check. A number of years ago I was unable to perform an EXT4 file system check on a 15TB volume without consuming over 32GB of memory, and that was on a file system with very few files. At the time, the file server only had 8GB of memory, so this presented a problem.

    However, while this problem was solvable, it was also dependent on usage. The file system in question only held large files, typically gigabytes in size. For another filer, this time with 48GB of memory but tens of millions of very small files, the file system check took nearly 96GB of memory.

    So far, without a doubt, XFS has been the best “overall” file system for our usages, but YMMV. It would seem that Red Hat is also pushing it as the file system of choice going forward until something better ( btrfs *snicker* ) comes along. XFS is also the recommended file system for use with GlusterFS so that makes it an easy choice too.

    GlusterFS itself has some H/A built in. You can talk to any of the GlusterFS servers via NFS and it will fully operate in an active/active manner, so your diagram would be 1 client -> Gluster cluster (via the protocols Gluster supports: NFS/CIFS/native). I have found it to be rather fragile in some respects, and some of my workloads just don’t map well to it even though it looks like they should gain some benefit. However, it does work seemingly well for other workloads and it is being actively developed.
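
    A minimal sketch of the client side (hostnames and volume name are invented) — any server in the trusted pool can answer the mount, since Gluster’s built-in NFS server is active/active and speaks NFSv3:

```shell
# Mount a Gluster volume over NFS (Gluster's built-in NFS server is NFSv3-only).
mount -t nfs -o vers=3 gluster1.example.com:/bigvol /mnt/bigvol

# Or use the native FUSE client, which handles failover between servers itself.
mount -t glusterfs gluster1.example.com:/bigvol /mnt/bigvol
```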

    GlusterFS also allows you to “import” existing file systems at a later time. So feel free to start off with a standard XFS volume, but be mindful of the XFS options that GlusterFS requires, namely the inode size being 64K; then, if you decide to add Gluster to your storage infrastructure, you can perform said “import” and start replication or distributed file serving from Gluster.
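
    As a rough sketch (all device names, sizes, and hostnames here are hypothetical), creating an XFS brick and a two-way replicated volume might look like the following. Note that the Red Hat Gluster guides I have seen recommend a 512-byte inode size (XFS inodes max out at 2048 bytes), so check the current docs for the exact mkfs options:

```shell
# Format the brick filesystem (512-byte inodes per common Gluster guidance).
mkfs.xfs -i size=512 -L brick1 /dev/sdb1
mkdir -p /export/brick1
mount /dev/sdb1 /export/brick1

# Build and start a 2-way replicated volume across two servers.
gluster volume create bigvol replica 2 \
    server1:/export/brick1 server2:/export/brick1
gluster volume start bigvol
```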


    James A. Peltier
    Manager, IT Services – Research Computing Group
    Simon Fraser University – Burnaby Campus
    Phone : 778-782-6573
    Fax : 778-782-3045
    E-Mail : jpeltier@sfu.ca
    Website : http://www.sfu.ca/itservices

    “A successful person is one who can lay a solid foundation from the bricks others have thrown at them.” -David Brinkley via Luke Shaw

  • Have you done anything with ceph? With/without a filesystem on top?

    Is the (snicker) from the slow development or do you think the goals are impossible? Btrfs on top of ceph sounds as good as a posix-looking fs could get.

  • ----- Original Message -----
    | > As someone who has some rather large volumes for research storage I
    | > will say that ALL of the file systems have limitations,
    | > *especially* in the case of failures. I have typical volumes that
    | > range from 16TB up to 48TB and the big issue is when it comes to
    | > performing file system checks.
    |
    | Have you done anything with ceph? With/without a filesystem on top?

    Nope, went with GlusterFS testing as it was

    1) Something that we could get full-stack support for if we opted to
    2) Did as little as possible to deviate from the base OS as provided

    | > So far, without a doubt, XFS has been the best “overall” file
    | > system for our usages, but YMMV. It would seem that Red Hat is
    | > also pushing it as the file system of choice going forward until
    | > something better ( btrfs *snicker* ) comes along. XFS is also the
    | > recommended file system for use with GlusterFS so that makes it an
    | > easy choice too.
    | >
    |
    | Is the (snicker) from the slow development or do you think the goals
    | are impossible? Btrfs on top of ceph sounds as good as a
    | posix-looking fs could get.

    I don’t like to start flame wars, so let’s just say that I think the limitations imposed on btrfs from a design perspective are such that I don’t think there is a chance it will ever match the capabilities of the file system it is trying to compete against (ZFS). There is a reason the ZFS developers decided to toss out years of experience in file systems and start over: the overhead and limitations of the traditional methods just didn’t cut it.

    Again, these are only my opinions, based on what I see in front of me today and taking into consideration what I saw ZFS go through over the past 5-7 years.


    James A. Peltier

  • I just think it is sad that the Linux kernel license prohibits distribution with ‘best-of-breed’ components… But conceptually, distributing the block storage seems like a good idea, and ZFS embeds a lot of the block device management.

  • ----- Original Message -----
    | > I don’t like to start flame wars so lets just say that I think the
    | > limitations imposed on btrfs from a design perspective were such
    | > that I don’t think there is a chance that it will ever get the
    | > capabilities of the file system that it is trying to compete
    | > against (ZFS). [...]
    |
    | I just think it is sad that the linux kernel license prohibits
    | distribution with ‘best-of-breed’ components… But conceptually,
    | distributing the block storage seems like a good idea and zfs
    | embeds a lot of the block device management.
    |
    | --
    | Les Mikesell
    | lesmikesell@gmail.com
    I guess that’s what FUSE is for, LOL! Indeed, ZFS does implement this, but none of that block storage is impacted by or utilized by anything but ZFS. With btrfs, this all has to do with maintaining the legacy way of doing things, which is a severely limiting factor. The ZFS devs knew this would not be a backward-compatible change that their native file system, UFS, would be able to use at all anyway. They knew that UFS/SFS was not what was required for the next round of storage technology either.

    I just think that there were many correct justifications for what the ZFS devs did, and that by doing so ZFS became a much better product for it. I mean, try to understand the btrfs syntax vs the zfs syntax. btrfs is INSANE!


    James A. Peltier

  • OK, so back to the issue at hand: I have mail storage with more than 65k users per domain, and ext4 doesn’t support a directory listing of that size. ReiserFS does fit the purpose, but ext4 doesn’t even start to scratch it. Now the real question is:
    What FS would you use as a dovecot backend to store a domain with more than 65k users?

    Eliezer

    Just for interest: I have had two 44TB “RAID 6” arrays using EXT4, running with heavy usage 24/7 on EL6 since January 2013, without any problems so far. I rebuilt e2fsprogs from source, something along the lines below, looking at my notes.

    wget http://atoomnet.net/files/rpm/e2fsprogs/e2fsprogs-1.42.6-1.el6.src.rpm
    yum-builddep e2fsprogs-1.42.6-1.el6.src.rpm
    rpmbuild --rebuild --recompile e2fsprogs-1.42.6-1.el6.src.rpm
    cd /root/rpmbuild/RPMS/x86_64
    rpm -Uvh *.rpm

    ###### build array with a partition #######
    parted /dev/sda mkpart primary ext4 1 -1
    mkfs.ext4 -L sraid1v -E stride=…,stripe-width=84 /dev/sda1

    ###### build array without a partition #######

    mkfs.ext4 -L sraid1v -E stride=…,stripe-width=84 /dev/sda

    Maybe this will help someone.

    Cheers Steve

  • It was back in 1995 when I had this kind of problem with about 0.05 M accounts, and our solution was used until at least 0.5 M accounts, when I left the company. The filesystem in question back then degraded severely in performance when there were more than about 200 files in a directory.

    We ended up cooking our own way using FNV-1a hash, but Dovecot has something similar natively:

    http://wiki2.dovecot.org/MailLocation

    The “Directory hashing” section is the interesting part, although that explanation does look like it needs a complete rewrite.

    Having lots of file names in a directory will likely mean that a) the directory file itself has grown over time in small extents scattered all over the disk, and b) reading it therefore becomes very inefficient.

    Having a hashed subdirectory structure means that each directory will likely not overflow a 4kB file system block, or will at most have a few extent blocks, and reading them will not be _that_ much slower.
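
    As a toy illustration of the idea — this is not Dovecot’s actual scheme, and the spool path layout here is invented for the example — a 32-bit FNV-1a hash can spread mailboxes across a two-level subdirectory tree:

```shell
# Toy FNV-1a (32-bit) hash of a mailbox name, used to pick a
# two-level spool subdirectory (path layout is hypothetical).
fnv1a32() {
    h=2166136261                       # FNV-1a 32-bit offset basis
    for b in $(printf '%s' "$1" | od -An -tu1); do
        h=$(( ((h ^ b) * 16777619) & 0xFFFFFFFF ))   # xor byte, multiply by FNV prime
    done
    echo "$h"
}

h=$(fnv1a32 "eliezer")
printf '/var/spool/mail/%02x/%02x/%s\n' "$(( h & 0xff ))" "$(( (h >> 8) & 0xff ))" "eliezer"
```

    With 256 × 256 buckets, even millions of mailboxes leave each directory small enough to fit in a handful of filesystem blocks.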

    Best Regards, Matti Aarnio

  • Thanks!
    This was very helpful; I am testing something and will write to the dovecot mailing list about it.

    Eliezer
