Large File System Idea


This idea is intriguing…

Suppose one has a set of file servers called A, B, C, D, and so forth, all running CentOS 6.5 64-bit and all interconnected with 10GbE. These file servers can be divided into identical pairs, so A has the same configuration (disks, processors, etc.) as B, C the same as D, and so forth (because this is what I have; there are ten servers in all). Each file server has four Xeon 3GHz processors and 16GB memory. File server A acts as an iscsi target for logical volumes A1, A2,…An, and file server B acts as an iscsi target for logical volumes B1, B2,…Bn, where each LVM volume is 10 TB in size (a RAID-5 set of six 2TB NL-SAS disks). There are no file systems directly built on any of the LVM volumes. The two members of a server pair (A,B) are in different cabinets (albeit in the same machine room), are on different power circuits, and have UPS protection.

A server system called S (which has six processors and 48 GB memory, and is not one of the file servers) acts as iscsi initiator for all targets. On S, A1 and B1 are combined into the software RAID-1 volume /dev/md101. Similarly, A2 and B2 are combined into /dev/md102, and so forth for as many target pairs as one has. The initial sync of /dev/md101 takes about 6 hours, with the sync speed being around 400 MB/sec for a 10TB volume. I realize that only half of the 10-gig bandwidth is available while writing, since the data is being written twice.
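
As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. This is only a back-of-envelope estimate: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and ignores iscsi and TCP overhead, and the function names are made up for illustration.

```python
# Back-of-envelope check of the resync time and the mirrored-write ceiling.
# Assumptions: decimal units, no protocol overhead, a single 10GbE link on S.

def resync_hours(volume_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy volume_tb terabytes at rate_mb_per_s megabytes per second."""
    seconds = (volume_tb * 1e12) / (rate_mb_per_s * 1e6)
    return seconds / 3600.0


def mirrored_write_ceiling_mb(link_gbit: float, copies: int = 2) -> float:
    """Upper bound (MB/s) on write throughput when each block crosses one link `copies` times."""
    link_mb_per_s = link_gbit * 1000.0 / 8.0
    return link_mb_per_s / copies


if __name__ == "__main__":
    print(f"resync of 10 TB at 400 MB/s: {resync_hours(10, 400):.1f} h")
    print(f"write ceiling on 10GbE with 2 copies: {mirrored_write_ceiling_mb(10):.0f} MB/s")
```

At a steady 400 MB/sec a 10 TB volume works out to a little under 7 hours, which is in the same range as the sync time quoted above, and the theoretical ceiling for mirrored writes over a single 10GbE link is about 625 MB/sec before any overhead.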

All of the /dev/md10X volumes are LVM PVs and are members of the same volume group, and there is one logical volume that occupies the entire volume group. An XFS file system (-i size=512, inode64) is built on top of this logical volume, and S NFS-exports that to the world (an HPC cluster of about 200 systems). In my case, the size of the resulting file system will ultimately be around 80 TB. The I/O performance of the xfs file system is most excellent, and exceeds by a large amount the performance of the equivalent file systems built with such packages as MooseFS and GlusterFS: I get about 350 MB/sec write speed through the file system, and up to 800 MB/sec read.
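
The layering described above (RAID-1 pairs, then LVM, then XFS) can be sketched as a dry-run script that only builds the command lines and runs nothing. Everything named here is an assumption for illustration: the /dev/iscsi/... device paths, the volume group name vg_home, the logical volume name lv_home, the mount point /export/home, and the default inode size of 512 bytes.

```python
# Dry-run sketch: returns the commands that would assemble the stack;
# it does not execute anything. Device paths, VG/LV names, the mount
# point and the inode size are hypothetical placeholders.
from typing import List, Tuple


def build_stack_commands(pairs: List[Tuple[str, str]],
                         vg: str = "vg_home",
                         lv: str = "lv_home",
                         mount_point: str = "/export/home",
                         inode_size: int = 512,
                         first_md: int = 101) -> List[str]:
    cmds: List[str] = []
    md_devices: List[str] = []
    for offset, (dev_a, dev_b) in enumerate(pairs):
        md = f"/dev/md{first_md + offset}"
        # One RAID-1 mirror per (A_n, B_n) pair of iscsi-backed devices.
        cmds.append(f"mdadm --create {md} --level=1 --raid-devices=2 {dev_a} {dev_b}")
        # Each mirror becomes an LVM physical volume.
        cmds.append(f"pvcreate {md}")
        md_devices.append(md)
    # All mirrors go into one volume group with one LV spanning it.
    cmds.append(f"vgcreate {vg} " + " ".join(md_devices))
    cmds.append(f"lvcreate -l 100%FREE -n {lv} {vg}")
    # inode64 is a mount option, so it appears on the mount line.
    cmds.append(f"mkfs.xfs -i size={inode_size} /dev/{vg}/{lv}")
    cmds.append(f"mount -o inode64 /dev/{vg}/{lv} {mount_point}")
    return cmds


if __name__ == "__main__":
    demo_pairs = [("/dev/iscsi/A1", "/dev/iscsi/B1"),
                  ("/dev/iscsi/A2", "/dev/iscsi/B2")]
    for line in build_stack_commands(demo_pairs):
        print(line)
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; the output would need to be reviewed and adapted to real device names before use.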

I have built something like this, and by performing tests such as sending a SIGKILL to one of the tgtd processes, I have been unable to kill access to the file system. Obviously one has to manually intervene on the return of the tgtd in order to fail/hot-remove/hot-add the relevant target(s) to the md device. Presumably this will be made easier by using persistent device names for the targets on S.
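
The manual fail/hot-remove/hot-add step could look roughly like the following dry-run sketch, which again only builds command lines. The member path /dev/iscsi/A1 is a hypothetical persistent name, and the exact sequence needed may differ depending on the state md reports for the lost member.

```python
# Dry-run sketch of re-adding a mirror member after its target comes back.
# The member path is a hypothetical persistent device name; nothing is executed.
from typing import List


def readd_member_commands(md: str, member: str) -> List[str]:
    return [
        f"mdadm {md} --fail {member}",    # mark the stale member as faulty
        f"mdadm {md} --remove {member}",  # hot-remove it from the array
        f"mdadm {md} --add {member}",     # hot-add it; md then resyncs the mirror
    ]


if __name__ == "__main__":
    for line in readd_member_commands("/dev/md101", "/dev/iscsi/A1"):
        print(line)
```

With persistent device names the same member path can be used before and after the target returns, which is what would make a sequence like this scriptable.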

One could probably expand this to supplement the server S with a second server T to allow the possibility of failover of the service should S croak. I haven’t tackled that part yet.

So, what failure scenarios can take out the entire file system, assuming that both members of a pair (A,B) or (C,D) don’t go down at the same time? There’s no doubt that I haven’t thought of something.
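
To make the question concrete, the availability assumption can be written down as a tiny model. This is a deliberately simplified sketch: it treats each server as simply up or down, assumes the single logical volume spans every mirror (so losing any one mirror loses the file system), and does not model switches, cabling, or software faults.

```python
# Simplified availability model: the file system is reachable only if the
# head node S is up and every mirror pair still has at least one member up.
# Network, power and software failure modes are deliberately not modelled.
from typing import Iterable, List, Set, Tuple


def filesystem_available(up: Iterable[str],
                         pairs: List[Tuple[str, str]],
                         head: str = "S") -> bool:
    up_set: Set[str] = set(up)
    if head not in up_set:
        return False  # S is a single point of failure until T exists
    return all(a in up_set or b in up_set for a, b in pairs)


if __name__ == "__main__":
    pairs = [("A", "B"), ("C", "D")]
    print(filesystem_available({"S", "A", "C"}, pairs))       # one member of each pair is up
    print(filesystem_available({"S", "A", "B"}, pairs))       # pair (C,D) is completely down
    print(filesystem_available({"A", "B", "C", "D"}, pairs))  # S itself is down
```

Even this toy model shows two scenarios outside the stated assumption: the loss of S, and anything shared by both members of a pair.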

Steve

16 thoughts on - Large File System Idea

  • Sounds like you might be reinventing the wheel. DRBD [0] does what it sounds like you’re trying to accomplish [1].

    Especially since you have two nodes A+B or C+D that are RAIDed over iSCSI.

    It’s rather painless to set up two nodes with DRBD. But once you want to sync three [2] or more nodes with each other, the number of resources (DRBD block devices) becomes exponentially larger. Linbit, the developers behind DRBD, call it resource stacking.

    [0] http://www.drbd.org/
    [1] http://www.drbd.org/users-guide-emb/ch-configure.html
    [2] http://www.drbd.org/users-guide-emb/s-three-nodes.html

  • I think not; see below.

    I am familiar with DRBD, having used it for a number of years. However, I don’t think this does what I am describing. With a conventional two-node DRBD setup, the drbd block device appears on both storage nodes, one of which is primary. In this case, writes to the block device are done from the client to the primary, and the storage I/O is done locally on the primary and is forwarded across the network by the primary to the secondary.

    What I am describing in my experiment is a setup in which the block device (/dev/mdXXX) appears on neither of the storage nodes, but on a third node. Writes to the block device are done from the client to the third node and are forwarded over the network to both storage servers. The whole setup can be done with only packages from the base repo.

    I don’t see how this can be accomplished with DRBD, unless the DRBD two-node setup then iscsi-exports the block device to the third node. With provision for failover, this is surely a great deal more complex than the setup that I have described.

    If DRBD had the ability for the drbd block device to appear on a third node (one that *does not have any storage*), then it would perhaps be different.

    Steve

  • I have tried glusterfs; the large file performance is reasonable, but the small file performance is too low to be useable.

    Steve

  • Right, DRBD is no longer available from the CentOS Extras repo (like it was in EL5).

    Ah, good point.

  • Why specifically do you care about that? With both your solution and the DRBD one, the clients only see an NFS endpoint, so what does it matter that this endpoint is placed on one of the storage systems?
    Also, while streaming performance may be OK with your solution, latency is going to be fairly terrible due to the round-trips and synchronicity required, so this may be a nice setup for e.g. a backup storage system but is not really suited as a more general-purpose solution.

    Regards,
    Dennis

  • The whole point of the exercise is to end up with multiple block devices on a single system so that I can combine them into one VG using LVM, and then build a single file system that covers the lot. On a budget, of course.

    Yes, I hear what you are saying. However, I have investigated MooseFS and GlusterFS using the same resources, and my experimental iscsi-based setup gives a file system that is *much* faster than either in practical use, latency notwithstanding.

    Steve

  • I have not looked at Lustre, as I have heard many negative things about it (including Oracle ownership). The only business using Lustre where I know the admins has had a lot of trouble with it. No redundancy.

    Fhgfs looks interesting, and I am planning on looking at it, but have not yet done so.

    MooseFS and GlusterFS have both been evaluated, and were too slow. In the case of GlusterFS, waaaay too slow.

    Steve

  • How recently have you looked at Gluster? It has seen some significant progress, though small files are still its weakest area. I believe that some use-cases have found that NFS access is faster for small files.

    Ted Miller Elkhart, IN

  • I know some Lustre admins who indeed have the faraway stare of people who have survived natural disasters. It can be somewhat unstable and difficult to manage when you try to roll it yourself, but if you get the professionals and have it properly supported you can have a good time.

    Lustre is not owned by Oracle; it’s free and open source software licensed under GPL v2. It does have redundancy, but this is handled at the hardware level with active/active object storage servers and metadata servers.

    Primarily supported by Intel. Well, they have the most developers and sell the most support contracts. It is a very interesting replacement for Hadoop HDFS.

    The Fraunhofer Parallel Cluster File System (FhGFS) has just been spun out of the German institute in which it was born and has been renamed BeeGFS (the Germans never had a knack for snappy names :).

    It is a very strong contender for these kinds of workloads and is probably just about to be fully open sourced.

    In general, parallel filesystems such as Lustre are quite hard to get right, and most people fail to grasp the complexity and the skill required to implement them. People have a go, fsck it up (heh) and then blame the software when it doesn’t work properly. If you really have a business requirement for insane metadata performance over a single, multi-petabyte namespace, you should be sure to tread lightly and carry a good support contract.

    I believe Gluster to be a rapidly dying project; however, I am willing to be set straight on this point. It seems that anyone looking at Gluster will also be looking at Ceph, and this is an obviously better system.

  • Yes, I really need file system semantics; I am storing home directories.

    Steve

  • We were using glusterfs for shared home directories and it was really slow. We’re now using an NFS share and it’s working much faster.

    Mark

  • In that case, wouldn’t it be simpler to have several separate DRBD pairs with the directory from the appropriate server automounted at login instead of consolidating them to the point where you have scaling issues?

    And have you tried ceph’s filesystem layer?