Deduplication Data For CentOS?


Hi list,

is there any working solution for data deduplication on CentOS?
We are trying to find a solution for our backup server, which currently runs a bash script invoking xdelta(3). But having this functionality in the filesystem would be much friendlier…
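
For reference, the kind of xdelta(3) step such a script typically performs looks roughly like this; the file names, paths and retention scheme below are assumptions, not our actual script:

    #!/bin/bash
    # minimal sketch: store only the binary diff of today's copy against a baseline
    BASE=/backup/base/app.nsf                      # last full copy kept on the backup server
    NEW=/data/app.nsf                              # current file to back up
    OUT=/backup/deltas/app.nsf.$(date +%F).vcdiff

    xdelta3 -e -s "$BASE" "$NEW" "$OUT"            # encode a delta against the baseline
    # restore later with: xdelta3 -d -s "$BASE" "$OUT" /restore/app.nsf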

We have looked into lessfs, SDFS and ddar. Are these filesystems ready to use (on CentOS)?
ddar is something different, I know.

Thx Rainer

28 thoughts on - Deduplication Data For CentOS?

  • On 27.08.2012 14:15, John Doe wrote:

    Yes, I know it has this feature, but is there a working ZFS implementation for Linux?
    Linux is a must, because the data we are backing up are Domino databases, and it is also a customer requirement.

    And btrfs has not implemented this feature yet, I think.

  • I have heard some positive feedback about http://zfsonlinux.org/ but I
    have not had time to test it myself yet. It probably depends on your intended usage. It is a new in-kernel ZFS implementation (different from the old FUSE implementation).

    RHEL 6.2 x86_64 is listed as one of the supported OSes, so it probably works fine with CentOS too.

    There is some positive and negative feedback in the following links:

    https://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/5a739039623f8fb1

    http://pingd.org/2012/installing-zfs-raid-z-on-CentOS-6-2-with-ssd-caching.html

    Please share your results if you do any testing :)
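
    If anyone does test it, turning deduplication on for a zfsonlinux pool is only a couple of commands. A rough sketch, with made-up pool and device names:

        # create a mirrored pool and enable dedup (and compression, which is far cheaper)
        zpool create backup mirror /dev/sdb /dev/sdc
        zfs set dedup=on backup
        zfs set compression=on backup
        zpool list backup        # the DEDUP column shows the ratio actually achieved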

  • BackupPC does exactly this. It's not a generalized solution for deduplication of a filesystem; instead, it's a backup system, designed to back up multiple targets, that implements deduplication on the backup tree it maintains.

  • Not _exactly_, but maybe close enough and it is very easy to install and try. BackupPC will use rsync for transfers and thus only uses bandwidth for the differences, but it uses hardlinks to files to dedup the storage. It will find and link duplicate content even from different sources, but the complete file must be identical. It does not store deltas, so large files that change even slightly between backups end up stored as complete copies (with optional compression).

  • Deduplication with ZFS takes a lot of RAM.

    I would not yet trust any of the Linux ZFS projects for data that I
    wanted to keep long term.
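
    Before committing to it, one way to gauge the cost is to let zdb simulate dedup on an existing pool and look at the projected table size. A sketch, with the pool name assumed:

        # simulate dedup on pool "backup" without enabling it; prints a DDT
        # histogram and the projected dedup ratio
        zdb -S backup
        # rule of thumb: each unique block costs on the order of ~320 bytes of
        # dedup-table memory, so millions of small blocks add up quickly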

  • I’ve tried, twice, to suggest a workaround that doesn’t involve a new, and possibly experimental, filesystem: use rsync with hard links, which is what we do (a sketch follows this message). There’s no way we have enough disk space for 5 weeks of terabytes of data….

    However, the reason I haven’t been able to suggest it is that I’m being blocked by spamhost. And when I go there, it asserts I’m listed in the CBL. And when I go *THERE*, it tells me I’m not.

    Oh, and now, when I try to go to the CBL, it’s down.

    I don’t suppose the CentOS list has a whitelist….

    mark
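
    The rsync-with-hard-links approach mark refers to is usually built around --link-dest. A minimal sketch, with hosts, paths and rotation policy as assumptions:

        #!/bin/bash
        # keep dated snapshot directories; unchanged files become hard links to
        # yesterday's copy, so only changed files consume new space
        SRC=backupuser@server:/data/
        DEST=/backup/snapshots
        TODAY=$(date +%F)
        YESTERDAY=$(date -d yesterday +%F)

        rsync -a --delete \
              --link-dest="$DEST/$YESTERDAY" \
              "$SRC" "$DEST/$TODAY"

    As noted elsewhere in the thread, this only saves space for files that are completely unchanged between runs; a file that changes even slightly is stored in full again.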

  • This is something I have been thinking about peripherally for a while now. What are your impressions of SDFS (OpenDedup)? I had been hoping it would be pretty good. Any issues with it on CentOS?

    ❧ Brian Mathis

  • On 27.08.2012 16:04, Janne Snabb wrote:

    The website looks promising. They are using a thing called SPL, the Sun/Solaris Porting Layer, to be able to use the Solaris ZFS code. But there is no more OpenSolaris, is there? Does that mean they have to stay with the ZFS code from when it was still open?

  • On 27.08.2012 18:04, Les Mikesell wrote:

    Rsync is of no use for us. We have mainly big Domino .nsf files which change only slightly, so rsync would not be able to make many hardlinks. :)

  • On 27.08.2012 22:55, Adam Tauno Williams wrote:

    I have read the PDF and one thing strikes me:
    --io-chunk-size

    and later:
    ● Memory
    ● 2GB allocation OK for:
    ● 200GB@4KB chunks
    ● 6TB@128KB chunks

    So 32TB of data at 128KB chunks requires 8GB of RAM, and 1TB at 4KB chunks needs the same 8GB.

    We are using ESXi 5 in a SAN environment, right now with a 2TB backup volume. You are right, 16GB of RAM is still a lot… And why a 4KB chunk size for VMDKs?
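
    Those figures work out to roughly 30-40 bytes of index memory per chunk. A quick shell sanity check (the ~32 bytes/chunk constant is inferred from the numbers quoted above, not taken from the SDFS documentation):

        # 32 TB at 128 KB chunks:
        echo $(( (32 * 1024**4) / (128 * 1024) ))    # 268435456 chunks
        echo $(( 268435456 * 32 / 1024**3 ))         # ~8 GB at ~32 bytes per chunk

        # 1 TB at 4 KB chunks gives the same chunk count, hence the same ~8 GB
        echo $(( 1024**4 / (4 * 1024) ))             # 268435456 chunks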

  • Sorry for the top posting. Dedup is just hype. After a while the table that manages the deduped data will just be too big. Don't use it for the long term.

    Sent from Samsung Galaxy ^^

  • OpenSolaris spawned illumos (the kernel) and OpenIndiana (a complete OS
    based on illumos and OpenSolaris), as well as some other illumos-based distributions like Nexenta.

  • So you need block-level dedup? Good luck with that. I have never seen a scheme yet that wasn't full of issues or didn't have really bad performance.

  • On 28.08.2012 at 10:03, Rainer Traut wrote:

    Can this endeavor ensure the consistency of these “database” files?

  • Rdiff-backup might work for this since it stores deltas. Are you doing something to snapshot the filesystem during the copy, or are these just growing logs where consistency doesn’t matter?

    I’d probably look at FreeBSD with ZFS on a machine with a boatload of RAM if I needed dedup in the filesystem right now. Or put together some scripts that would copy and split the large files into chunks in a directory and let BackupPC take it from there.
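
    A hedged sketch of that split-into-chunks idea, so BackupPC's file-level pooling can reuse the pieces that did not change (paths, chunk size and layout are assumptions):

        #!/bin/bash
        # split each large .nsf into fixed-size pieces in a staging tree that
        # BackupPC backs up; identical pieces across runs dedup via its pool
        SRC=/mnt/snap/data
        STAGE=/backup/staging
        for f in "$SRC"/*.nsf; do
            name=$(basename "$f")
            mkdir -p "$STAGE/$name"
            split -b 4M -d "$f" "$STAGE/$name/chunk."
        done
        # caveat: an insertion near the start of a file shifts every later chunk,
        # so fixed-size splitting mainly helps with in-place updates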

  • If there is a command-line way to generate an incremental backup file, BackupPC could run it via SSH as a pre-backup command.

  • On 28.08.2012 21:26, Les Mikesell wrote:

    Yes, there is commercial software to do incremental backups, but I do not know of command-line options to do this. Does anyone?

    Les is right: I stop the server, take the snapshot, start the server, and run the xdelta on the snapshot NSF files. That minimal downtime is OK and acknowledged by the customer.
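
    Sketched as a script, that workflow looks roughly like this (this assumes an LVM snapshot; the actual snapshot mechanism, volume names, mount point and Domino init script may differ):

        #!/bin/bash
        # stop Domino only long enough to take a consistent snapshot
        /etc/init.d/domino stop
        lvcreate --snapshot --size 10G --name nsf_snap /dev/vg_data/lv_domino
        /etc/init.d/domino start

        # mount the snapshot read-only and run the xdelta step (as sketched near
        # the top of the thread) against the frozen .nsf files
        mount -o ro /dev/vg_data/nsf_snap /mnt/snap
        # ... xdelta3 -e -s <baseline> <file> <delta> for each .nsf ...
        umount /mnt/snap
        lvremove -f /dev/vg_data/nsf_snap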

  • I found some more stuff on an IBM site talking about the API (it has to be called from software, not the command line) to generate and keep track of transaction log files, which the backup software then archives. Nothing about dedup, though.

  • The better option for ZFS would be to get an SSD and move the dedupe table onto that drive instead of having it in RAM, because it can become massive. (A command sketch follows below.)

    Thank you,

    Ryan Palamara ZAIS Group, LLC
    2 Bridge Avenue, Suite 322
    Red Bank, New Jersey 07701
    Phone: (732) 450-7444
    Ryan.palamara@zaisgroup.com

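    In ZFS terms that means adding the SSD as a cache (L2ARC) device so the dedup table can spill to it instead of living entirely in RAM. A one-line sketch, with pool and device names as assumptions:

        # add an SSD as an L2ARC cache device to pool "backup"
        zpool add backup cache /dev/disk/by-id/ata-EXAMPLE-SSD
        zpool status backup      # the SSD now shows up under "cache"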

  • It depends on the size of the data that you are storing and the block size. Here is a good primer on it: http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

    As a quick estimate, about 5GB of SSD per 1TB of storage. However, I believe that you would need even more RAM, since only 1/4 of the RAM will be used for the dedupe table with ZFS. (A back-of-the-envelope check of that estimate follows below.)

    Thank you,

    Ryan Palamara ZAIS Group, LLC
    2 Bridge Avenue, Suite 322
    Red Bank, New Jersey 07701
    Phone: (732) 450-7444
    Ryan.palamara@zaisgroup.com

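    For what it is worth, the 5GB-per-1TB estimate is consistent with the commonly quoted figure of roughly 320 bytes per dedup-table entry at an average block size of around 64KB. A quick check (both constants are rules of thumb, not measurements):

        echo $(( 1024**4 / (64 * 1024) ))       # 1 TB at ~64 KB blocks -> 16777216 entries
        echo $(( 16777216 * 320 / 1024**2 ))    # x ~320 bytes/entry -> 5120 MB, i.e. ~5 GB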

  • At our shop we have used quadstor (http://www.quadstor.com) with a good amount of success, but our use is specifically for VMware environments over a SAN. However, it is possible (I have tried this a couple of times) to use the quadstor virtual disks as local block devices, format them with ext4 or btrfs etc., and get the benefits of deduplication, compression and so on. Yes, btrfs deduplication is possible :-), I have tried it. You might need to check the memory requirements for NAS/local filesystems. We use 8 GB in our SAN box and so far things are fine.

    – jb

