Mbox Files – Can They Be “compacted”?


I don’t know of a tool to compact them, but I would consider converting them to Maildir. They won’t take up less space that way, but handling them will be easier.

– Chris
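
For anyone who wants to try the conversion, here is a minimal sketch using Python's standard mailbox module. It is not a tool anyone in this thread mentioned, just one way it could be done. The function name mbox_to_maildir and the paths are made up for illustration, and the sketch assumes the mbox file is writable so it can be locked while it is read.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    src.lock()  # keep delivery agents from appending while we read
    try:
        for message in src:
            # Wrapping in MaildirMessage carries over mbox status flags
            # (read, replied, ...) to their Maildir equivalents.
            dest.add(mailbox.MaildirMessage(message))
            copied += 1
    finally:
        src.unlock()
        src.close()
        dest.close()
    return copied
```

The original mbox file is left untouched, so it can be removed by hand once the copy has been checked.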

11 thoughts on - Mbox Files – Can They Be “compacted”?

  • In the context of the OP, when mutt tries to deal with a message (e.g., deleting, moving to a folder), it can be boatloads faster, since handling the message works on a small file which contains just that message. Deleting a message from an mbox mailbox, for example, requires rewriting the entire changed mbox file to disk (minus the deleted message). Deleting a message from a Maildir mailbox is just removing one file from a directory.

    –keith
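
The difference keith describes can be seen with Python's standard mailbox module. This is only an illustrative sketch, not what mutt does internally; the two function names are hypothetical, and both assume the mailbox holds at least one message.

```python
import mailbox


def delete_first_from_mbox(mbox_path):
    """Delete the first message of an mbox; return how many remain."""
    box = mailbox.mbox(mbox_path, create=False)
    box.lock()
    try:
        box.remove(box.keys()[0])
        # flush() writes all remaining messages to a temporary file and
        # swaps it in for the original, so the whole mailbox is rewritten.
        box.flush()
        return len(box)
    finally:
        box.unlock()
        box.close()


def delete_first_from_maildir(maildir_path):
    """Delete the first message of a Maildir; return how many remain."""
    box = mailbox.Maildir(maildir_path, create=False)
    # remove() unlinks the single file holding that message.
    box.remove(box.keys()[0])
    return len(box)
```

The cost of the mbox version grows with the size of the whole mailbox, while the Maildir version touches one file regardless of how many messages are stored.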

  • HOWEVER. When a directory grows too large, the OS can take a long time to seek through the directory, which can cause its own set of problems. And this makes cleaning out a maildir directory selectively a real pain. Maildir really could do with a hashing mechanism.

    –Russell

  • some file systems are better at this than others… like, xfs does quite well with 1000s of small files in a directory.

    I wonder what thunderbird uses? I have 12000 messages in my ‘CentOS’ folder, 24720 in another folder, yet it seems quite snappy to find and delete individual messages.

  • We have been using Maildir with courier-imap for decades, and haven’t had an issue with this. My security folder typically holds 25,000+ messages from the last 7 days, and accessing it either with IMAP or directly with mutt isn’t a problem.

    I have written various scripts over the years to convert from various mail storage formats, ranging from SCO’s horrible ctrl-a delimited format through the U.W. IMAP format, as well as ones that query other IMAP servers to convert their folder structures to local Maildir.

    Maildir is generally very easy to handle with standard *nix command line tools. We have moved mail servers with tens of thousands of email customers for some regional ISPs by rsync’ing from the old server to the new one to get the bulk of the mail across before cutting over to the new machine. Then we shut the old server down, change the DNS to point to the new one, and finally do a new rsync --delete to update the new machine. There’s a period where some deleted messages may reappear in the client’s mailbox before the final rsync is complete, but all new messages appear immediately.

    Bill

  • As some have noted, modern filesystems are better at this than ones such as ext2. However, even in the best of cases, there are still situations where maildirs with a lot of messages are awkward to handle. Specifically, if you’re trying to find specific messages based on criteria that are not easily discernible from the inode, for example, things with attachments. The awkwardness comes from the fact that there is a maximum size for a command’s argument list, so a * that expands to hundreds of thousands of filenames fails with “Argument list too long”. You have to use a bit more script-fu, such as find, etc.

    Even if there aren’t huge issues with doing this, it’s an easily fixed thing. Allowing directories to have hundreds of thousands of entries as a matter of course, even if it’s something that causes no issues in many cases, to me is an architectural issue.

    But then, I noticed my beard is starting to turn grey the other day, so maybe I should just get out the COBOL and tell everyone how we did it when I was a kid.

    –Russell
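
As a sketch of the kind of script-fu Russell means, the following walks a Maildir one entry at a time, so no argument list is ever built. It is an illustration only: the function name messages_with_attachments is made up, and "has an attachment" is assumed here to mean that some MIME part carries a Content-Disposition of attachment.

```python
import email
import os


def messages_with_attachments(maildir_path):
    """Yield paths of messages in new/ and cur/ that have an attachment."""
    for subdir in ("new", "cur"):
        directory = os.path.join(maildir_path, subdir)
        # scandir streams directory entries instead of expanding a glob,
        # so the number of messages does not hit any argument-size limit.
        with os.scandir(directory) as entries:
            for entry in entries:
                try:
                    if not entry.is_file():
                        continue
                    with open(entry.path, "rb") as fh:
                        msg = email.message_from_binary_file(fh)
                except FileNotFoundError:
                    # A mail client moved or deleted the file meanwhile.
                    continue
                if any(part.get_content_disposition() == "attachment"
                       for part in msg.walk()):
                    yield entry.path
```

Every message still has to be opened and parsed, so this stays slow on a very large folder; it only removes the "can’t use *" obstacle.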

  • Even if modern systems sort-of handle it, it still seems like a bad thing to do when you consider that opening a file for writing has to atomically decide whether that name already exists before creating it, so other concurrent create/delete operations have to be blocked.

  • the better file systems (xfs, zfs, ntfs at least) use a b-tree directory structure, so finding a filename out of 10s of 1000s is very little overhead.

  • This will be bad with an mbox mailbox too. Actually it’ll be worse, because it’ll be too hard to tell which message the grep hits.

    –keith

  • Worse, if the dir gets too big, even after files are deleted it can be very slow. I had one case with >1,000,000 messages in a single maildir (spam on steroids, was getting 80,000 messages per hour overnight); after it was cleaned out to <1,000 messages it still took several minutes to ls the dir, and the machine’s responsiveness went through the floor. Copying to a new dir and renaming fixed the slowdown; the directory was >50MB (the directory itself, not its contents).

    I’d rather have mbox for plain text e-mail storage, and a database for something really high performance.
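
The "new dir and rename" fix from the last comment could be sketched roughly as below. This is a guess at the procedure, not the commenter's actual commands: the function name rebuild_directory and the .rebuild/.old suffixes are hypothetical, the files are moved with rename on the same filesystem rather than copied, and it assumes mail delivery and clients are stopped while it runs. Ownership of the new directory is not handled here.

```python
import os
import shutil


def rebuild_directory(path):
    """Move all entries into a fresh directory that takes over the old name.

    Returns the number of entries moved.
    """
    path = os.path.abspath(path)
    fresh = path + ".rebuild"  # hypothetical temporary name
    stale = path + ".old"      # hypothetical name for the bloated directory
    os.mkdir(fresh)            # fails if a previous run left it behind
    shutil.copystat(path, fresh)  # carry over permission bits and times
    with os.scandir(path) as entries:
        names = [entry.name for entry in entries]
    for name in names:
        os.rename(os.path.join(path, name), os.path.join(fresh, name))
    os.rename(path, stale)
    os.rename(fresh, path)
    # rmdir refuses to delete a non-empty directory, so anything that
    # arrived during the swap is left in place rather than lost.
    os.rmdir(stale)
    return len(names)
```

For a Maildir this would be run on the bloated subdirectory itself (for example its new or cur directory), since that is the directory whose own size has grown.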