"mbox" is a family of several mutually incompatible mailbox formats (ntlworld.com)
77 points by gnosis on Jan 7, 2012 | 36 comments



Interesting URL there. A dot after the tld, and hosted on ntlworld, which was once the UK's most hated ISP (now owned by Virgin and hated very slightly less).


> A dot after the tld

That's actually how it's supposed to be done:

http://www.dns-sd.org/TrailingDotsInDomainNames.html

"It's a little-known fact, but fully-qualified (unambiguous) DNS domain names have a dot at the end. People running DNS servers usually know this (if you miss the trailing dots out, your DNS configuration is unlikely to work) but the general public usually doesn't. A domain name that doesn't have a dot at the end is not fully-qualified and is potentially ambiguous. This was documented in the DNS specification, RFC 1034, way back in 1987 ..."

but the site works just fine without the trailing dot as well.


> but the site works just fine without the trailing dot as well.

In your current environment! And probably in all the environments connected to the Internet. But without the dot it would potentially be perfectly valid for this to route elsewhere in someone's environment. :)
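To make the ambiguity concrete, here is a rough Python sketch of how a resolver with a search list might expand a name. It is heavily simplified: real resolvers also apply rules such as the `ndots` threshold, so the order shown is only illustrative. The function name and the `corp.example` search domain are made up for the example.

```python
def candidate_names(name, search_domains):
    """Return fully-qualified names a stub resolver could try for `name`.

    Simplified on purpose: real resolvers apply extra rules (for example
    the ``ndots`` threshold), so the order here is only illustrative.
    """
    if name.endswith("."):
        # Trailing dot: the name is already fully qualified, nothing is appended.
        return [name]
    # No trailing dot: the name is relative and may be completed from the search list.
    candidates = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]
    candidates.append(name + ".")  # the name taken as if it were absolute
    return candidates


# "corp.example" is a hypothetical local search domain.
print(candidate_names("homepage.ntlworld.com.", ["corp.example"]))
print(candidate_names("homepage.ntlworld.com", ["corp.example"]))
```

With the trailing dot there is exactly one candidate. Without it, an environment that happens to serve `homepage.ntlworld.com.corp.example.` could legitimately answer first.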


Nitpick: "resolve elsewhere," not "route elsewhere"


Upvote for correction. Thanks!


The dot after the tld is optional (if you don't need to specify a fully-qualified DNS name). Try for example: http://news.ycombinator.com./item?id=3438428

EDIT: I learned something new today, adding the trailing dot can be a good idea:

http://homepage.ntlworld.com./jonathan.deboynepollard/FGA/we...


https://news.ycombinator.com./

Chrome doesn't accept the certificate when the trailing dot is in the request.

I bet if you configured your site to redirect to a URL with a trailing dot, it would break a lot of badly written robots. I'd be interested to hear about how it affects the major search engines before I'd want to implement it on any important sites.


A side question: what's the best way to build an on-disk email message store that holds millions of messages (with tens of thousands more added each day on average) so that it can support:

1) Fast lookup of messages sent/received by a particular email address, time, tag, priority, and a handful of other properties

and

2) Appending thousands of new messages per second at peak?


Maildir (or a very similar one-message-per-file store) plus an index of some kind.

If not for the continuous addition of messages, I'd suggest notmuch (http://notmuchmail.org/), which does a great job of indexing in any way you could want. Unfortunately, as far as I know notmuch does not have an "online" indexing mechanism; someone wrote a notmuch-deliver that adds a message and indexes it, but I doubt it would meet your performance requirements. You could probably make notmuch do what you want, but you might want a new indexing mechanism instead.

I still think you want something very similar to Maildir underneath though.
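For what it's worth, a minimal sketch of "Maildir plus an index" using only Python's standard library (`mailbox.Maildir` and `sqlite3`). The table layout and the helper names `open_store`, `deliver` and `keys_from` are assumptions for illustration, not anything notmuch does, and committing once per message as shown here would not reach thousands of messages per second.

```python
import mailbox
import os
import sqlite3
import tempfile
from email.message import EmailMessage
from email.utils import parseaddr, parsedate_to_datetime


def open_store(root):
    """Create or open a Maildir at `root` and a SQLite index next to it."""
    box = mailbox.Maildir(root, create=True)
    index = sqlite3.connect(root + ".sqlite")
    index.execute(
        "CREATE TABLE IF NOT EXISTS msg (key TEXT PRIMARY KEY, sender TEXT, sent_at REAL)"
    )
    index.execute("CREATE INDEX IF NOT EXISTS msg_sender ON msg (sender, sent_at)")
    return box, index


def deliver(box, index, message):
    """Store one message as its own file, then record its metadata in the index."""
    key = box.add(message)  # written under tmp/, then moved into new/
    sender = parseaddr(message["From"])[1].lower()
    sent_at = parsedate_to_datetime(message["Date"]).timestamp()
    index.execute("INSERT INTO msg VALUES (?, ?, ?)", (key, sender, sent_at))
    index.commit()
    return key


def keys_from(index, sender):
    """Look up message keys by sender without opening any message file."""
    rows = index.execute(
        "SELECT key FROM msg WHERE sender = ? ORDER BY sent_at", (sender.lower(),)
    )
    return [row[0] for row in rows]


# Demo in a throwaway directory; the addresses are placeholders.
root = os.path.join(tempfile.mkdtemp(), "store")
box, index = open_store(root)
msg = EmailMessage()
msg["From"] = "Alice <alice@example.org>"
msg["To"] = "bob@example.org"
msg["Date"] = "Sat, 07 Jan 2012 12:00:00 +0000"
msg["Subject"] = "hello"
msg.set_content("Hi Bob")
key = deliver(box, index, msg)
print(keys_from(index, "alice@example.org") == [key])
```

The point is only that lookups hit the index and the message file is opened by key when someone actually reads it. Batching many inserts per transaction would be the obvious first step toward higher ingest rates.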


I was using the one-message-per-file approach, but keeping millions of small files in a filesystem is quite a challenge: I have to worry about available inodes, and listing/manipulating these files takes minutes!


On a relatively small ext4 filesystem here, I have tens of millions of inodes available, and I haven't done anything to tweak that. On a larger filesystem, with a slightly bumped number of inodes, I'd expect several hundred million inodes, more than enough to store all the messages that'll fit in the available space.

As for listing and manipulating, I'd suggest not ever listing all the files. Access them by name only, which should require near-constant time no matter how many files you have. Manipulating the files should take similarly little time. On the occasions that you do need to list all the files, you might find https://www.olark.com/spw/2011/08/you-can-list-a-directory-w... interesting.

A pile of RAM helps, to keep as many directory entries in the dcache as possible.
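If you want to check the inode numbers on your own filesystem, a small sketch with `os.statvfs` (POSIX only) does it. The helper name `inode_usage` is made up here, and some filesystems report 0 for the total because they allocate inodes dynamically.

```python
import os


def inode_usage(path="/"):
    """Return (total, free) inode counts for the filesystem containing `path`."""
    stats = os.statvfs(path)  # POSIX only
    return stats.f_files, stats.f_ffree


total, free = inode_usage("/")
print(f"{free} of {total} inodes free")
```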


Email can use up inodes quickly. A thousand users with a thousand emails each is a million files in Maildir. Plus three directories (cur/new/tmp) for each mail folder.


Folders seem unlikely to use up a significant number of inodes compared to the messages. As for messages, how many users do you expect to serve from each mail server? I think you'll hit the limits of whatever service you want to provide before you hit the limit of inodes in your mail store.

(Or you could always encrypt each email and throw it at an object storage system similar to S3. ;) )


I did millions upon millions of small e-mails using Maildir in a filesystem in 2000 (accounts for 1.7 million users on a single filesystem). It's only a challenge if you pick a filesystem with pathological behaviour for directories with many files and/or with small and hard limits for number of inodes (in our case we used ReiserFS 3).

The biggest problem is read bandwidth for your disks. Write IO is likely to be constrained by available network bandwidth unless you do something stupid. Read bandwidth constraints for e-mail are generally easily mitigated by creating an index (so you only need to open or stat the message files when opening individual messages, not to, say, list message sizes in a POP3 server or present an index view on a web interface).

If you need to list/manipulate those files other than writing them once and reading them when a user actually opens that specific message, or deleting them when the user actually deletes the message, then you're doing something wrong (e.g. you didn't create a sufficiently flexible index).


I think the best system would be one in between mbox and maildir. You specify a file size limit of, say, 10MB. Emails are repeatedly appended to the same file until it hits 10MB, then a new file is created. Emails larger than 10MB would not be split over multiple files. An index is kept of all of the messages. When a message is deleted, rather than immediately removing it from the file and blocking, as most mbox implementations do, it should just be flagged as deleted in the index. A background process can clean up these messages when the file is not in use.
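A rough Python sketch of that idea, under a few assumptions: the class name `SegmentStore` and its file naming are invented, the index lives only in memory, the store starts from an empty directory, and a file rolls over when the next message would push it past the limit. The background clean-up pass is left out.

```python
import os
import tempfile


class SegmentStore:
    """Append messages to size-capped segment files and track them in an index."""

    def __init__(self, root, max_segment_bytes=10 * 1024 * 1024):
        self.root = root
        self.max_segment_bytes = max_segment_bytes
        self.segment = 0
        self.next_id = 0
        self.index = {}  # message id -> [segment number, offset, length, deleted flag]
        os.makedirs(root, exist_ok=True)

    def _path(self, segment):
        return os.path.join(self.root, f"segment-{segment:06d}.dat")

    def append(self, data):
        path = self._path(self.segment)
        size = os.path.getsize(path) if os.path.exists(path) else 0
        # Roll over once the current file would exceed the cap. A message is
        # never split, so an oversized one simply gets a segment to itself.
        if size > 0 and size + len(data) > self.max_segment_bytes:
            self.segment += 1
            path, size = self._path(self.segment), 0
        with open(path, "ab") as f:
            f.write(data)
        msg_id = self.next_id
        self.next_id += 1
        self.index[msg_id] = [self.segment, size, len(data), False]
        return msg_id

    def read(self, msg_id):
        segment, offset, length, deleted = self.index[msg_id]
        if deleted:
            raise KeyError(msg_id)
        with open(self._path(segment), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def delete(self, msg_id):
        # Only flip a flag; the bytes stay in place until a compaction pass.
        self.index[msg_id][3] = True


# Demo with a tiny 10-byte cap so the rollover is visible.
store = SegmentStore(tempfile.mkdtemp(), max_segment_bytes=10)
a = store.append(b"12345678")
b = store.append(b"abcdef")  # would exceed the cap, so it starts segment 1
c = store.append(b"xy")      # still fits in segment 1
store.delete(a)
print(store.index[b][0], store.index[c][:2], store.read(c))
```

Deleting is cheap because nothing is rewritten. The cost moves to whatever eventually compacts the segments and persists the index.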


Seems like in such a system you've gone to a lot of work to recreate the purpose of a filesystem: to keep the data of different files separate from each other. Why do you want to keep emails together in 10MB chunks?


To address the limited inode issue


What you describe is essentially a database for email. You might have a look at http://www.dbmail.org/


This is true. Maildir and mbox are also essentially databases for email.


OK. Maybe I should have said DBMS? I'm speaking specifically about the management of the back-end storage files, e.g. splitting storage over multiple size-limited files, tracking and re-using free space, etc. are all things that DBMSs provide, without a need to reinvent such things for the purposes of storing email.

If there's a better term to describe this I'd be glad to learn it.


There are some systems based on the Xapian full-text indexer, including my own 'mu'[1], which might be able to do this. It expects the messages in a Maildir, and it can handle more than 150 messages per second on fairly slow hardware, so after a peak of thousands of new messages it would take a few seconds to catch up. Tens of thousands per day would be no problem. People have been using [1] with a few million messages without any problem, but I'm not sure how well it scales beyond that.

[1] http://www.djcbsoftware.nl/code/mu, or 'maildir-utils' in debian/ubuntu


If you only ever want to append messages, and never want to mutate or remove them, I would do this scheme in two parts: 1) an append-only message store; 2) a set of indices stored alongside (1). To append thousands of messages per second, I would probably want a system with multiple writers appending to multiple files (mbox) or directories (maildir), and cycling/archiving the mailboxes as they get large.

As for (2), I know a lot less about indexing messages than I do about storing them, but I would try out an off-the-shelf SQL or NoSQL system and measure the performance before trying to build anything fancy.


You could look at http://notmuchmail.org/ which builds on top of maildir.


See also JWZ's article on the "Content-Length" header (mboxcl and mboxcl2 formats from the original article)

http://www.jwz.org/doc/content-length.html


One way to get around the different escaping rules between mboxo and mboxrd is to insert a space (0x20) at the beginning of any line that could be misinterpreted. This eliminates the need for any further escaping. Leading whitespace is almost always ignored by mail clients when displaying the message, especially when the text is flowed.

I've received e-mails that were formatted this way, though I'm not sure exactly which software, whether client or server, is responsible for doing it.
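For anyone who hasn't run into the difference, a small Python sketch of the two quoting rules as they are usually described (mboxo quotes only bare "From " lines, mboxrd also re-quotes lines that are already quoted), next to the leading-space variant. The function names are made up and this works on a message body as a plain string, nothing more.

```python
import re

FROM_LINE = re.compile(r"^From ")            # what mboxo quotes
QUOTED_FROM_LINE = re.compile(r"^>*From ")   # what mboxrd quotes
REQUOTED_FROM_LINE = re.compile(r"^>+From ")  # what an mboxrd reader unquotes


def _map_lines(body, transform):
    return "\n".join(transform(line) for line in body.split("\n"))


def escape_mboxo(body):
    return _map_lines(body, lambda l: ">" + l if FROM_LINE.match(l) else l)


def escape_mboxrd(body):
    return _map_lines(body, lambda l: ">" + l if QUOTED_FROM_LINE.match(l) else l)


def unescape_mboxrd(body):
    # Strip exactly one ">" from every quoted From line; this is reversible.
    return _map_lines(body, lambda l: l[1:] if REQUOTED_FROM_LINE.match(l) else l)


def escape_leading_space(body):
    # The variant described above: prefix a space instead of ">".
    return _map_lines(body, lambda l: " " + l if FROM_LINE.match(l) else l)


body = "From here on it gets tricky\n>From a quoted reply"
print(escape_mboxo(body))   # both lines now start with ">From ", so the original is ambiguous
print(escape_mboxrd(body))  # the second line becomes ">>From ", so it can be undone
print(unescape_mboxrd(escape_mboxrd(body)) == body)
print(escape_leading_space(body))
```

The leading-space form never needs a second level of quoting, at the price of changing the body in a way no reader will undo.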


Now you have n+1 standards to keep track of.

As a side effect, this breaks every reader that is unaware of it, and it breaks binary content in mail (many, many messages) as well as languages or encodings that are space-sensitive.


Binary content and non-ASCII characters are almost always base64-encoded, and base64 never produces a From_ sequence (because of the space).

Anyway, I'm not endorsing the leading-space technique. I'm just observing that it's out there in the wild.
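A quick way to convince yourself of the base64 point, as a Python sketch with a made-up payload: the encoded text contains no space at all, so no encoded line can begin with "From ".

```python
import base64

# A body that is full of "From " lines before encoding.
payload = b"From the start\nFrom here to there\n" * 50
encoded = base64.encodebytes(payload).decode("ascii")

# The base64 alphabet is A-Z, a-z, 0-9, "+" and "/", plus "=" padding: no space.
print(" " in encoded)                                                   # False
print(any(line.startswith("From ") for line in encoded.splitlines()))  # False
```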


Another wrinkle: some mailers stuff full 8-bit chars into mbox which will choke some parsers and end parsing prematurely (most expect 7-bit ASCII).

But it beats mdir handily in one way: if you're dealing with mailboxes with 100s of thousands of messages it is way more efficient.


Way more efficient along which dimension? Space, or processing speed? By roughly how much? Enough to be worth the costs in potential data loss?


Efficient use of inodes, perhaps?


Each mdir email is a file, so depending on your OS it consumes disk fragments very quickly and relies on your OS file handlers to allocate and load messages. mbox as a mapped file and a good index blows it away in speed and space usage.


Very true, but reclaiming space in the middle of a huge mailbox is dramatically worse in the mbox case.

Doing time-machine style rsync backups is also dramatically more space efficient in the maildir case.


Why dramatically more? rsync does a perfectly fine job transmitting and storing just the diffs on an mbox file.


With the time-machine style backups, every time you append an email to your mbox the whole thing gets backed up. Yes, the rsync protocol makes the wire transmission efficient (which only really matters if your backup server is remote), but the fact that your 2Gig mbox file gets backed up in full every night instead of hard-linked means it's not space-efficient on your backup disk.

A maildir, on the other hand, works nicely with this type of backup. All the old mails never change, so they get hard-linked and only the new mails take up any new space on the backup.
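Here is a toy Python sketch of that hard-link scheme (the same idea as `rsync --link-dest`, though this version compares file contents rather than size and mtime, and only handles a flat directory). The `snapshot` helper and the file names are invented for the demo.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, previous, current):
    """Snapshot flat directory `source` into `current`.

    Files unchanged since the `previous` snapshot are hard-linked.
    Returns the number of bytes that needed fresh storage.
    """
    os.makedirs(current)
    new_bytes = 0
    for name in sorted(os.listdir(source)):
        src = os.path.join(source, name)
        dst = os.path.join(current, name)
        old = os.path.join(previous, name) if previous else None
        if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
            os.link(old, dst)       # unchanged: share the existing copy
        else:
            shutil.copy2(src, dst)  # new or changed: store the whole file again
            new_bytes += os.path.getsize(dst)
    return new_bytes


work = tempfile.mkdtemp()
maildir = os.path.join(work, "maildir")
mbox = os.path.join(work, "mbox")
os.makedirs(maildir)
os.makedirs(mbox)

# Three 1000-byte messages: one file each versus one combined file.
for i in range(3):
    with open(os.path.join(maildir, f"msg{i}"), "wb") as f:
        f.write(b"x" * 1000)
with open(os.path.join(mbox, "inbox"), "wb") as f:
    f.write(b"x" * 3000)
snapshot(maildir, None, os.path.join(work, "snap1-maildir"))
snapshot(mbox, None, os.path.join(work, "snap1-mbox"))

# One new 1000-byte message arrives in each store.
with open(os.path.join(maildir, "msg3"), "wb") as f:
    f.write(b"y" * 1000)
with open(os.path.join(mbox, "inbox"), "ab") as f:
    f.write(b"y" * 1000)

maildir_cost = snapshot(
    maildir, os.path.join(work, "snap1-maildir"), os.path.join(work, "snap2-maildir")
)
mbox_cost = snapshot(
    mbox, os.path.join(work, "snap1-mbox"), os.path.join(work, "snap2-mbox")
)
print(maildir_cost, mbox_cost)  # 1000 4000
```

The second maildir snapshot only pays for the new message, while the mbox snapshot pays for the whole file again.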


"time-machine style backup" seems to be a marketing term, not a type of backup.

Do you mean a hard link style backup? Where diffs are never stored, you just keep making hard links of the files? And if they change even one byte you store the file fresh?

There are plenty of solutions to this, the simplest being storing reverse diffs - so the latest file is stored plain, but older ones can be generated from diffs (since doing so is rare).

You can also use modern filesystems like btrfs that can do COW and block de-duplication and store only changed blocks in such a way that the file appears to be complete, but actually stores only what changed.


Yep, hence "it beats mdir handily in one way" - mdir is way more modern and flexible.



