Saturday, June 10, 2006

Bloom filters and Smart Servers

I just got back from a joint Mercurial / Bazaar meetup. One of the neat things about it was the bloom filter collaboration. Bryan O'Sullivan described bloom filters to us as a mechanism for greylisting. They're a way of cheaply storing large sets of keywords, with the tradeoff that testing whether a keyword is in a set may produce false positives.

Later on, they were discussing their smart server implementation. This is especially interesting to the Bazaar crew, because we are now planning to create our own smart server.

One of the tricky things about smart servers is that they must determine which revisions the server has that are not present on the client side. Mercurial currently does this using a binary search, but this can have up to 15 roundtrips, and roundtrips in network protocols can slow them down dramatically.

It came to me that bloom filters could be applied to this problem, because they allow a set to be represented compactly, and the revisions in a repository can be treated as a set.

Robert Collins stated that bloom filters can also be inverted, so that instead of the risk of false positives, you have a risk of false negatives. This is useful if the receiver sends its filter to the sender, and the sender decides which revisions to send. That way, the sender can never send revisions that depend on unknown revisions. So far, we haven't found a description of how to invert a bloom filter to produce false negatives, however.

If the sender sends its filter, then the receiver decides which tips the two machines have in common, and so false positives don't produce a risk of sending revisions that depend on unknown revisions.

Matt Mackell investigated the possibility of using blooms with Mercurial's smart server, and concluded that a bloom filter of a million revisions with a 1% error rate would take about a megabyte of data-- too much to send all at once. I pointed out that we could accept a higher error rate for this data, because revisions in a set have strong correspondances: if a revision is in a set, then its direct ancestor should also be in the set. Using this rule, we can eliminate matches where the direct ancestor is not in the set, reducing the error rate to 1% again. Unfortunately, a bloom filter with a 10% error rate is still 500k or so.

However, it's worth noting that it's rarely necessary to send a million revisions; instead, a bloom filter of a smaller number of revisions can be sent, since repositories are usually very similar.