Thursday, September 17, 2009

(Temporary) Epic Fail

Changing MiniG mail grouping is more tricky than expected.

The previous algorithm worked like this :
- read the subject
- compute its root, "Fwd: Re: Hello sent the 2009-08-07" becomes "hello sent the xxxx-xx-xx"
- simply use an equality comparison on the subject roots.

First version with real threads used an hacked version of subject grouping. It was just a hack on the "equality comparison". Instead of comparing suject roots, it used lists of Message-ID headers and was doing comparisons on the differences between those 2. It worked. For mailboxes < 1000 messages. Sylvain's 100k mails Junk folder took more than 1hour to process. We needed a linear algorithm to do the thread grouping.

The new algorithm works on other headers : Message-ID and In-Reply-To.

It works this way. We maintain a ThreadRoot list which holds known conversations with the message-id's in them

When we detect that messages are added/changed or deleted to a folder, we first process removals. When a ThreadRoot has no more messages, we mark it as dead and flag it for removal.

Then we process updates, this is the tricky part.

We have 2 lists :
- List<ThreadRoot>, all the known thread roots not flagged as dead.
- a List<RawMessage> called leafCandidates where we have all the not yet processed messages.

We then run :


unmerged = 0;
merged = -1;
while (merged < unmerged) {
unmerged = leafCandidates.size();
doMerge();
merged = leafCandidates.size();
}


Then doMerge() does all the job :


for (RawMessage r : leafCandidates) {
if rootsIds.contains(r.messageId) {
flagUpdate(threadRoot, r);
} else if (r.inReplyTo == null) {
createThreadRoot(r);
} else {
ThreadRoot tr = rootIds.get(r.inReplyTo);
if (tr) {
tr.merge(r);
} else {
// corner cases, most problems are not here
// exemple : a mail with an In-Reply-To and the father mail is deleted
}
}
}


This algorithm is mostly linear... Its real complexity depends X, Y, Z where X is how much new mail you receive, Y how much you receive replies to existing emails and Z, how much the user changes flags on existing emails. It can easily process Sylvain 100k spams in 20sec, so let's assume the complixity is OK.

Most MiniG problems on our trunk version are in the "flagUpdate(threadRoot, r)" process. This creates all kind of strange bugs : conversations that still look read, mark as read/unread that only work "sometime".

Flags are still broken, but the new algorith is pretty promising. The load on our test & production mail server is lighter, really lighter. Some tuning of the last corner cases will make it a very good change. In fact the new code while still buggy is so fast that it exposed race conditions in the indexing code ;-)

No comments: