Wednesday, May 9, 2012

I want 100% Accurate Spam Email Filtering. Why Can't I Have It?

Proposition: 100% accurate email filtering is infinitely expensive.

Note: I worked this out from scratch on 2012-05-09 to understand a situation at work. That being said, I am sure an analysis of this topic has been published academically before now. I generally don't have access to academic journals, so if you know of a paper that covers this issue, please post it in the comments.


Let's look at two quantities at an instantaneous point in time: the percentage of actual "real" email to spam email (Ractual) and the percentage of email considered "real" by your filter (Rfilter).

For simplicity, assume that at any given time your filter is either too aggressive or not aggressive enough. That is, you are either filtering out real mail (Rfilter is above Ractual) or letting spam through (Rfilter is below Ractual). In reality, both happen at the same time. In this simplified case, if your filter is too aggressive or too lax, you adjust it to push Rfilter closer to Ractual.

Over time, Ractual varies.

Knowing Ractual exactly is equivalent to filtering with 100% accuracy. To know it, the sender of each email would have to tell its recipients, before sending, whether the email is real or spam, and that distinction would have to be well defined. We know that spammers do not do this, and we know that what recipients consider spam varies by recipient, so the best we can do is judge emails as they come into an email system and then adjust Rfilter if needed.

This process of adjustment produces a function, Rfilter, that approximates Ractual.

Each adjustment is done by some process, and that process takes some amount of time. A company can either pay an employee to spend the time to follow that process or outsource it to another company. In either case, making an adjustment to Rfilter costs a certain amount of money.

Over some time period, you make a certain number of adjustments to Rfilter to approximate Ractual. To make the approximation more accurate, you need to make more adjustments to Rfilter in the same amount of time.

Since each adjustment costs a certain amount of money, the cost of that time period grows with the number of adjustments made in it. To get 100% accurate email filtering, that is, to make Rfilter = Ractual at all times, the period between adjustments has to shrink to zero. In other words, you would have to make infinitely many adjustments to Rfilter in that time period, which makes the period infinitely expensive. This is the case no matter how inexpensive you make each adjustment, as long as its cost is greater than zero.
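
The trade-off above can be sketched numerically. This is only an illustration under made-up assumptions: `r_actual` is a hypothetical smooth curve (not real mail data), each adjustment is idealized as setting Rfilter exactly equal to Ractual at that instant, and `cost_per_adjustment` is an arbitrary number. The point is just that the worst-case gap shrinks as adjustments get denser while the cost of the period grows in proportion.

```python
import math

def r_actual(t):
    """Hypothetical Ractual over one time period, with t in [0, 1)."""
    return 0.5 + 0.3 * math.sin(2 * math.pi * t)

def max_error(n_adjustments, samples=10_000):
    """Worst gap between Ractual and Rfilter over the period.

    Assumption: Rfilter is set equal to Ractual at n evenly spaced
    adjustment times and held constant until the next adjustment.
    """
    worst = 0.0
    for i in range(samples):
        t = i / samples
        # Time of the most recent adjustment (integer math avoids rounding slips).
        last_adjustment = (i * n_adjustments // samples) / n_adjustments
        r_filter = r_actual(last_adjustment)
        worst = max(worst, abs(r_actual(t) - r_filter))
    return worst

def period_cost(n_adjustments, cost_per_adjustment):
    """Cost of the period: every adjustment costs the same fixed amount."""
    return n_adjustments * cost_per_adjustment

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, period_cost(n, 5.0), round(max_error(n), 4))
```

In this sketch the error gets smaller with every tenfold increase in adjustments but never reaches zero for any finite number of them, while the cost keeps climbing.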

Thus, a company with finite resources can never have 100% accurate spam email filtering unless every sender announces ahead of time the spam/real nature of their email and such a distinction can be made.

Note that Ractual is not actually continuous, since it only takes on a new exact value at the arrival of each new email. For any more than a trivial amount of arriving email, the time between arrivals is small enough that we can treat it as continuous.

Although you cannot achieve 100% accuracy, you can drive up the accuracy of Rfilter by driving down the cost of each adjustment. Lower-cost adjustments can be made more frequently for the same amount of money per time period. Bayesian email filters have been remarkably effective in this regard.
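
To give a feel for why a Bayesian filter makes adjustments cheap, here is a toy sketch of a naive Bayes classifier. It is not the implementation of any particular product: the class name `TinyBayesFilter`, the whitespace tokenizer, and the add-one smoothing are all simplifying assumptions. What matters is that each "adjustment" is a single call to `train`, for example when a user marks one message as spam.

```python
import math
from collections import Counter

class TinyBayesFilter:
    """Toy naive Bayes spam filter (illustrative sketch only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "real": Counter()}
        self.message_counts = {"spam": 0, "real": 0}

    def train(self, text, label):
        """One cheap adjustment: record a message labeled 'spam' or 'real'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) using add-one smoothed word frequencies."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["real"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "real"):
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            total_words = sum(self.word_counts[label].values())
            score = math.log(prior)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # The extra +1 in the denominator reserves mass for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(min(log_score["real"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))

if __name__ == "__main__":
    f = TinyBayesFilter()
    f.train("cheap pills buy now", "spam")
    f.train("meeting agenda for monday", "real")
    print(f.spam_probability("cheap pills"))
    print(f.spam_probability("meeting agenda"))
```

Every call to `train` nudges the filter's estimate, and it costs the person doing it about one click, which is what lets the adjustments happen far more often for the same money.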

-Adam (a0f29b982)