Proposal: *100% accurate email filtering is infinitely expensive.*

*Note: I worked this out from scratch on 2012-05-09 to understand a situation at work. That being said, I am sure an analysis of this topic has been published academically before now. I generally don't have access to academic journals, so if you know of a paper that covers this issue, please post it in the comments.*

### Justification:

Let's look at the percentage of actual "real" email to spam email (R*actual*) and the percentage of email considered "real" by your filter (R*filter*) at an instantaneous point in time.

For simplicity, consider that, at any given time, your filters will either be too aggressive or not aggressive enough. That is, you will either be filtering real mail (R*filter* is above R*actual*) or you will be letting spam email through your filters (R*filter* is below R*actual*). In reality, both happen at the same time. In this simplified case, if your filter is too aggressive or too lax, you adjust it to push R*filter* closer to R*actual*.

Over time, R*actual* varies.

Knowing R*actual* implies 100% accurate filtering. Knowing R*actual* would require that the sender of each email notifies its recipients of the *real*/*spam* nature of the sent email before sending it and that such a distinction can be made. We know that spammers do not do this and we know that what recipients consider spam varies by recipient, so the best we can do is to judge emails coming into an email system and then make an adjustment to R*filter*, if needed.

This process of adjustment produces a function, R*filter*, that approximates R*actual*.

Each adjustment is done by some process. That process takes some amount of time. A company can either pay an employee to spend the time to follow that process or outsource that process to another company. In either case, making an adjustment to R*filter* costs a certain amount of money.

Over some time period, you make a certain number of adjustments to R*filter* to approximate R*actual*. To make the approximation more accurate, you need to make more adjustments to R*filter* in the same amount of time.

Since each adjustment costs a certain amount of money, the cost of that period of time grows with the density of adjustments in that time period. To get 100% accurate email filtering, that is, to make R*filter* = R*actual*, you have to make the period between adjustments equal zero. In other words, you would have to make infinite adjustments to R*filter* in a certain time period, thus making that time period infinitely expensive. This is the case *no matter how inexpensive you make the adjustment process*.

Thus, a company with finite resources can never have 100% accurate spam email filtering unless every sender announces ahead of time the *spam*/*real* nature of their email *and* such a distinction can be made.

Note that R*actual* is actually non-continuous, since its value only changes at the arrival of each new email. For any more than a trivial amount of arriving email, the time delta between email arrivals is small enough that we can consider it continuous.

Although you cannot achieve 100% accuracy, you can drive up the accuracy of R*filter* by driving down the cost of each adjustment. Lower cost adjustments can be made more frequently for the same amount of money per time period. Bayesian email filters have been remarkably effective in this regard.

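As a rough illustration of the argument, here is a minimal Python sketch. Everything in it is an assumption made for the example and not taken from a real mail system: R*actual* is a made-up sine curve, adjustments are evenly spaced over one unit time period, each adjustment resets R*filter* exactly to R*actual* at that instant, and the function names are hypothetical.

```python
import math


def r_actual(t: float) -> float:
    """Hypothetical, smoothly varying R_actual; purely illustrative."""
    return 0.5 + 0.3 * math.sin(2 * math.pi * t)


def mean_abs_error(adjustments: int, samples: int = 10_000) -> float:
    """Mean |R_filter - R_actual| over one unit time period.

    Assumes R_filter is set to R_actual at each of `adjustments` evenly
    spaced moments and then held constant until the next adjustment.
    """
    total = 0.0
    for i in range(samples):
        t = i / samples
        # Index of the most recent adjustment at time t (integer math
        # avoids floating-point rounding at interval boundaries).
        idx = (i * adjustments) // samples
        r_filter = r_actual(idx / adjustments)
        total += abs(r_filter - r_actual(t))
    return total / samples


def period_cost(adjustments: int, cost_per_adjustment: float) -> float:
    """Cost of the period: number of adjustments times cost of each one."""
    return adjustments * cost_per_adjustment


def affordable_adjustments(budget: float, cost_per_adjustment: float) -> int:
    """How many adjustments a fixed budget buys in one period."""
    return int(budget // cost_per_adjustment)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, mean_abs_error(n), period_cost(n, cost_per_adjustment=10.0))
    # Cheaper adjustments buy more of them for the same budget.
    print(affordable_adjustments(100.0, 10.0), affordable_adjustments(100.0, 0.5))
```

Under these assumptions, the measured error shrinks as the number of adjustments grows but stays above zero for each count tried, while the cost of the period rises in direct proportion. Lowering the cost per adjustment does not change that shape; it only moves you further along the curve for the same money.
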
-Adam (a0f29b982)
