Friday, November 17, 2006

What is Anti-Spam?

There’s a lot of argument as to which “anti-spam” techniques are legitimately so called. In this article, I’d like to consider what constitutes an anti-spam technique in an ideal sense, then consider the various practiced approaches to spam mitigation in that light, drawing conclusions as to how we should frame the “anti-spam” discussion.

Classifying Spam

For the purposes of this discussion, let “spam” refer to “unsolicited bulk email”. Not everyone agrees on this definition, but it’s by far the most widely accepted, and without a working definition we won’t be able to define “anti-spam”. Thus, an email message is spam (for our present purposes) if it meets two criteria [ref: Spamhaus Technical Definition of Spam].

1. Bulk: the recipient’s personal identity and context are irrelevant because the message is equally applicable to many other potential recipients.

2. Unsolicited: the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.

It’s important to note that both these criteria must hold for a message to be spam. Many legitimate and wanted mailing lists are “bulk” in nature, and some personal communications are not explicitly requested but desired nonetheless. Point number two does not have to hold for every single recipient: the message is spam in those instances where both points hold, and not otherwise. It follows that the exact same message can be spam for one person, and not for another.

The criteria are not highly precise. In point number one, the question of personal relevance has disputable edge cases. In point two, there may be a question as to whether a particular message was covered by the terms of the permission. From this latter observation, it follows that simple whitelists or subscriptions aren’t entirely sufficient for expressing bulk mail requests: a recipient may request bulk mail on a particular subject, for example, and justifiably consider messages spam when they stray from that subject. The question as to whether a particular message is on a particular subject also has disputable edge cases.

This lack of precision doesn’t prevent us (as human beings) from determining with some confidence whether an item is spam or not. The impersonal (or incorrectly personal) nature of bulk mail usually makes it obvious when point one holds. We can simply ask the question, “is this message personally relevant to me?” Point two is a judgment easily made from knowledge of what we have and have not requested, although disputed edge cases may require arbitration to resolve.

The criteria don’t reduce to simple, mechanically detectable conditions, however. Particularly large email providers can get a genuine idea of bulk delivery (which implies condition one) by comparing incoming messages across accounts, although this technique can be hampered by messages salted with random elements. In mechanical terms, permission can only be expressed in its most explicit form, a whitelist, and the terms of that permission can’t be anything nuanced. Unsolicited messages can be detected at “spamtrap” addresses (which solicit nothing), but we can only surmise that other recipients of the same message did not request it.
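To make the cross-account comparison concrete, here is a minimal sketch. It assumes one possible approach, word-shingle overlap (Jaccard similarity), which tolerates a small amount of random salting. The function names, the threshold, and the account count are all hypothetical choices for illustration, not a description of what any real provider does.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def looks_bulk(message, seen_elsewhere, threshold=0.5, min_accounts=3):
    """Guess whether a message is bulk.

    seen_elsewhere maps an account name to the message bodies it received.
    The message counts as bulk when near-duplicates turn up in at least
    min_accounts other accounts (both numbers are arbitrary assumptions).
    """
    target = shingles(message)
    matching_accounts = 0
    for bodies in seen_elsewhere.values():
        if any(jaccard(target, shingles(body)) >= threshold for body in bodies):
            matching_accounts += 1
    return matching_accounts >= min_accounts
```

Note that this only approximates condition one. It says nothing about whether any of the recipients requested the message, and heavier salting or per-recipient rewording would defeat it.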

Theoretically Ideal Anti-Spam

An ideal anti-spam system rejects messages which are both bulk and unsolicited, letting pass those messages which are of specific personal relevance to the recipient (not “bulk”), and those which the recipient has expressly requested (not “unsolicited”). When phrased in these terms, spam filtering is obviously a task for a well-informed intelligent agent of immense sophistication, quite beyond our current ability to construct. Anything less is a weak approximation at best.

The system described so far is ideal in the sense that it keeps spam out of a recipient’s inbox, but it says nothing of network and computing resources consumed in the process. A system that accepts all mail and then discards the portion which is spam wastes significant resources on mail that will ultimately be discarded. This is the hidden cost of spam, and it can be arbitrarily large, since it depends on how much spam other parties send to the recipient. An ideal system must address this cost: it must not only be perfectly accurate, but also perfectly efficient. In the ideal case, each incoming spam is rejected at no cost to the recipient. Only under these conditions is the system guaranteed to scale under increasing spam load.

To address this, the hypothetical intelligent agent could operate at the sender’s system, preventing unwanted data from entering the network at all. Unfortunately this seems practically untenable for several obvious reasons, not the least of which is the cost of replicating the agent at every prospective sender. But in order for the agent to operate from the recipient’s system without the waste inherent in the “accept then drop” approach, it would need to engage with each potential sender in a very light-weight protocol for determining whether a candidate message is personally relevant or requested, prior to accepting the actual text. I can’t even imagine how a protocol would meet these requirements, let alone be reliable in the face of a hostile sender. The situation seems intractable.

If an ideal anti-spam system is technically possible at all, it’s firmly in the realm of science fiction for now.

Compromise

By sheer necessity then, real “anti-spam” systems are weak approximations of the ideal, and as such they do not have to work by detecting the two key properties of spam: “bulkness” and “unsolicitedness”. Any technique that has approximately the right outcome can be considered. However, if we are going to settle on something less than perfection, we need to make compromises, and not everyone is going to be satisfied with the same set of compromises.

There are those who believe that any measure for stopping spam should have as its first goal, “allow and assist every non-spam message to reach its recipients,” [ref: Verio censored John Gilmore’s email] with the implied corollary that any technique which might block a non-spam message is unacceptable—damnable “censorship”, even. At the other extreme, there are those who believe that the only cure for spam is to banish spammers from the Internet, and that the effective means to this end is to threaten spammer-friendly networks with excommunication—and impose it where necessary, collateral damage notwithstanding.

These are ideological extremes, and never the twain shall meet. It’s important to recognise when an argument about anti-spam has made the transition from being an argument over technique to an argument over ideology. Technique can be discussed in a more or less scientific manner, using metrics like false positives and negatives, cost of computation, network bandwidth consumed, and so on. Ideological differences have no such metrics, and need to be debated separately lest they prevent any kind of agreement being reached anywhere.

Most players lie somewhere in between these ideological extremes. As such, they grudgingly adopt a certain ratio of false positives to negatives as “acceptable”, recognising that too much spam (by way of false negatives) can result in the loss of wanted mail anyhow. Everyone attempts to minimise both counts, but the acceptable ratio between the two will vary according to taste and need.
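The trade-off can be sketched in a few lines. The example below assumes a filter that assigns each message a numeric score and rejects anything at or above a threshold; the per-error costs stand in for “taste and need”. The function names, scores, and costs are all made up for illustration.

```python
def error_counts(scored, threshold):
    """Count errors for a given rejection threshold.

    scored is a list of (score, is_spam) pairs; a message is rejected
    when its score is at or above the threshold.
    """
    false_positives = sum(1 for s, spam in scored if s >= threshold and not spam)
    false_negatives = sum(1 for s, spam in scored if s < threshold and spam)
    return false_positives, false_negatives

def best_threshold(scored, fp_cost, fn_cost, candidates):
    """Pick the candidate threshold with the lowest weighted error cost."""
    def cost(threshold):
        fp, fn = error_counts(scored, threshold)
        return fp * fp_cost + fn * fn_cost
    return min(candidates, key=cost)
```

A site that regards a lost legitimate message as ten times worse than a delivered spam would pass `fp_cost=10, fn_cost=1` and end up with a more permissive threshold than a site with the opposite priorities, given the same mail and the same filter.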

Let us now consider some of the techniques in actual use.

Text Analysis

Text analysis judges whether the message is spam purely on the basis of its content. It depends on spam occupying a sufficiently distinct text-space from non-spam. Unlike a human judgement as to whether a message is “bulk”, this approach does not generally involve any understanding of the message content, just statistical analysis of the text. The accuracy of this method varies widely depending on the algorithm used and the characteristics of the incoming mail. It can be quite effective, but would be a poor choice for an “abuse” address which is supposed to accept spam complaints.
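As an illustration of purely statistical judgement, here is a minimal sketch of a token-frequency classifier in the naive Bayes style. This is one common family of algorithms, chosen here as an assumption; the class and method names are hypothetical, and real filters are considerably more elaborate.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Split text into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())

class TokenClassifier:
    """Scores text by how much more often its tokens appeared in spam than in ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message under the label 'spam' or 'ham'."""
        self.counts[label].update(tokens(text))
        self.messages[label] += 1

    def spam_log_odds(self, text):
        """Return log odds of spam; positive leans spam, negative leans ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Add-one smoothing throughout, with one extra slot for unseen tokens.
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score
```

The classifier never asks what a message means, only which words it contains. That is exactly why it misfires on an “abuse” address: a complaint that quotes the offending spam shares most of its vocabulary with the spam itself.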

A variation on statistical analysis is the detection of very particular features in the message which are believed to be unique to (or at least strongly characteristic of) spam. This examines specific individual traits, as opposed to detecting an overall pattern characteristic of spam. When spammers become aware of such tests, it is generally easy for them to side-step them, but the approach can be highly effective in the short term, especially in the heat of a virus outbreak.

All methods involving text analysis operate at the recipient’s system after receipt of the message text. As such, they do nothing to address the hidden cost of spam, and any processing effort expended in evaluating the message adds to that hidden cost. Any “bounce” message generated as a result of rejecting the message at this late stage amplifies the cost further, and will probably constitute additional spam if the source address was forged. Filing suspected spam in a separate folder avoids some of the costs (and the possibility of generating more spam), but usually results in delayed and/or less reliable discovery of false positives relative to bouncing the message.

Source Address Blacklisting

Source address blacklisting is an aggressive approach which refuses all mail from sources which have a known bad history of sending spam, a bad reputation for the same, or some other feature which warrants blacklisting as a bad risk. There are also other applications for general lists of IP addresses, but refusing delivery of mail before “DATA” in SMTP is the application I wish to discuss here. Many sites maintain private lists of addresses which are no longer welcome, but the better known (and more controversial) instances of blacklisting involve publication in the domain name system.
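For readers unfamiliar with DNS-published blacklists, the lookup convention is simple: reverse the octets of the connecting IPv4 address, append the blacklist’s zone, and query for an address record. An answer means “listed”; a failed lookup means “not listed”. The sketch below assumes that convention. The zone name `dnsbl.example` used in the usage note is a placeholder, not a real list, and the function names are hypothetical.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True when the blacklist zone has an entry for the address.

    Any lookup failure is treated as "not listed". That includes resolver
    errors, so a production system would want to tell those cases apart.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except OSError:
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example")` yields `99.2.0.192.dnsbl.example`. A mail server would call something like `is_listed` when the connection arrives and, if the address is listed, refuse the mail before “DATA”.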

The accuracy of the approach depends entirely on the proportion of spam in the email emitted by the blacklisted site. If a site emits nothing but spam, then the technique is perfect, but this is rarely the case. Unlike most anti-spam techniques, blacklisting reduces the hidden cost of spam by preventing transmission of the message. False positives (and true positives, for that matter) are brought to the attention of the sender as non-delivery notices if the sender’s systems are standards compliant. As all mail from a blacklisted source is refused, there can be no such thing as a false negative among that source’s messages.

But to view blacklisting as a purely technical anti-spam strategy is to miss an important point: the potential social impact of public blacklists. Public blacklists do not have any effect in and of themselves: they are merely published lists of addresses. It is how people use these addresses which generates the impact. If a blacklist publishes the addresses of networks that do not meet certain standards of behaviour, and many people use this blacklist to selectively permit incoming mail, most network operators will be faced with a choice: conform to the standards, or suffer reduced email connectivity. Thus, blacklists are a means of applying peer pressure between independently operated networks.

The full impact of source address blacklisting as an anti-spam technique can only be appreciated in this light. The social pressure it brings to bear encourages many major players to keep their acts clean, and all mail recipients therefore enjoy some benefits of blacklisting whether they use it or not. Without it, many networks might welcome spammers as high-use customers; as it stands, many email service providers are good actors, taking reasonable measures to ensure that their networks are not spam sources, and the threat of blacklisting is what motivates most of them to expend this effort.

Other Techniques

Whitelisting is effective as an anti-spam technique, but it is overkill. It eliminates all sources which are not pre-approved, and so long as all the approved sources can be trusted to operate within the bounds of acceptable behaviour, it eliminates spam. It also eliminates any possibility of using the email address in question as a means of introduction. In the absence of a sender identification system, there exists the possibility that a spammer can circumvent a whitelist by forging a whitelisted address, but the spammer would have to obtain information about individual whitelists in order to exploit this loophole. Whitelists can be implemented such that they address the hidden cost of spam, given certain constraints on design.

Greylisting eliminates those senders which attempt delivery in a “hit and run” manner, not reattempting delivery after a temporary failure as the standards require. This has nothing to do with the characteristics of spam in a direct sense, but it so happens that many spammers use “ratware” delivery systems which are egregiously non-compliant with regard to standards, and this technique efficiently prevents communication of messages from such systems. It also introduces a delay in which other data (such as blacklist entries) may become available, and this is the only long-term benefit of the approach if spammers fix up their standards compliance. It can also cause legitimate mail to be delayed, of course. Delaying (or preventing) delivery in this manner is relatively light on resource usage.
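A minimal sketch of the idea follows. It assumes the common arrangement of keying on the triplet of client address, envelope sender, and envelope recipient, with a fixed minimum delay before a retry is accepted. The class name, the five-minute default, and the return values are hypothetical.

```python
import time

class Greylist:
    """Defers the first delivery attempt for each (client, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.clock = clock          # injectable so the delay can be tested
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return "defer" (a temporary failure) or "accept" for this attempt."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "defer"
```

The decision needs only the envelope, so it can be made before any message text is transmitted. A compliant sender queues the message and retries later; a “hit and run” sender never comes back. A real deployment would also expire old entries, which this sketch omits.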

Sender identification systems such as SPF, DKIM, and many others attempt to create a verifiable association between an email and some domain name. Once again, this has nothing to do with the direct characteristics of spam, and the presence of a verifiable sender identity says nothing about whether the message is spam or not. Even so, a certified sender identity offers one more datum on which to blacklist or whitelist messages, and can mitigate some of the problems associated with other activities. Known identities can, in this manner, be given specific treatment without risk of misapplication. For example, known good actors can be exempted from filtering without risk of admitting bad actors with false identity. Spammers frequently make false identity claims, but they can stop lying (and keep spamming) if so obliged.
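The point about exempting known good actors can be sketched as a small policy function. It assumes some earlier step (an SPF or DKIM check, say) has already produced either a verified domain or nothing; that step is not shown, and the function name and return values are hypothetical.

```python
def treatment(verified_domain, trusted, blocked):
    """Decide how to handle a message given its verified sender domain.

    verified_domain is None when no identity check passed. Only a
    verified identity can earn special treatment, good or bad; everything
    else goes through ordinary filtering.
    """
    if verified_domain is not None:
        domain = verified_domain.lower()
        if domain in blocked:
            return "reject"
        if domain in trusted:
            return "deliver"
    return "filter"
```

A message that merely claims to come from a trusted domain, without verification, gets `"filter"` rather than `"deliver"`. Equally, a verified domain that appears in neither set is still just filtered: the identity alone says nothing about whether the message is spam.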

Challenge/response, in its broadest sense, attempts to determine that some source address of the message is monitored by a human being capable of taking some requested action. This effectively precludes the possibility that the message is sent in bulk, in most cases. There usually exists the possibility that the challenge message itself will constitute spam, since the address to which the challenge is sent may be a third party who has not expressly requested the challenge, and challenges qualify as “bulk” under our definition—and can be sent in arbitrarily large quantities to boot. The transmission of this extra data generally means that such systems increase the hidden cost of spam. They also create extra work for the recipient of the challenge as a matter of course, whether that challenge constitutes spam or not.
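The mechanics can be sketched as follows, assuming one simple design: the message is held, a token derived from a local secret is sent to the claimed source address, and the message is released only if that token comes back. The class, its methods, and the token format are hypothetical, and the sketch does not send any mail itself.

```python
import hashlib
import hmac

def make_token(secret, sender, message_id):
    """Derive an unguessable token tied to one sender and one held message."""
    data = f"{sender.lower()}|{message_id}".encode("utf-8")
    return hmac.new(secret, data, hashlib.sha256).hexdigest()

class ChallengeQueue:
    """Holds mail from unknown senders until they answer a challenge."""

    def __init__(self, secret):
        self.secret = secret    # bytes known only to the recipient's system
        self.held = {}          # message_id -> sender awaiting a response
        self.confirmed = set()  # senders who have answered a challenge

    def receive(self, sender, message_id):
        """Return ("deliver", None) or ("challenge", token_to_send)."""
        sender = sender.lower()
        if sender in self.confirmed:
            return ("deliver", None)
        self.held[message_id] = sender
        return ("challenge", make_token(self.secret, sender, message_id))

    def respond(self, message_id, token):
        """Release a held message if the returned token is correct."""
        sender = self.held.get(message_id)
        if sender is None:
            return False
        expected = make_token(self.secret, sender, message_id)
        if not hmac.compare_digest(expected, token):
            return False
        del self.held[message_id]
        self.confirmed.add(sender)
        return True
```

Every `"challenge"` result here implies an outgoing message to whatever address the incoming mail claimed as its source. If that address was forged, the challenge lands on an uninvolved third party, which is the problem described above.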

Conclusion

It seems that an ideal anti-spam system is a practical impossibility, given our working definition of “spam”. Every practical approach we have to the problem attacks it obliquely, rather than directly. I would therefore encourage a liberal view of what counts as an “anti-spam” system. It is shallow criticism to say of a certain approach that “it is not an anti-spam system” when all this means is, “it has no immediate impact on the state of my inbox”. The state of individual inboxes is merely the most obvious part of the problem.

Indeed, rather than frame the discussion as “anti-spam”, I suggest we consider the broader picture of “email systems and their properties, particularly in relation to hostile or abusive participants”. In this light, for example, sender identification systems can be seen as a means to prevent senders from making false identity claims, whereas they are seen as largely irrelevant when the discussion is “anti-spam”. A system capable of sender identification has clear benefits over one that does not. Should we ignore these benefits simply because they don’t relate directly to spam prevention?

Spam is a symptom—a symptom of a sick society, ultimately—and email systems can mitigate or exacerbate the symptoms, depending on their properties, but never fix the root cause. Thus, in the end, all we can ask of an email system is that it mitigate the harm caused by spammers and other miscreants as much as possible. In considering any approach to email, we ought to judge it on its own particular benefits and costs in this regard. The benefits aren’t limited to “a cleaner inbox”: they may consist in generally reduced costs to recipients, the ability to offer preferential treatment to good actors, the social clout to ostracise spammers, and so on.

Let’s learn to appreciate the hidden costs and benefits of these techniques, and to think outside the inbox.
