Tuesday, April 20, 2004

$Þǻm αהּd †hε dε†эc†iδή δf Mæהּ|ήg…

Can e-mail spam help us understand the inherent difference between deterministic decision making and rationalistic decision making? Typically, the contents of a spam’s subject field will give us an indication of the message contained in the actual e-mail. Consider these examples of spam subject lines that may grace our in-boxes:
“save munny!” “àdv=exclusive_äpplying-löans~with-cäsh out ón your home now” “V!@gra.Val|i|um.” “Save upto 1800 bukz on your favourive so0ftwarez.”
The push to sell a drug that rhymes with “niagra” seems to be relentless. In an attempt to stop this invasion of spam, we may tell our e-mail system to filter out all messages that contain the word (per our rhyming example) “niagra.” But the spammers try to outwit us by sending a message in which they replace “niagra” with “n!@gra” in the subject field. Our e-mail filter program cannot ascertain that the intended meaning of “n!@gra” is equivalent to that of “niagra.” We may respond by telling our anti-spam filter to stop all e-mail that contains “n??gra,” “*gra,” and “niag*,” in which the wildcard characters “?” and “*” are meant to catch the variant spellings of “niagra.” But, once again, the spammers can easily respond with “ni ag ra,” “n.i.a.g.r.a,” or “nye-ag-rah.” In virtually every case, we are able to ascertain the intended meaning of the spammer’s message, whereas the software, operating purely by the determinism of the simple instructions we have given it, is not.
This becomes even more interesting when one looks at some of the more complex methods that anti-spam software programs use to filter out spam e-mail. Companies such as Brightmail employ sophisticated algorithms in their attempt to achieve a 99.9999% accuracy rating in filtering spam. Let’s step back for a moment and look at the process that is going on here: spammers design e-mail messages so that they will pass through a filter, and anti-spammers design that filter so that it will block those same messages. Note the manner in which Brightmail describes the process:
“Maintaining high effectiveness rates over time is challenging because spammers are constantly motivated to evade filters. Spammers… continually change their tactics, and time their attacks strategically within narrow dissemination windows. Their sophisticated tools create new spamming techniques, such as the randomization of headers and bodies. For these reasons, many solutions work well for the first few months but then their performance starts to deteriorate. In order to be consistently effective over the long run, an anti-spam solution must be responsive to these changes in attacks and methodologies.”
“BrightSig™ technology is the cornerstone of Brightmail’s signature technology. When messages flow into the BLOC, they are compressed using proprietary algorithms into a unique “signature,” which is added to the database of known spam. Using this signature, BrightSig groups and matches seemingly random messages that originated from a single attack. Effective grouping allows Brightmail to create tight, targeted rules without having to write numerous such rules against a single attack. By distilling a complex and evolving attack to its DNA, more spam can be deflected with a single rule. As spammers adapted to BrightSig filtering, Brightmail introduced the next generation signature technology, BrightSig2. BrightSig2 has specific defenses against HTML-based spam, combating randomization and HTML noise (comments, constants, bad tags) that spammers insert to evade filters.”
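Before going further, the simple wildcard filter described earlier can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name is_blocked is hypothetical, and matching the subject line word by word is just one way such a rule might be applied. It uses the standard library’s fnmatch module, where “?” matches one character and “*” matches any run of characters.

```python
from fnmatch import fnmatchcase

# Patterns taken from the post: the literal word plus the three wildcard rules.
BLOCKED_PATTERNS = ["niagra", "n??gra", "*gra", "niag*"]


def is_blocked(subject: str) -> bool:
    """Return True if any whitespace-separated word matches a blocked pattern.

    Hypothetical helper: matching word by word is an assumption made for
    this sketch, not a description of any real mail filter.
    """
    words = subject.lower().split()
    return any(
        fnmatchcase(word, pattern)
        for word in words
        for pattern in BLOCKED_PATTERNS
    )


if __name__ == "__main__":
    for subject in ["buy niagra now", "cheap n!@gra",
                    "ni ag ra", "n.i.a.g.r.a", "nye-ag-rah"]:
        print(f"{subject!r}: blocked={is_blocked(subject)}")
```

The first two subjects are caught, while “ni ag ra,” “n.i.a.g.r.a,” and “nye-ag-rah” pass straight through, because no single word in them matches any rule. Note also that a rule as broad as “*gra” blocks any word ending in those letters, such as “pellagra,” which hints at why overaggressive rules are a problem of their own.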
At first glance, phrases in Brightmail’s description such as “evolving,” “adapting,” “randomization,” and “attack to its DNA” make it appear as if the spam / anti-spam process is a good analog of Darwinian evolution. Yet note that what is going on here is hardly natural selection occurring across random variations. Both sides, the spammers and the anti-spammers, are designing their systems to address specific types of attacks. But I veer off on a tangent…
Let’s consider one of the ways in which spamming technology attempts to circumvent anti-spam filters and how it relies on the rationalistic processes found in the human mind. Were you able to recognize the meaning of the title of this blog post? Only 8 of the 27 characters used are valid. How about the following line?
ΛΛε דh١ήﻶς |† ∟!ќε ǻ ώæ$ε£
Although none of the characters used are valid, you can probably figure out the message.
A criticism may immediately be raised that the invalid characters look like the very letters they are impersonating. And, someone may ask, what about the fact that we already have a good idea of the gamut of topics that spammers are likely to push on us? Aren’t we, as we decipher the message, just accepting the data that corresponds with a known word and rejecting the data that doesn’t? Well, yes, we are (and in a manner not unlike that of reading someone else’s handwriting). But that simply reiterates the point that in determining the meaning of the altered words, we are processing the information we see, correlating it with a possible existing meaning, and then coming to a rationalistic conclusion regarding its intended meaning.
Note how this is qualitatively different from trying to break, say, a cryptographer’s code. With an encrypted message, such as in wartime, the intent of the sender is to ensure that no one but the recipient understands the message. Such a code is typically structured with a set of deterministically based rules that only the sender and recipient are privy to.
The message is coded with the intent that, if it were to be received by anyone other than the intended recipient, its meaning would not be readily apparent. In the world of spamming, the opposite is true. The message is coded in such a manner as to (hopefully) pass through a deterministic filter, and yet still be understood by a casual recipient. Such a code does not necessarily have to follow a set of deterministically based rules but, rather, relies on the mind’s ability to interpret the coded data with regard to form, context, phonetics, etc.
For the spammer, the letter “a” can be designated by any one of the following characters: “@,” “ǻ,” or “α.” Through the interpretive process, within the context of the passage, the mind is able to conclude that any one of these characters actually designates an “a.” A software program, on the other hand, is forced to rely strictly on the instructions given by the algorithms within its written code. One rub is that even though “@,” “ǻ,” and “α” are all interpreted to mean “a,” they each have their own unique meaning (separate from the letter “a”). Another rub is that “ή.i.αgrα,” for example, is an incorrect spelling of “niagra.” Even though we know what the spammer meant, the actual word used is wrong, and we know that it is wrong.
Okay, so all we’ve done is show that humans can understand the meaning of an e-mail subject line that may slip through a software filter. It doesn’t prove anything, since we already know that e-mail is generated by intelligent design (for the most part). But is that all we’ve done? Consider the argument that a specific meaning is unrecognizable to a deterministically based analysis program unless the program is instructed, through its code, to recognize the meaning as such. The program makes its choices strictly on the results of its deterministic algorithms. With the human mind, that is not the case.
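To see what “instructed, through its code” looks like in practice, here is a minimal Python sketch of a filter that has been told about the look-alike characters. Everything in it is my own assumption for illustration: the table LOOKALIKES, the separator list, and the helper names normalize and mentions are hypothetical and not taken from any real product. Each entry in the table is one more hand-written rule.

```python
# Hypothetical look-alike table; every entry is a rule a human had to write.
LOOKALIKES = str.maketrans({
    "@": "a", "ǻ": "a", "α": "a",  # the three stand-ins for "a" named above
    "ή": "n",                      # Greek eta with tonos, standing in for "n"
    "!": "i", "|": "i",            # punctuation standing in for "i"
})

# Characters spammers insert between letters (an assumed, incomplete list).
SEPARATORS = " .-"


def normalize(text: str) -> str:
    """Map known look-alikes to plain letters and drop separator characters."""
    mapped = text.lower().translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch not in SEPARATORS)


def mentions(text: str, word: str = "niagra") -> bool:
    """Return True if the normalized text contains the target word."""
    return word in normalize(text)


if __name__ == "__main__":
    for subject in ["ή.i.αgrα", "n!@gra", "ni ag ra", "nye-ag-rah"]:
        print(f"{subject!r} -> {normalize(subject)!r}, "
              f"caught={mentions(subject)}")
```

With the table in place, “ή.i.αgrα,” “n!@gra,” and “ni ag ra” all normalize to “niagra” and are caught. The phonetic spelling “nye-ag-rah” normalizes to “nyeagrah” and still slips by, because nobody has written a rule for it yet. The program recognizes exactly the substitutions it has been told about and nothing more.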
The very qualities of the spam subject line that cause it to slip through the deterministic filter are what allow us to understand the intended meaning of the message! In fact, consider the process that Brightmail describes for maintaining its 99.9999% accuracy rate in keeping spam e-mail out of your mailbox:
“What is your accuracy rate?
99.9999%
How do you maintain this rate?
Automated safeguards are built into the technology. For example, rules are tested against a legitimate mail database
Manual safeguards protect against overaggressive rules
Support for false positive submissions from Brightmail’s user community—300 million strong— inform Brightmail as soon as possible
Human review of every false positive received”
דhα†’$ r¦ﻍh†, hựmǻהּ rεv|εώ.
