Meta Ideas: Comments on Spam and Natural Language Processing

Tuesday, October 07, 2003

Comments on Spam and Natural Language Processing

I think there is a highly interesting link between research on natural language processing and the war on spam. I'm not sure both sides are aware of the extent of this relationship, and this is both good and bad. Bad, because the two fields use different methodologies and, with a few exceptions, don't talk much or share as much information as they could. But it is also good, in the sense that different approaches are being pursued outside academia, by people who have a real problem on their hands and are very practical about it.

Natural language processing

Natural language processing is one of the still unfulfilled promises of the information age. During the 50's and 60's, it was considered safe to predict that we would have systems capable of holding a conversation before the turn of the century. Well, we're in 2003 now, and "2001"'s HAL still seems like a distant dream. I'm not talking about the most advanced stuff, such as speech recognition or artificial reasoning; the problem is that we can't even build a system that understands the meaning of written language.

Early natural language theorists focused their work on the structural aspects of language. This was assumed to be the most promising approach, as it would lead to deterministic algorithms able to analyze any text and extract its meaning. However, the importance of structure was initially greatly exaggerated. Over time, the importance of context was recognized, and it became clear that understanding language, even in its written form, was a much more difficult problem than originally thought.

One of the effects of this realization was a continuous scaling back of the goals. At first, it seemed possible to have computer systems do simple office work like reading and writing business letters. When it became clear that this was a really hard task for a computer - even with all the improvements in processing power - the goals were reduced, and today we're still limited to spell checking and fairly simple translation applications. This example helps to illuminate one of the main problems with the development of natural language processing systems: there is no usable intermediate step. Today's systems are so simple, by human standards, that they come nowhere near the 'intelligent' level. And we seem to be very far from any practical application for business purposes.

The lack of business applications makes it much more difficult to attract investment to the development of such systems; there is much ground to cover, and limited chances to make money in the short term. Current research is now mostly limited to non-profit institutes and universities - where it is treated as pure science - or to big companies like IBM and Microsoft, which can invest with long term profit in mind.

(Of course, Google now qualifies as one of the big companies, and they are moving fast in this direction. Having started as a search company, they're now looking far beyond search, and NLP applications are in sight - especially now that they are investing in things such as blogging and webmail applications.)

The anti-spam war

Right now, there is an arms race between spammers and their opponents, the anti-spam software developers. Eric Allman - the author of the widely used sendmail software - was quoted recently as saying that this is a war where everyone loses, except perhaps the arms dealers. While Mr. Allman's prediction may seem overly catastrophic, the situation is pretty much as stated - but we may be learning something very important in the process.

The war on spam is a show of Darwinian evolution at work. Every new generation of spam is matched by a new generation of anti-spam software. For every new technique in the anti-spam arsenal, spammers are quick to react with a new and improved attack. The main difference between this war and the one being fought by virus and anti-virus writers is that the latter focuses on the evolution of code, while the former focuses on the evolution of text recognition. The goal of spammers is to craft messages that can be understood by human readers, but not recognized by automatic anti-spam filters. Both sides also follow strict economic guidelines - doing only as much as needed, at the lowest possible cost. This approach is perfect for the evolutionary game, and will lead to a slow but constant flow of improvements on both sides.

Many of the techniques recently developed to filter spam are beginning to reflect improvements in our understanding of natural language processing. First generation filters used simple word filtering techniques, but this type of filter was prone to false positives (blocking legitimate email as if it were spam). Bayesian filters are an improvement, because they do not rely on static lists of words; instead, they learn from both legitimate and spam messages. But now spammers are improving their game once more. Two of the new trends are the use of random or deliberately misspelt words, and better written messages that can pass as legitimate email. The first technique was devised to exploit a weakness of word-based filters, whether static or Bayesian: the mangled words no longer match anything the filter knows, yet people can still read them. That we can read words with punctuation characters in the middle is not totally unexpected. But it has been shown recently that we can still read words... even when some of the letters are shuffled. Recent spam messages make full use of this knowledge to avoid being filtered.
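
As a rough illustration of the word-based Bayesian filtering described above, here is a minimal sketch in Python. It is a textbook naive Bayes classifier with add-one smoothing, not the implementation of any particular anti-spam product; the class name, the tokenizer and the tiny training messages are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Hypothetical word-based filter that learns from labelled messages."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.words[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_score(self, text):
        """Log-odds of spam versus legitimate mail; above zero leans spam."""
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        spam_total = sum(self.words["spam"].values())
        ham_total = sum(self.words["ham"].values())
        # Start from the prior: how much of the training mail was spam.
        score = math.log(self.messages["spam"]) - math.log(self.messages["ham"])
        for word in tokenize(text):
            if word not in vocab:
                continue  # a token never seen in training carries no evidence
            # Add-one smoothing avoids taking the log of zero.
            p_spam = (self.words["spam"][word] + 1) / (spam_total + len(vocab))
            p_ham = (self.words["ham"][word] + 1) / (ham_total + len(vocab))
            score += math.log(p_spam) - math.log(p_ham)
        return score


# Made-up training data, two messages of each kind.
spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap pills online buy now", "spam")
spam_filter.train("meeting notes for the project", "ham")
spam_filter.train("lunch at noon tomorrow", "ham")

print(spam_filter.spam_score("buy cheap pills"))           # positive: leans spam
print(spam_filter.spam_score("project meeting tomorrow"))  # negative: leans legitimate
print(spam_filter.spam_score("b.u.y ch3ap p1lls"))         # no known tokens, so no evidence
```

The last line is the evasion in miniature: once the words are broken up with punctuation or digits, this simple tokenizer yields fragments the filter has never seen, so the message scores as if it said nothing at all, while a person reads it without effort.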

But it is the second trend that is more interesting (if disturbing). Spammers are getting better at disguising their messages with text that reads like legitimate email. In some cases, the messages are subtle. We (the human readers) know that they are spam, but it is difficult to point out any particular feature of the message that marks it as spam. Some people have pointed out that, instead of working on filters that recognize spam, we should be doing the opposite: writing filters that recognize legitimate email, and discarding everything else as spam. Either way, it is clear that future anti-spam filters will have to be much better (and cleverer) than the current generation. Natural language processing is now the way to go.

What surprises does the future reserve for us?

The war on spam presents a unique opportunity to watch the invisible hand of evolution as it guides things. It may also help to shed some light on the evolution of human language. How did we evolve such a complex and well structured system? Nature does not work by designing a plan and following it; it works by trying random stuff zillions of times, until it finds something that works, that is cheaper and more effective than the alternatives. If there is some structure in the end result, it is perhaps an accident of the way the solution evolved. Had we started from a different point, the final solution could have looked completely different.

Evolution is all about taking one step at a time. Today, anti-spam software development offers exactly the kind of intermediate step that research on natural language processing has been missing. Now that the war has started, it is a matter of time until better solutions get developed. I'm sure that the next few years will see new techniques developed by both sides. Spammers will always be slightly ahead, because they are the ones with the economic incentive to make the first move. But anti-spam vendors will catch up quickly, and our knowledge will evolve at the same pace.

But there may be more problems in the future. As processing power and bandwidth grow cheaper, it may become feasible to send individually customized messages to each and every person. In the long term, we may end up with spam that is indistinguishable from legitimate messages. For example, it is possible that spammers will plant AI bots at services such as Friendster with the sole purpose of chatting with other people, and then trying to sell them something in a subtle way. How can we tell if we are meeting the date of our dreams, or if we're being cheated by a bot? That will probably be the ultimate Turing Test - one that I'm not willing to take part in myself.

¶ 2:42 PM

Comments:

Great post. Let's see where this war is going next.

# posted by Fabiano G. Souza : July 27, 2004 at 4:53 PM

Bush is forever saying that democracies do not invade other countries and start wars. Well, he did just that. He invaded Iraq, started a war, and killed people. What do you think? Is killing thousands of innocent civilians okay when you are doing a little government makeover?
What happened to us, people? When did we become such lemmings?
The more people that the government puts in jails, the safer we are told to think we are. The real terrorists are wherever they are, but they aren't living in a country with bars on the windows. We are.

# posted by Anonymous : February 18, 2007 at 10:00 PM

