Wednesday, February 13, 2008

Bayesian Spam filtering

This is from the blog: Time!

http://ajabgajab.blogspot.com/2007/07/bayesian-spam-filtering.html



Imagine a situation: you are receiving more than a hundred emails. You have to read each of them and classify it as good or bad for your BOSS.
Stressed? It seems like a silly question because, nowadays, that is not the real situation, right?
You see two folders, Inbox and Bulk (spam), and one more called Trash.
Life is so easy!
Still, some spam makes its way through.
How is that possible?

If you are receiving less spam, you have Bayesian spam filtering to thank.
It works by learning,
exactly as we would if we faced the first situation.
We would start classifying the mails according to their contents and some keywords provided by the boss. If a new situation confuses us, we are free to ask the boss. Subconsciously we would attribute a spam coefficient to each mail and finally use it to decide whether the email is spam or not.
So there will be some training data (let's say) to begin with. Each time we classify an email, we get better at telling whether a mail is spam or not. After we gain enough expertise, there is no spam left for your BOSS to read (sounds ambitious). He is also happy that he has to train you less and less to classify the incoming mails, as you gain expertise in the environment.
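To make the learning step concrete, here is a minimal sketch of a word-counting (naive Bayes) filter. It is only an illustration: the class name TinyBayesFilter, the "spam"/"ham" labels, the whitespace tokenizer and the add-one smoothing are all my assumptions, not the design of any real mail filter.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam filter (illustrative only, names are hypothetical)."""

    def __init__(self):
        # How often each word was seen in mails of each kind.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # How many mails of each kind the boss has labeled so far.
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one mail together with the boss's verdict ('spam' or 'ham')."""
        self.mail_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        """Return P(spam | words of the mail) under the naive Bayes assumption."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: smoothed fraction of mails carrying this label.
            log_p = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                log_p += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_score[label] = log_p
        # Normalize the two log scores into a probability without overflow.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)


# The boss labels a few mails; every label is one more piece of training.
mail_filter = TinyBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("win money now", "spam")
mail_filter.train("meeting agenda attached", "ham")
mail_filter.train("project report due monday", "ham")

print(mail_filter.spam_probability("cheap pills"))
print(mail_filter.spam_probability("meeting agenda"))
```

Each call to train plays the role of the boss correcting you once; the more mails he labels, the better the counts reflect his mailbox.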
Moreover, you are smart enough not to mark a mail as spam just because it contains a few keywords associated with spam. It is the overall mail that affects your decision. Am I right?
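One common way to write down "the overall mail decides" is Bayes' rule over all the words at once. The formula below is a sketch under the usual naive assumption that the words w_1, ..., w_n occur independently given the class; the symbols are mine, chosen for illustration.

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
         {P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
          + P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})}
```

A single spammy keyword contributes only one factor to the product, so a mail full of otherwise ordinary words can still come out as good mail.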
Suppose you change offices, say from management to health. The nature of the mails is very different. For example, the term “Pills” may no longer be a sign of spam! Meanwhile, a very good-looking proposal such as “invitation to join business partnership from africa” is likely to be spam. If you mark mails using the training gained in the previous office, you are in trouble!
It would be better for your boss to read all the emails (including spam) himself than to lose a single (but important) mail.
The advantage of Bayesian spam filtering is that it adapts to each user: the coefficient of spamness differs from user to user.
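Here is a small sketch of why the coefficient differs from user to user: the same word gets a different score depending on whose mail trained the filter. The function word_spamminess, the example mails, the equal priors and the add-one smoothing are all assumptions made up for illustration.

```python
def word_spamminess(word, spam_mails, ham_mails):
    """Estimate P(spam | mail contains word), assuming equal priors."""
    word = word.lower()
    # Count the mails of each kind that contain the word at least once.
    in_spam = sum(word in mail.lower().split() for mail in spam_mails)
    in_ham = sum(word in mail.lower().split() for mail in ham_mails)
    # Add-one smoothing so an unseen word never gives exactly 0 or 1.
    p_word_given_spam = (in_spam + 1) / (len(spam_mails) + 2)
    p_word_given_ham = (in_ham + 1) / (len(ham_mails) + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Hypothetical training mail from the management office.
management_spam = ["cheap pills online", "pills discount"]
management_ham = ["quarterly budget meeting", "staff review"]

# Hypothetical training mail from the health office.
health_spam = ["business partnership from africa", "win money"]
health_ham = ["patient needs pills refill", "pills dosage chart"]

management_score = word_spamminess("pills", management_spam, management_ham)
health_score = word_spamminess("pills", health_spam, health_ham)

print("management office:", management_score)
print("health office:", health_score)
```

With these made-up mails, "pills" leans toward spam in the management office and toward good mail in the health office, which is exactly the kind of difference a filter trained per user picks up.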
Well, look at the situation through the eyes of a spammer! You will clearly see how difficult it is to spam the mailbox. You would be forced to think:
HOW TO SPAM? Some people just cannot sleep without spamming.
Because, even if you manage to get through, the way you found will work only once; there is no second chance through the same door. Once the mail is marked as spam (training), that trick will not work the next time.
Learning makes it possible.
Useful readings:
 
I am highly inspired by:
http://www.paulgraham.com/spam.html
and listening to Prof. Kevin Knuth, Prof. Carlos Rodriguez, Adom Giffin and Roger Pink.
Recommended texts:
 Data Analysis: A Bayesian Tutorial
 Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians (Chapman & Hall/CRC Texts in Statistical Science)
 Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica® Support