[BayesJunkTool] multi-word tokens
Darxus at chaosreigns.com
Mon Oct 6 13:53:40 EDT 2003
From the information I've been able to find, I'm guessing that the Bayes
Junk Tool uses single-word, probably case-sensitive tokens. I believe
you will find significant improvements in accuracy if you use both one-
and two-word tokens: read in all single-word tokens, then read in
every pair of adjacent words as tokens (using the same definition of word
and nonword characters).
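
A rough sketch of what I mean, in Python. This is only an illustration, not
how the Bayes Junk Tool or spamprobe actually tokenizes: the function name
tokens(), the WORD_RE pattern, and the max_words and lowercase parameters are
all made up for the example, and the real definition of a word character
should be whatever the filter already uses.

```python
import re

# Assumed definition of a "word": runs of letters, digits, and a few
# punctuation characters. Substitute the filter's real definition here.
WORD_RE = re.compile(r"[A-Za-z0-9'$-]+")


def tokens(text, max_words=2, lowercase=False):
    """Return all 1..max_words-word tokens found in text.

    Single words come first, then adjacent pairs, and so on. Multi-word
    tokens are the component words joined by one space.
    """
    if lowercase:
        text = text.lower()
    words = WORD_RE.findall(text)
    result = []
    for n in range(1, max_words + 1):
        # Slide a window of n adjacent words across the message.
        for i in range(len(words) - n + 1):
            result.append(" ".join(words[i:i + n]))
    return result


print(tokens("Buy cheap pills now"))
# ['Buy', 'cheap', 'pills', 'now', 'Buy cheap', 'cheap pills', 'pills now']
```

Each token, single or paired, would then be counted and scored exactly the
way single-word tokens are now.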
"Brian Burton reports an astonishing 99.96% with his multi-word Bayesian
SpamProbe." - http://www.paulgraham.com/sofar.html (August 2003)
The creator of crm114 recently ran some tests comparing single-word
token bayesian filtering with a few kinds of 5-word token filtering:
http://sourceforge.net/mailarchive/forum.php?thread_id=3224944&forum_id=32320
The maximum number of words in a token is configurable in spamprobe, but
the author has told me that his tests have shown that the benefits of going
beyond two are not worth the cost.
But what convinces me the most is the difference I've seen between
spamassassin's single-word, case-sensitive bayesian filtering and
spamprobe's 1-2 word, case-insensitive bayesian filtering. Spamassassin
was about 96% accurate, which left a lot of spam in my inbox. spamprobe
has been perfect for the four days since I installed it; I trained it on 21
days of email and have not trained it on new mail since. If spamprobe were
to misclassify an email now, it would be 99.86% accurate, but it's still
at 100%.
I'm not personally interested in using mozilla mail. I'm interested
in everyone using bayesian spam filtering, to discourage spammers.
And the Mozilla Bayesian Junk Mail Filter is the only one I know of that
is usable on Windows or MacOS.
--
"I would believe only in a God that knows how to Dance." - Nietzsche
http://www.ChaosReigns.com