[Spamprobe-users] random words within html tags
Alvaro Cardenas
2008-12-11 01:35:48 UTC
Hi all,

I am a little confused about the HTML parsing that spamprobe does. The main description of how spamprobe works says that HTML tags are ignored.
However, I am including an example from a recent spam email I received.


This is included within the <body> of the message:

<STYLE> said Gervasio, relate to others and activities plenty of time when they can Atlanta, Georgia.said Gervasio, children are plopped in
Here's some soothing </STYLE>

and this is the output of spamprobe when scoring the message with the -T option:

Spam Prob  Count  Good  Spam  Word
0.9999990      1     0  2920  padding
0.9999973      1     0   999  0px
0.9999931      1     0   388  men
0.9999926      2     0   360  gervasio
0.0000108      1     4     0  already have
0.0000108      1     4     0  we already
0.9999848      1     0   175  relate to
0.9999847      1     0   174  georgia
0.9999769      1     0   115  margin 0px
0.9999767      1     0   114  0px padding
0.9999749      1     0   106  Hsubject_we have
0.9999749      1     0   106  atlanta
0.9999749      1     0   106  others and
0.9999734      1     0   100  are plopped
0.9999734      1     0   100  plopped
0.9999734      1     0   100  plopped in
0.9999734      1     0   100  to others
0.9999726      1     0    97  plenty of
0.9999704      1     0    90  children are
0.9999704      1     0    90  s some
0.9999704      1     0    90  some soothing
0.9999704      1     0    90  soothing
0.9999694      1     0    87  2px
0.9999675      1     0    82  said gervasio


It is clear that spamprobe is using these "hidden" words as part of the classification.

How can I make spamprobe ignore words that are not visible to the end user?

I know that in this case, spamprobe classifies the message as spam, but I am not sure it will classify messages correctly the next time.
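
To illustrate what I mean by words that are not visible to the end user, here is a rough Python sketch of the kind of filtering I have in mind. It is not part of spamprobe and says nothing about how spamprobe parses HTML internally; VisibleTextExtractor and visible_text are hypothetical names I made up for the example. It assumes that only the contents of <style> and <script> blocks need to be dropped, so it would not catch text hidden by other means (for example CSS such as display:none).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of non-rendered elements.

    Hypothetical helper for illustration only; not a spamprobe API.
    """

    # Assumption: only these elements carry text the reader never sees.
    HIDDEN_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <STYLE> arrives as "style".
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every hidden element.
        if self._hidden_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the kept chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text of an HTML fragment without <style>/<script> contents."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With something like this, the words inside the <STYLE> block above would never reach the tokenizer, while text in the rest of the body would still be scored.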

Thanks,
Alvaro


