Bayes, forever alive
So far, the methods of fighting spam have proved ineffective from the mail users’ point of view.
It seems, however, that the solution to the problem lies in the statistical methods devised by Thomas
Bayes in the 18th century.
Translation: Aleksandra Malak
It looks innocent at the beginning: one or two e-mails a day. Spammers know that they cannot overdo it,
at least not at once. Deleting a few e-mails about a great business opportunity in Nigeria or a big lottery win doesn’t
cause major problems. But companies engaged in “internet promotion” exchange the addresses they hold, and with time
the number of letters increases. When 100-200 letters a day arrive in a mailbox, deleting spam becomes a forced morning
ritual. One can get used to anything.
Nerves finally snap one fine morning when, while fighting the
flood of unwanted correspondence, we erase the one important letter we were waiting for. Usually it is only then that we
realize that spam poses a real threat to a business: it increases the probability of overlooking an important
letter, it pulls people away from productive work and, on top of that, it clogs Internet connections and decreases the efficiency of
Fear of the black list
The oldest method of fighting spam is to inform the administrators of the network from which the spam was sent. It requires
a lot of effort on the administrator’s side: additional contact with the users of the mail system, analysis of the headers of
unwanted letters, searching for contact details of the administrators of other networks, and additional correspondence with
them. Some of these activities can be automated with suitable scripts. Still, it is very thankless work and, without some
form of automation, hard to sustain for long.
A simpler technique of fighting spam at the level of the mail
server is the use of so-called “black lists”: lists, compiled by volunteers and exchanged between administrators, of
domains whose badly configured networks make it possible to send spam. Being put on such a list is an embarrassment within
the community for the domain’s administrator, and usually a reason for immediate action. However, black lists
are not a cure for spam.
There is also a real danger connected with these lists: particular
addresses may sometimes land on a list accidentally or through somebody’s deliberate action. Careless use of black lists may
end with network access being cut off for users we care about, for example customers. Repeated
tests show that this method eliminates at most approximately 50% of spam, and most often less, in the range of
15-25%. Its attraction is that it is very simple to implement, which is why it can be treated as a first step in the fight
against spam. For that purpose it is worth using proven lists, for example The Spamhaus Block List.
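As an illustration only: many black lists are published through DNS, where a mail server asks about a sending address
by reversing its octets and appending the name of the list's zone. The sketch below assumes such a DNS-based list and
IPv4 addresses; the function names are hypothetical, the zone is passed in as a parameter, and treating an answer of the
form 127.0.0.x as "listed" is a common convention rather than a rule of any particular list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNS-based black list."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list answers with a 127.0.0.x address for this IP.

    Assumption: a failed lookup is treated as "not listed", so a network
    problem lets mail through rather than blocking it.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.0.0.")


# Building the query name needs no network access:
name = dnsbl_query_name("192.0.2.1", "zen.spamhaus.org")
```

Because a wrong listing cuts off legitimate senders, a cautious setup might use the answer only to raise suspicion
rather than to reject mail outright.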
As a rule ineffective
A technique used successfully for fighting spam is filtering
the headers and content of letters at the level of the mail gateway or mail server. There are filters that use signatures
(patterns) and filters that analyze the content of a letter according to some general criteria (rule-based scanners,
sometimes called heuristic).
Currently, highly specialized companies deal with creating the patterns. For
example, the Brightmail company supports over 2 million mailboxes, which receive about 2 billion spam messages a day. For every
such letter a signature is generated, which is then put into a database available to clients. If a client gets a letter
similar to the one described by a signature, it is marked as spam. Of course, spammers are aware of this and
use different methods to deceive the filters (for example, they put a random word into a letter to make it differ from
the pattern contained in the signature). Spam keeps flooding us in a wide stream, driven by the spammers’ creativity, and the
creators of anti-spam software keep generating new signatures.
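The weakness described above can be shown with a deliberately naive sketch. It assumes, purely for illustration, that a
signature is an exact hash of the normalized text; commercial products are not documented here and presumably use
fuzzier matching, so the function names and the sample text below are hypothetical.

```python
import hashlib


def signature(text: str) -> str:
    """Hypothetical signature: SHA-256 of the lower-cased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The "database" of signatures generated from letters already known to be spam.
known_spam = {signature("You WON a big lottery!  Claim your prize now")}


def is_known_spam(text: str) -> bool:
    """Mark a letter as spam only if its signature is already in the database."""
    return signature(text) in known_spam


original = "you won a big lottery! claim your prize now"
disguised = original + " xqzt"  # one random word appended by the spammer

caught = is_known_spam(original)    # same letter, different spacing and case
slipped = is_known_spam(disguised)  # the random word changes the signature
```

With an exact signature the disguised copy is no longer recognized, which is why new signatures have to be generated
continually.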
This race seems to have no end, just
as we cannot see an end to people’s zeal for writing new viruses. Considering that the effectiveness of this
kind of filter is estimated at 50-70%, it seems that in this race the spammers are still one step ahead. Such filters require
a permanent, suitable infrastructure to be maintained for creating signatures, which in consequence makes these solutions quite
expensive. The Brightmail solution mentioned before costs 1500 USD a year for a 50-user license. Their strong
point is that in practice there is no possibility of a legitimate letter being classified by the software as spam. This is a
very important factor for companies that, in order to operate properly, need to establish contacts with new clients by
electronic mail.
Rule-based filters work differently. Most often they do not keep any database that would have to
be updated continually. Instead, they analyze the content of every letter according to certain criteria. The filters
check, for example, whether the letter has been written in capital letters, whether it contains key words that appear most
often in spam, like “YOU WON”, “SEX”, “VIAGRA”, to how many people it is addressed,
whether it includes links to external sites, and to what kind of sites, and so on. Every rule has an assigned weight,
expressed as a number of points. If a letter collects a specified number of points (for example, it contains “suspicious”
key words and links to external sites), it is marked as spam. The effectiveness of such filters is quite satisfactory:
it oscillates around 90-95%. This applies only to the best software of this type, like the very popular Spam-Assassin,
developed as open source.
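The scoring scheme described above might be sketched as follows. This is not the rule set of any real product: the rule
names, the point values and the threshold are assumptions chosen only to show how weighted rules add up to a decision.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    points: float
    matches: Callable[[str, int], bool]  # (body, number of recipients) -> bool


def mostly_capitals(body: str, recipients: int) -> bool:
    """True if more than 70% of the letters in the body are upper case."""
    letters = [c for c in body if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.7


# Hypothetical rules and weights, one per criterion named in the text.
RULES = [
    Rule("capitals", 1.5, mostly_capitals),
    Rule("keywords", 2.5,
         lambda body, n: re.search(r"\b(you won|sex|viagra)\b", body, re.I) is not None),
    Rule("many_recipients", 1.0, lambda body, n: n > 20),
    Rule("external_link", 1.5,
         lambda body, n: re.search(r"https?://", body, re.I) is not None),
]
THRESHOLD = 5.0  # assumed number of points at which a letter is marked as spam


def score(body: str, recipients: int) -> float:
    """Add up the points of every rule that matches the letter."""
    return sum(rule.points for rule in RULES if rule.matches(body, recipients))


def is_spam(body: str, recipients: int) -> bool:
    return score(body, recipients) >= THRESHOLD
```

For example, `is_spam("YOU WON A MILLION! SEE HTTP://EXAMPLE.COM", 50)` matches all four rules, while a short note
to one colleague matches none.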
Unfortunately, there is no rose without thorns. It turns out that this kind of filter has
a tendency to mark as spam letters that do not belong in that category. The share of wrongly classified letters exceeds
0.5%, which in most cases is an unacceptable result, because it means “cutting out” legitimate correspondence.
Additionally, because the software uses fixed, known algorithms, spammers can simply check, before sending a given letter,
whether it gets through the available filters. The chances of deceiving a filter are therefore quite high.
Centralized and distributed
Central filtering of mail on a server, which often serves a few hundred
or a few thousand mail accounts, cannot however be considered effective. It is easy to imagine that essentially every
user deals with a different “kind” of letters. Users may, for example, use different languages, subscribe to discussion
lists, send many attachments, or often have links to other pages in the content of their letters. In other words, it is
impossible to configure a central mail filter in a way that takes all possible preferences into consideration. Fortunately,
many packages make it possible to expose the tools to users for their own independent configuration.
Spam-Assassin, mentioned before, offers some configuration options. If, for example, our mail contains many links to
WWW sites, we can decrease the number of points assigned to that kind of letter and so decrease
the chance that an important letter will be taken for spam. Similarly, if we never use any language other
than Polish, we can increase the number of points given to letters containing words in other languages. Thanks
to these adjustments the effectiveness of the filter increases in a very simple way. Independently of that, we can
create new rules of our own. In the Spam-Assassin package we can do it with regular expressions.
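The idea of per-user tuning can be illustrated with a few lines of Python. This is not Spam-Assassin's configuration
format; the rule names, the default weights and the custom pattern are hypothetical and serve only to show weights being
overridden and a user-defined regular-expression rule being added.

```python
import re

# Assumed server-wide defaults: points given when a rule matches.
DEFAULT_WEIGHTS = {"contains_link": 1.5, "foreign_language": 1.0}


def user_weights(overrides: dict) -> dict:
    """Combine the defaults with one user's own adjustments."""
    return {**DEFAULT_WEIGHTS, **overrides}


# A user whose normal mail is full of links and who writes only in Polish:
weights = user_weights({"contains_link": 0.5, "foreign_language": 3.0})

# A rule the user defines personally: a regular expression and its points.
CUSTOM_RULES = [(re.compile(r"\bcheap\s+loans?\b", re.IGNORECASE), 2.0)]


def custom_score(body: str) -> float:
    """Points contributed by the user's own regular-expression rules."""
    return sum(points for pattern, points in CUSTOM_RULES if pattern.search(body))
```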
Producers realized a long time ago that these
possibilities of adapting to the user’s preferences are the key to the quality of anti-spam solutions,
and essentially all the main systems of this type are heading in exactly that direction. The ideal
would be software that learns by itself which letters the user considers spam and which are ordinary
mail. Best of all would be software that requires as little configuration, and as little action from the user, as possible.
Utopia? It appears not. The latest anti-spam software uses neural networks in its work. These solutions
have some learning capabilities and can predict results on the basis of information gained earlier. This
approach, combined with the tools of statistical analysis, works extremely well in the fight against spam. And so we come to
the next type of anti-spam software, so-called Bayesian filtering, named after Thomas Bayes, the 18th-century statistician.
The idea of using statistical analysis for filtering isn’t new. Its roots go
back to 1996. However, only recently has the work on algorithms suitable for filtering spam succeeded. In fact,
the real explosion of this kind of software started at the end of 2003. The latest versions of programs that use
filtering based on Bayes’ concept have an effectiveness of 99.98%! Additionally, the number of “good”
letters classified by these filters as spam is practically zero. Where do such good results come from?
The user decides on his own
A filter based on the Bayes algorithm
works as follows. For every user, on the mail server or in the client software, two files are created. The first holds
information about letters recognized as spam, the second about all other letters. At the beginning the files are empty,
because the user must “teach” the software how to distinguish spam from legitimate mail. This process looks different
depending on the software we use.
In most programs installed on a server, after receiving a letter the user sends it back to his own account with a note
saying whether it is spam or a normal letter (that is how the free software called CRM114 works). Sometimes, instead of
sending the whole letter, it is enough to send its signature (that is how another package, also free, called Dspam works).
In the case of some commercial packages, like SpamBully (which costs approximately 30 USD per seat), one click is sufficient.
Regardless of how the user marks spam, all the programs work similarly.
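A minimal sketch of such a filter, under stated assumptions, is shown below. It keeps the two stores described above as
in-memory word counts instead of files and combines the words of a letter in the "naive" way, treating them as
independent. The class and method names, the tokenization and the smoothing are illustrative choices, not the
internals of CRM114, Dspam or SpamBully.

```python
import math
import re
from collections import Counter


class BayesFilter:
    """Two per-user stores: word counts seen in spam and in normal mail ("ham")."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.letters = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        """The distinct words of a letter, lower-cased."""
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        """Record one letter that the user marked as 'spam' or 'ham'."""
        self.counts[label].update(self.tokens(text))
        self.letters[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words in the letter) from the two stores."""
        if not all(self.letters.values()):
            return 0.5  # one store is still empty: stay undecided
        # Start from the share of spam in the mail seen so far.
        log_odds = math.log(self.letters["spam"] / self.letters["ham"])
        for word in self.tokens(text):
            # Add-one smoothing keeps unseen words from producing zero.
            p_spam = (self.counts["spam"][word] + 1) / (self.letters["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.letters["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp
        return 1 / (1 + math.exp(-log_odds))
```

Because the stores belong to one user, the same letter can receive different scores in different mailboxes, which is the
property the article relies on.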
The learning method used most often
by Bayes filters is called Train Only Errors (TOE). It means that only the letters that were classified wrongly
are shown to the program. Of course, at the beginning the program makes a lot of mistakes and using it may seem
troublesome. The question, then, is how soon we will notice the advantages of using a Bayes filter. And here is another surprise.
It turns out that after classifying a few dozen letters the program is right in over 90% of its decisions. Classifying
approximately 100-150 letters allows it to reach an effectiveness exceeding 95%. After a month of using the
program, over 99% of letters are classified correctly. Is it worth waiting that long?
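The Train Only Errors loop itself is short. In the sketch below the classifier is a deliberately simple stand-in
(hypothetical, not a Bayes filter), because the point is only the training policy: a letter is shown to the program
when, and only when, its current verdict disagrees with the user's.

```python
from collections import Counter


class TinyFilter:
    """Stand-in classifier: compares how often the words were seen in each class."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        spam_hits = sum(self.words["spam"][t] for t in tokens)
        ham_hits = sum(self.words["ham"][t] for t in tokens)
        return "spam" if spam_hits > ham_hits else "ham"

    def train(self, text, label):
        self.words[label].update(text.lower().split())


def train_only_errors(classifier, labelled_letters):
    """Train on a letter only if the classifier got it wrong; count the corrections."""
    corrections = 0
    for text, true_label in labelled_letters:
        if classifier.classify(text) != true_label:
            classifier.train(text, true_label)
            corrections += 1
    return corrections


letters = [
    ("cheap pills offer", "spam"),
    ("meeting on monday", "ham"),
    ("cheap offer today", "spam"),
    ("monday meeting notes", "ham"),
]
spam_filter = TinyFilter()
corrections = train_only_errors(spam_filter, letters)
```

In this toy run only the first letter needs a correction; the later spam letter is already caught thanks to the words
learned from that single mistake.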
Definitely yes! Once the filter
starts to distinguish mail from spam, its effectiveness will always remain high. More than that, because the program is
learning all the time, it keeps taking into account changes both in the way spam is composed and in the character of the
“proper” mail we write and receive. In other words, thanks to the hints received from the user, the program evolves with
his needs.
Not very efficient but effective
The files that store the information
needed to distinguish spam from normal mail usually take from a few to a few dozen megabytes. As long as the files are kept
on client computers, there are no problems with performance or disk capacity. However, storing files that big on a mail
server may make it necessary to buy extra disks. As for the necessary processor power, the quickest programs currently
check incoming mail at a rate of approximately 120 Kb/s (measured on a server with one Pentium III
1.4 GHz processor). For big server installations it seems necessary to build a farm that balances the load.
If Bayes filters are that effective, the question is how hard it is for spammers to construct a letter that would
pass our filter. Since this type of defense against spam is still rarely used, it is hard to draw conclusions
about its effectiveness in the future. One thing is certain: because filters of this kind are adjusted to the demands of
individual users, constructing a letter that would pass all filters is practically impossible. There
are some proposals that could increase the probability that a given letter passes a large group of filters, but all
of them require some kind of database describing what kind of letters should be sent to particular addresses. Considering
that only about 1 to 5 people respond to every million spam letters sent, maintaining such a database is simply
unprofitable for spammers. Beyond that, it is hard to expect that a user who voluntarily uses this kind of software would reply
to any offer sent that way. It means that this segment of the consumer market is unattractive to spammers.
It can already be seen that using filters of this kind is effective.
Over the last few months we can notice
a clear tendency to reduce the amount of information contained in the letters and to include links to
web sites instead. More and more often a spam letter contains a single “encouraging” sentence and a link that,
when clicked, opens the relevant commercial site. The effectiveness of such letters is lower than that of normal
spam. Implementations of Bayes filters appeared very recently, but their effectiveness increases practically day by day.
That is why there is a real chance of spam being limited.
Spam stays behind the door
What is the future of anti-spam software? The currently existing proposals, like introducing micro-payments or
requiring the sending computer to carry out many mathematical operations in order to slow down the sending of spam, need
changes to or further development of the SMTP protocol. This means that the new solutions would need new software. The
chance that the majority of Internet users would decide to use them is low.
Other ideas, like aggressive filters (filters that fight back),
do not have a bright future ahead either. They would work by having special automatons connect to the
WWW addresses included in spam and thereby overload the spammers’ servers. However, this technique carries the danger
of being used for the attacks known as Distributed Denial of Service.
Surely only Bayes filters give a real chance of an effective fight against spam.
If, a few months after their first implementations appeared, they already show
such high effectiveness, what effectiveness can we expect from the mature products of this type that may appear on the
market, for example, in a year?
On the other hand, who knows? Maybe by that time spammers will find new techniques
that allow them to get past this protection. In computer science it very often turns out that systems seemingly impossible
to break are, after some time, effectively deceived. Whether that is going to happen with Bayes filters we will probably
see within a few months. One thing is certain: the arsenal of measures against spammers has never had such an effective weapon.
Even if stopping spam at the “entrance” to the internal network, with Bayes filters or any other method,
becomes a permanent way of limiting spam, it won’t solve the basic problem, which is the very possibility of sending it.
Maybe some real changes in infrastructure are necessary. That, however, would depend on the decisions of the great ones of
the Internet, about whom Antonii Bielewicz wrote in his article “Mail headache” in CW 11/2004.
Classification according to Bayes
Thomas Bayes, an 18th-century British cleric,
worked out a method that, put simply, makes it possible to predict the probability of phenomena occurring in the future
on the basis of observations of how frequently they have occurred so far. The Bayes method is used, among other things, for
solving problems of sorting and classifying data with machine learning, and also in banking and insurance, that is,
everywhere risk appears. The latest application is the classification of e-mail messages as spam or not.
Bayes’ theorem says that for two events A and B, the probability of B occurring, given that A has occurred, is equal to:
P(B|A) = (P(B) * P(A|B)) / P(A) = P(A and B) / P(A),
where P(A) is the probability that A occurs (it must be greater than zero);
P(B) is the probability that B occurs;
P(A|B) is the probability that A occurs if B has already occurred;
P(A and B) is the probability that both A and B occur.
In reference to spam, B means that a letter is spam, and A means that the letter has the particular content we observe in it.
The formula above can then be read as follows: the probability that the new letter is spam is equal to the probability that
content like this appears in spam (established by comparison with the letters received so far and considered
spam), multiplied by the share of spam in the correspondence so far, and divided by the probability that content like this
appears in any letter (established by comparison with all the letters received so far, spam and normal alike).
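A small worked example may help. It takes B to be "the letter is spam" and A to be "the letter contains a given word",
and it uses invented numbers, not measurements: the word is assumed to appear in 40% of spam and in 0.5% of normal
letters, and spam is assumed to make up 60% of all mail. The function name is hypothetical.

```python
def posterior(p_a_given_b: float, p_b: float, p_a_given_not_b: float) -> float:
    """P(B|A) by Bayes' theorem.

    P(A) is expanded with the law of total probability:
    P(A) = P(A|B) * P(B) + P(A|not B) * (1 - P(B)).
    """
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a


# Assumed inputs: P(word | spam) = 0.40, P(spam) = 0.60, P(word | normal) = 0.005
p_spam_given_word = posterior(p_a_given_b=0.40, p_b=0.60, p_a_given_not_b=0.005)
# P(A) = 0.40 * 0.60 + 0.005 * 0.40 = 0.242, so P(B|A) = 0.24 / 0.242, about 0.99
```

Under these assumptions a single telling word already makes the letter almost certainly spam, and a real filter combines
the evidence of many words in the same way.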
In banking,
Bayes classification can be used for estimating the creditworthiness of clients. For example, two decision classes can
be created: the first contains the clients who do not pay their installments on time, the second the clients who meet
their payments conscientiously. When a new client appears, Bayes methods make it possible to estimate the probability of
the client belonging to each group, and, on the basis of observing how the credit is repaid, we can then correct our
assessment of how well the client was assigned to one of the two classes.
Catalogs of WWW sites in Internet search services, for example, work on a similar principle.
In this case suitable thematic groups are created (for example sport, news, entertainment), and the appropriate sites are
assigned to them.
99.98%: that is the effectiveness of filters based on the Bayes algorithm, and it is still growing! By comparison,
heuristic methods have an effectiveness of 90-95%, filters based on patterns 50-70%, and
black lists 15-25%.
Did you really send it?
Apart from Bayes filters, an interesting way to limit spam is a technique called Challenge-Response Filtering. In short,
it works like this: the filter sends a query to any sender who has never written to us before. Only after the sender
answers (most often he has to send the letter back) is the actual letter delivered to our inbox. This simple method
stops over 99.9% of spam, though it can be a little burdensome, especially for people who use their mailbox only sporadically.
Where possible, the Challenge-Response Filtering technique works better when it is used in parallel with other techniques.
The most advantageous solution seems to be to use a Bayes filter or rule-based heuristic content filtering for preliminary
mail sorting, and then to apply the Challenge-Response Filtering technique to the letters that have been marked as spam.
Thanks to that, we can be certain that no important message has been overlooked. The commercial software SpamBully, mentioned
before, works exactly that way. If we want to use open source software, we have to combine the functionality of two different
programs, for example Spam-Assassin and Active Spam Killer.
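The two-stage arrangement described above might look roughly like this. The sketch is an assumption about the flow, not
a description of how SpamBully or Active Spam Killer are built: the class, its fields and the `looks_like_spam`
callback (standing in for a Bayes or rule-based filter) are all hypothetical, and no real mail is sent.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    known_senders: set = field(default_factory=set)
    inbox: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)  # sender -> letters awaiting an answer
    challenges_sent: list = field(default_factory=list)

    def receive(self, sender, body, looks_like_spam):
        """Deliver directly unless an unknown sender's letter was marked as spam."""
        if sender in self.known_senders or not looks_like_spam(body):
            self.inbox.append((sender, body))
            return "delivered"
        # Second stage: hold the letter and ask the sender to confirm.
        self.pending.setdefault(sender, []).append(body)
        self.challenges_sent.append(sender)
        return "challenged"

    def challenge_answered(self, sender):
        """The sender replied to the query: trust them and release held letters."""
        self.known_senders.add(sender)
        for body in self.pending.pop(sender, []):
            self.inbox.append((sender, body))
```

A letter wrongly marked as spam is therefore delayed rather than lost: it reaches the inbox as soon as its sender
answers the query.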
Protect what you can
Users of mail systems can do quite a lot to limit the risk of their address landing on spammers’ mailing lists. In the
first place, it pays to require the use of aliases instead of real account names, especially if the address is to be publicly
available, for example on a WWW site. We can also advise against giving the address to casual acquaintances, though in
practice, for example in business, that is rather hard. It is certainly worth considering not giving it out on discussion lists.
All the measures mentioned above are effective only until we make a mistake, or until somebody
we trust gets the idea of sending us an electronic greeting card from a public service. The conclusion is that these are
auxiliary, supplementary measures, which in the long term will not get rid of the spam problem.
Software that uses Bayes filters
SpamBully ( http://www.spambully.com )
IronMail 4.0 ( http://www.CipherTrust.com )
Disruptor OL ( http://www.hlembke.de/prod/disruptor )
InBoxer ( http://www.inboxer.com )
InboxShield ( http://www.edovia.com/inboxshield )
PreciseMail ( http://www.process.com/precisemail )
Spam-Assassin ( http://www.spamassasin.org )
CRM114 ( http://www.crm114.sourceforge.net )
Dspam ( http://www.dspam.com )
POPFile ( http://popfile.sourceforge.net/cgi-bin/wiki.pl )
SpamTUNNEL ( http://uiorean.cluj.astral.ro )
SpamProbe ( http://spamprobe.sourceforge.net )
ifile ( http://www.nongnu.org/ifile/ )