CADENCE's Bayes_filtering

Bayes eternally alive

5 April 2004

So far, methods of fighting spam have proved ineffective from the point of view of mail system users. It seems, however, that the solution to the problem lies in statistical methods devised by Thomas Bayes in the 18th century.

Author: Tomasz Grabowski

Translation: Aleksandra Malak

It starts innocently: one or two e-mails a day. Spammers know that they cannot overdo it, at least not at once. Deleting a few e-mails about a great business opportunity in Nigeria or a big lottery win causes no major problems. But companies engaged in “internet promotion” exchange the addresses they hold, and with time the number of letters grows. When 100-200 letters a day arrive in a mailbox, deleting spam becomes a forced morning ritual; one can get used to anything.

Nerves snap one fine morning when, while fighting the flood of unwanted correspondence, we delete the one important letter we were waiting for. Usually only then do we realize that spam poses a real threat to a business: it increases the probability of overlooking an important letter, it pulls people away from productive work and, on top of that, it clogs internet connections and reduces the efficiency of mail servers.

Fear of the black list

The oldest method of fighting spam is to inform the administrators of the network from which the spam was sent. It requires a lot of effort on the administrator’s side: additional contact with the users of the mail system, analysis of the headers of unwanted letters, finding contact details for the administrators of other networks, and further correspondence with them. Some of these activities can be automated with suitable scripts. Still, it is thankless work and, without some form of automation, hard to sustain for long.

A simpler technique for fighting spam at the mail server level is the use of so-called “black lists”: lists, compiled by volunteers and exchanged between administrators, of domains whose badly configured networks make sending spam possible. Appearing on such a list is an embarrassment for a domain’s administrator among his peers and is usually a reason for immediate action. However, black lists are not a cure for spam.

There is also a real danger connected with these lists: particular addresses may end up on them by accident or through somebody’s deliberate action. Careless use of black lists may end with access to the network being cut off for users we care about, for example customers. Repeated tests show that this method eliminates at most approximately 50% of spam, and most often less, in the range of 15-25%. Its attraction is that it is very simple to implement, which is why it can be treated as a first step in the fight against spam. For that purpose it is worth using proven lists, for example The Spamhaus Block List.
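Black lists of this kind are commonly published over DNS: the mail server reverses the octets of the sender’s IP address, appends the name of the list’s zone, and checks whether that name resolves. The sketch below shows only this mechanism. The zone name `sbl.spamhaus.org` and the helper names are assumptions made for illustration, and treating any DNS answer as “listed” is a simplification; a real deployment should follow the list operator’s usage terms and return-code conventions.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="sbl.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address (zone is an assumption)."""
    addr = ipaddress.IPv4Address(ip)          # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone="sbl.spamhaus.org"):
    """Return True if the query name resolves (simplified: any answer counts)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False                          # no record: the address is not listed
    return True


# Only the name construction is shown here; is_listed() needs network access.
print(dnsbl_query_name("192.0.2.10"))
```

Because the check is a single DNS lookup, it is cheap enough to run before the letter is even accepted, which fits the role of a “first step” described above.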

Ineffective as a rule

A technique used successfully against spam is filtering the headers and content of letters at the mail gateway or mail server. There are filters that use signatures (patterns) and filters that analyze the content of a letter against general criteria (rule-based scanners, sometimes called heuristic).

Currently, highly specialized companies handle the creation of patterns. For example, Brightmail supports over 2 million mailboxes, which receive about 2 billion spam messages daily. A signature is generated for every letter and placed in a database available to clients. If a client receives a letter similar to the one described by a signature, it is marked as spam. Of course, spammers are aware of this and use various methods to deceive the filters (for example, they insert a random word into a letter to make it differ from the pattern contained in the signature). Spam keeps flooding us in a wide stream, driven by the spammers’ creativity, while the makers of anti-spam software keep generating new signatures.

This race seems to have no end, just as we cannot see the end of people’s zeal for writing new viruses. Given that the efficiency of this kind of filter is estimated at 50-70%, it seems that the spammers are still one step ahead in the race. The filters require a permanent, suitable infrastructure for producing signatures, which makes these solutions quite expensive. The Brightmail solution mentioned above costs 1500 USD a year for a 50-user license. Their strong point is that in practice there is no possibility of a legitimate letter being classified by the software as spam. This is a very important factor for companies that, to operate properly, need to establish contacts with new clients by electronic mail.
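Why a single random word is enough to slip past a signature can be seen in a deliberately simplified sketch. It assumes that a signature is just a SHA-256 hash of the normalized text; commercial products use fuzzier, proprietary schemes, so the function names and the sample letter here are purely illustrative.

```python
import hashlib


def signature(body):
    """Hash of the letter after lower-casing and collapsing whitespace (an assumed scheme)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical signature database holding one known spam letter.
known_spam = {signature("Buy cheap watches today")}


def matches_signature(body):
    return signature(body) in known_spam


print(matches_signature("BUY   cheap watches TODAY"))      # same text, different layout
print(matches_signature("Buy cheap watches today xqzt"))   # one random word appended
```

The first letter still matches because normalization removes the cosmetic differences; the second does not, because any change in content produces a completely different hash.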

Rule-based filters work differently. Most often they do not keep any database that has to be updated continuously. Instead, they analyze the content of every letter against a set of criteria. The filters check, for example, whether the letter is written in capital letters, whether it contains key words that appear most often in spam, such as “YOU WON”, “SEX” or “VIAGRA”, how many people it is addressed to, whether it contains links to external sites and to what kind of sites, and so on. Every rule has an assigned weight, expressed as a number of points. If a letter collects a specified number of points (for example, it contains “suspicious” key words and links to external sites), it is marked as spam. The efficiency of these filters is quite satisfactory: it ranges between 90% and 95%. This applies only to the best software of this type, such as the very popular SpamAssassin, developed as open source.

Unfortunately, there is no rose without thorns. It turns out that filters of this kind tend to mark as spam letters that do not belong to that category. The share of wrongly classified letters exceeds 0.5%, which in most cases is unacceptable, because legitimate correspondence gets “cut out”. Additionally, because the software uses specific algorithms, spammers can simply check, before sending a given letter, whether it gets through the available filters. The chances of deceiving a filter are considerable.
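The point-scoring scheme described above can be sketched in a few lines. The rules, weights and threshold below are invented for illustration and are not taken from SpamAssassin or any other product.

```python
import re

# Hypothetical key words, weights and threshold, chosen only for this example.
SUSPICIOUS_WORDS = ("you won", "sex", "viagra")
THRESHOLD = 5.0


def mostly_capitals(body):
    """True if more than 70% of the letters in the body are upper-case."""
    letters = [ch for ch in body if ch.isalpha()]
    return bool(letters) and sum(ch.isupper() for ch in letters) / len(letters) > 0.7


def score_letter(body, recipients):
    """Add up the points of every rule that fires for this letter."""
    score = 0.0
    lowered = body.lower()
    if mostly_capitals(body):
        score += 2.0
    # Crude substring match: 2.5 points for each suspicious key word present.
    score += 2.5 * sum(word in lowered for word in SUSPICIOUS_WORDS)
    if re.search(r"https?://", lowered):
        score += 1.5                          # contains a link to an external site
    if len(recipients) > 20:
        score += 1.0                          # addressed to many people at once
    return score


def is_spam(body, recipients):
    return score_letter(body, recipients) >= THRESHOLD


print(score_letter("YOU WON CHEAP VIAGRA http://example.com", ["me@example.com"]))
print(is_spam("Please find the meeting agenda attached.", ["me@example.com"]))
```

No single rule reaches the threshold alone; a letter is marked only when several suspicious features occur together, which is the behaviour the paragraph describes.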

Centralized and distributed

However, central filtering of mail on a server, which often serves a few hundred or a few thousand mail accounts, cannot be considered efficient. It is easy to imagine that essentially every user deals with a different “kind” of letters. Users may, for example, write in different languages, subscribe to discussion lists, send many attachments, or often include links to other pages in their letters. In other words, it is impossible to configure a central mail filter so that it takes all possible preferences into account. Fortunately, many packages allow tools to be made available to users so that they can configure the filter themselves.

SpamAssassin, mentioned above, offers some configuration options. If, for example, our mail contains many links to WWW sites, we can lower the number of points assigned to such letters and so reduce the chance that an important letter is taken for spam. Similarly, if we never use any language other than Polish, we can raise the number of points given to letters containing words in other languages. These measures increase the effectiveness of the filter in a very simple way. We can also create new rules of our own. In the SpamAssassin package this is done with regular expressions.

Vendors have long known that the ability to adapt to the user’s preferences is the key to the quality of anti-spam solutions, and essentially all major systems of this type are heading in exactly that direction. The ideal would be software that learns on its own which letters the user considers spam and which are ordinary mail. Best of all would be software requiring as little configuration and as little action from the user as possible.

Utopia? Apparently not. The latest anti-spam software uses neural networks. These solutions have some learning capability and can predict results on the basis of information gained earlier. This approach, combined with the tools of statistical analysis, works very well in the fight against spam. And so we come to the next type of anti-spam software, so-called Bayesian filtering, named after Thomas Bayes, the 18th-century pioneer of statistical analysis.

The idea of using statistical analysis for filtering is not new. Its roots go back to 1996. However, only recently has the development of algorithms that work well for filtering spam succeeded. The real explosion of this kind of software began at the end of 2003. The latest versions of programs using filtering based on Bayes’s concept reach an efficiency of 99.98%! Additionally, the number of “good” letters classified by these filters as spam is practically zero. Where do such good results come from?
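The two kinds of adjustment mentioned here, changing the weight of an existing rule and adding a rule built on a regular expression, can be sketched as follows. This is not SpamAssassin’s configuration syntax; the rule names, weights and the lottery pattern are hypothetical and serve only to show the idea.

```python
import re

# Hypothetical site-wide weights for two existing rules.
DEFAULT_WEIGHTS = {"has_links": 1.5, "foreign_language": 1.0}


def user_weights(overrides):
    """Return the weights for one user: site defaults with the user's changes applied."""
    weights = dict(DEFAULT_WEIGHTS)           # copy, so the defaults stay untouched
    weights.update(overrides)
    return weights


# A user-defined rule: name -> (compiled regular expression, points).
CUSTOM_RULES = {
    "mentions_lottery": (re.compile(r"\blotter(y|ies)\b", re.IGNORECASE), 3.0),
}


def custom_rule_points(body):
    """Sum the points of every user-defined rule whose pattern occurs in the body."""
    return sum(points for pattern, points in CUSTOM_RULES.values() if pattern.search(body))


# A user who often receives links, and who writes only in Polish:
print(user_weights({"has_links": 0.5, "foreign_language": 2.5}))
print(custom_rule_points("You won the LOTTERY"))
```

Keeping the overrides separate from the defaults means each user can tune the filter without affecting anyone else on the same server.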

The user decides on his own

A filter based on the Bayes algorithm works as follows. For every user, two files are created on the mail server or in the client software. The first holds information about letters recognized as spam, the second about all other letters. At the beginning the files are empty, because the user has to “teach” the software how to distinguish spam from legitimate mail. This process looks different depending on the software we use.

In most server-side programs, after receiving a letter the user sends it back to his own account with a note saying whether it is spam or a normal letter (this is how the free software CRM114 works). Sometimes, instead of sending the whole letter, it is enough to send its signature (this is how another free package, Dspam, works). With some commercial packages, such as SpamBully (about 30 USD per seat), one click is sufficient. Regardless of how the user marks spam, all the programs work similarly.

The training method used most often by Bayes filters is called Train Only Errors (TOE). It means that only the letters the program has misclassified are shown to it. Of course, at the beginning the program makes a lot of mistakes and using it may seem troublesome. The question, then, is how soon we will notice the advantages of using a Bayes filter. And here is another surprise. It turns out that after classifying a few dozen letters, the program is right in over 90% of its decisions. Classifying approximately 100-150 letters brings the effectiveness above 95%. After a month of use, over 99% of letters are classified correctly. Is it worth waiting that long?

Definitely yes! Once the filter starts to distinguish mail from spam, its efficiency remains high. Moreover, because the program is learning all the time, changes both in the way spam is constructed and in the character of the “proper” mail sent and received are taken into account on an ongoing basis. In other words, thanks to the hints received from the user, the program evolves with his needs.
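The two per-user stores and the Train Only Errors method can be sketched with a small naive Bayes word counter. This is a minimal illustration under stated assumptions, not the algorithm of CRM114, Dspam or SpamBully: the class name, the tokenizer and the add-one smoothing are choices made for this example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    def __init__(self):
        # Two stores per user: word counts from spam (True) and from other mail (False).
        self.tokens = {True: Counter(), False: Counter()}
        self.letters = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, is_spam):
        self.tokens[is_spam].update(self.tokenize(text))
        self.letters[is_spam] += 1

    def spam_log_odds(self, text):
        """Log of P(spam | words) / P(not spam | words), with add-one smoothing."""
        vocab = len(set(self.tokens[True]) | set(self.tokens[False])) + 1
        totals = {c: sum(self.tokens[c].values()) for c in (True, False)}
        score = math.log((self.letters[True] + 1) / (self.letters[False] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens[True][tok] + 1) / (totals[True] + vocab)
            p_ham = (self.tokens[False][tok] + 1) / (totals[False] + vocab)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0

    def train_on_error(self, text, is_spam):
        """Train Only Errors: learn from a letter only if it was misclassified."""
        if self.is_spam(text) == is_spam:
            return False                      # the filter was right, nothing to learn
        self.train(text, is_spam)
        return True


f = TinyBayesFilter()
print(f.train_on_error("win money now", True))               # empty filter errs, so it learns
print(f.train_on_error("win money now", True))               # now classified correctly
print(f.train_on_error("meeting agenda for monday", False))  # already right, not stored
```

Because only mistakes are stored, the two stores stay small and the user’s effort shrinks as the filter improves, which matches the learning curve described above.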

Not very efficient but effective

The files storing the information that allows spam to be distinguished from normal mail usually take from a few to a few dozen megabytes. As long as the files are kept on client computers, there are no problems with performance or disk capacity. However, storing such large files on a mail server may make it necessary to buy extra disks. As for the required processor power, the fastest programs currently check incoming mail at a rate of approximately 120 Kb/s (measured on a server with one Pentium III 1.4 GHz processor). For large server installations it seems necessary to build a farm that balances the load.

If Bayes filters are that effective, the question is how hard it is for spammers to construct a letter that would pass our filter. Since this type of defense against spam is still rarely used today, it is hard to draw conclusions about its efficiency in the future. One thing is certain: the fact that filters of this kind are adjusted to the needs of individual users makes constructing a letter that would pass all filters practically impossible.

There are some proposals for increasing the probability that a given letter would pass a large group of filters, but all of them require some kind of database of what sort of letters should be sent to specific addresses. Given that only about 1 to 5 people respond per million spam letters sent, maintaining such a database is simply unprofitable for spammers. Beyond that, it is hard to expect a user who knowingly uses this kind of software to reply to any offer sent this way. This means that this segment of the consumer market is unattractive to spammers.

It is already evident that filters of this kind are effective.

Over the last few months we can observe a clear tendency to reduce the amount of information contained in letters and to include links to web sites instead. More and more often a spam letter contains a single sentence “of encouragement” and a link which, when clicked, opens the relevant commercial site. Such tactics are less effective than normal spam. Implementations of Bayes filters appeared very recently, but their efficiency increases practically day by day. That is why there is a real chance of limiting spam.

Spam stays behind the door

What is the future of anti-spam software? Existing proposals, such as introducing micro-payments or requiring the sending computer to perform many mathematical operations in order to slow down the rate at which spam is sent, require changes to or extensions of the SMTP protocol. This means that the new solutions would need new software. The chance that the majority of Internet users would decide to use them is low.

Other ideas, such as aggressive filters (filters that fight back), do not have a bright future either. They would work by having special automata connect to the WWW addresses included in spam and thereby overload the spammers’ servers. However, this technique carries the danger of being used for attacks known as Distributed Denial of Service.

A real chance of fighting spam effectively is surely offered only by Bayes filters. If, just a few months after their first implementations appeared, they show such high effectiveness, what effectiveness can we expect from mature products of this type, which may appear on the market in, say, a year?

On the other hand, who knows? Maybe by that time spammers will find new techniques that allow them to get past these protections. In computer science it very often turns out that systems seemingly impossible to break are, after some time, effectively deceived. Whether that will happen to Bayes filters we will probably see within a few months. One thing is certain: the arsenal of means against spammers has never contained such an effective weapon.

Even if stopping spam at the “entrance” to the internal network, with Bayes filters or any other method, becomes a stable way of limiting spam, it will not solve the basic problem, which is that spam can be sent at all. Maybe real changes in infrastructure are necessary. That, however, would be up to the great powers of the internet, about which Antonii Bielewicz wrote in his article “Mail headache” in CW 11/2004.
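The proposal of making the sending computer do a lot of arithmetic can be illustrated with a hash puzzle in the spirit of Hashcash. The stamp format below (recipient address, a colon, and a counter, hashed with SHA-256) is invented for this sketch and is not the format of any real protocol.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_ok(recipient, nonce, difficulty_bits):
    """Cheap check done by the receiver: one hash computation."""
    digest = hashlib.sha256(f"{recipient}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def mint_stamp(recipient, difficulty_bits):
    """Costly search done by the sender: about 2**difficulty_bits hashes on average."""
    for nonce in count():
        if stamp_ok(recipient, nonce, difficulty_bits):
            return nonce


nonce = mint_stamp("bob@example.com", 12)
print(stamp_ok("bob@example.com", nonce, 12))
```

The asymmetry is the point: a sender of one letter barely notices the cost, while a sender of a million letters must pay it a million times, once per recipient.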

Classification according to Bayes

Thomas Bayes, an 18th-century British clergyman, developed a method which, put simply, allows the probability of future events to be predicted on the basis of observations of how often they have occurred so far. Bayes’s method is used, among other things, for solving data sorting and classification problems with machine learning, and also in banking and insurance, that is, wherever risk appears. The latest application is the classification of e-mail messages as spam or not.

Bayes’s theorem says that for two events A and B, with P(A) greater than zero, the probability that B occurs given that A has occurred is equal to:

P(B|A) = (P(B) * P(A|B)) / P(A) = P(A and B) / P(A),

where P(A) is the probability that A occurs;
P(B) is the probability that B occurs;
P(A|B) is the probability that A occurs given that B has already occurred;
P(A and B) is the probability that both A and B occur.

In the case of spam, B is the event that a letter is spam and A is the event that the letter has a given content, for example that it contains a particular word. The formula above can then be read as follows: the probability that a new letter with this content is spam equals the probability of finding this content in spam (estimated from the letters received so far and classified as spam), multiplied by the share of spam in the correspondence so far, and divided by the probability of finding this content in any letter, spam or not.
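A worked example may help. One common way to apply the formula, assumed here, is to take B as “the letter is spam” and A as “the letter contains a given word”. The figures are invented: 40% of past mail was spam, the word appeared in 25% of spam letters and in 1% of normal letters.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) from Bayes's theorem, with P(word) expanded over both classes."""
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1.0 - p_spam)
    return p_word_given_spam * p_spam / p_word


# P(word) = 0.25 * 0.4 + 0.01 * 0.6 = 0.106, so P(spam | word) = 0.1 / 0.106
print(round(posterior_spam(0.4, 0.25, 0.01), 3))   # prints 0.943
```

A single telling word thus raises the estimate from the 40% base rate to about 94%; real filters combine the evidence of many words in the same way.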

In banking, Bayes classification can be used to estimate the creditworthiness of clients. For example, two decision classes can be created: the first contains clients who do not pay their installments on time, the second those who make their payments conscientiously. When a new client appears, Bayes methods make it possible to estimate the probability of the client belonging to each group, and by observing how the loan is repaid we can revise the assessment of the assignment to one of the two classes.

Catalogs of WWW sites in internet search services work on a similar principle. In this case suitable thematic groups are created (for example sport, news, entertainment) and matching sites are assigned to them.

99.98%: that is the effectiveness of filters based on the Bayes algorithm, and it is still growing! By comparison, heuristic methods have an efficiency of 90-95%, pattern-based filters 50-70%, and black lists 15-25%.

Did you really send it?

Independently of Bayes filters, an interesting way to limit spam is a technique called Challenge-Response Filtering. In short, the filter sends a query to any sender who has never written to us before. Only after the sender answers (most often he has to send the letter back) is the original letter delivered to our inbox. This simple method stops over 99.9% of spam, though it can be somewhat burdensome, especially for people who use their mailbox only occasionally.

Where possible, Challenge-Response Filtering works better when used alongside other techniques. The most advantageous solution seems to be to use a Bayes filter or rule-based heuristic content filtering for preliminary sorting of mail, and then to apply Challenge-Response Filtering to the letters that have been marked as spam. This gives certainty that no important message has been overlooked. The commercial SpamBully software mentioned above works in exactly this way. If we want to use open source software, we have to combine the functionality of two different programs, for example SpamAssassin and Active Spam Killer.
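The challenge-response exchange can be sketched as a small state machine: letters from unknown senders are held until the challenge is answered. The class and method names are hypothetical, and sending the actual query e-mail is left out; the returned token stands in for it.

```python
import secrets


class ChallengeResponseFilter:
    def __init__(self):
        self.known_senders = set()
        self.pending = {}                     # token -> (sender, held letter)
        self.inbox = []

    def receive(self, sender, letter):
        """Deliver mail from known senders; hold the rest and return a challenge token."""
        if sender in self.known_senders:
            self.inbox.append(letter)
            return None
        token = secrets.token_hex(8)          # would be mailed to the sender as the query
        self.pending[token] = (sender, letter)
        return token

    def confirm(self, token):
        """Handle the sender's answer; returns True if the token was valid."""
        if token not in self.pending:
            return False
        sender, letter = self.pending.pop(token)
        self.known_senders.add(sender)        # future letters are delivered directly
        self.inbox.append(letter)
        return True


crf = ChallengeResponseFilter()
token = crf.receive("anna@example.com", "Offer for your company")
print(crf.inbox)                              # still empty: the letter is being held
crf.confirm(token)
print(crf.inbox)                              # delivered after the answer arrived
```

Automated bulk senders rarely answer such queries, which is why the held letters of spammers simply never reach the inbox.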

Protect what you can

Users of mail systems can do quite a lot to limit the risk of their address landing on spammers’ mailing lists. First of all, it pays to require the use of aliases instead of real account names, especially if the address is to be publicly available, for example on a WWW site. We can also advise against giving the address to casual acquaintances, though in practice, for example in business, this is rather hard. It is certainly worth considering not giving it out on discussion lists.

All the measures mentioned above are effective only until we make a mistake, or until somebody we trust gets the idea of sending us an internet greeting card from a public service. The conclusion is that these are auxiliary, supplementary measures, which in the long term will not get rid of the spam problem.

Packages that use Bayes filters

Commercial

    SpamBully ( http://www.spambully.com )
    IronMail 4.0 ( http://www.CipherTrust.com )
    Disruptor OL ( http://www.hlembke.de/prod/disruptor )
    InBoxer ( http://www.inboxer.com )
    InboxShield ( http://www.edovia.com/inboxshield )
    PreciseMail ( http://www.process.com/precisemail )

Open source

    SpamAssassin ( http://www.spamassassin.org )
    CRM114 ( http://www.crm114.sourceforge.net )
    Dspam ( http://www.dspam.com )
    POPFile ( http://popfile.sourceforge.net/cgi-bin/wiki.pl )
    SpamTUNNEL ( http://uiorean.cluj.astral.ro )
    SpamProbe ( http://spamprobe.sourceforge.net )
    ifile ( http://www.nongnu.org/ifile/ )