sa-train

Suite of tools to build a Spamassassin Bayes database from a training dataset, resulting in a dump that can be distributed to multiple hosts using sa-learn --restore. The training can be run anywhere, as it runs its own instance of Spamassassin in isolation. This allows operators to provide a global Bayesian database for filtering purposes to a pool of separate servers, without introducing a runtime dependency on a single shared database.

These tools do not support per-user Bayes databases, but they do support a feedback channel that allows users to report misclassified emails back to the global database for retraining.

Since training data (especially ham messages) can contain personally identifying information, the tool supports encrypting data at rest using GPG. Training data will then be decrypted at runtime. Note that the final Spamassassin database dump is not encrypted, as it is impossible to retrieve the original messages from it.

Installation

The full sa-train suite does not need to be installed on your servers, as it can run autonomously anywhere (though preferably on a fast machine with lots of memory). It will install and run its own instance of Spamassassin in a container to populate the database, so the list of requirements is not long:

  • Python 2 (>2.5)
  • Docker
  • GnuPG
  • python-gnupg

The optional sa_report tool is meant as a replacement for sa-learn, and should be installed on every server where you want to offer feedback functionality to your users. It has the following requirements:

  • Python 2 (>2.5)
  • GnuPG
  • python-gnupg

Usage

There is no installation procedure yet, so it's recommended to run the training suite directly from the source directory:

$ git clone https://git.autistici.org/ai/sa-train.git
$ cd sa-train

Assuming you have collected some static training data, i.e. separate sets of spam and non-spam messages, create a directory to hold it (for instance, ~/training). Create two subdirectories therein, named spam and ham, and put your messages in there, saved in mbox format. Mailbox files should have a .mbox extension.
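The layout described above can also be prepared with a short script. This is a minimal sketch using only the Python standard library (written for Python 3 and independent of the sa-train code itself). The function name build_training_dir and the file name collected.mbox are illustrative assumptions; the only requirements taken from this README are the spam and ham subdirectories and the .mbox extension.

```python
import mailbox
from pathlib import Path


def build_training_dir(root, spam_messages, ham_messages):
    """Create root/spam and root/ham and store the messages in mbox files.

    spam_messages and ham_messages are iterables of raw RFC 822 messages
    (bytes). The file name "collected.mbox" is an arbitrary choice for
    this sketch. It assumes no other process writes to the same files,
    so no locking is done.
    """
    root = Path(root)
    for label, messages in (("spam", spam_messages), ("ham", ham_messages)):
        folder = root / label
        folder.mkdir(parents=True, exist_ok=True)
        box = mailbox.mbox(str(folder / "collected.mbox"))
        try:
            for raw in messages:
                box.add(raw)  # appends one message to the mbox file
            box.flush()
        finally:
            box.close()
    return root
```

Calling build_training_dir with a path such as ~/training (expanded by the caller) produces a directory that can be passed to the --source option.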

You can then start the training process with:

$ ./run.sh --source=$HOME/training

When the run finishes (it will take a while), you will find a file named sa.dat in the current directory. It contains the dump of the Bayes database, obtained with sa-learn --backup, and can be copied to a remote server and restored there with sa-learn --restore.
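How the dump reaches the servers is left to the operator. As one possible approach, the sketch below only builds the commands that would copy sa.dat to each host and restore it there. The use of scp and ssh, the host names, and the remote path /tmp/sa.dat are all assumptions made for illustration; nothing is executed by the function itself.

```python
import shlex


def distribution_commands(dump_path, hosts, remote_path="/tmp/sa.dat"):
    """Return the commands that would copy the dump to each host and restore it.

    remote_path and the choice of scp/ssh are assumptions of this sketch.
    """
    commands = []
    for host in hosts:
        # Copy the dump to the remote host.
        commands.append(["scp", dump_path, f"{host}:{remote_path}"])
        # Load it into the Bayes database on that host.
        commands.append(["ssh", host, "sa-learn", "--restore", remote_path])
    return commands


if __name__ == "__main__":
    # Hypothetical host names, printed for review rather than executed.
    for argv in distribution_commands("sa.dat", ["mx1.example.org", "mx2.example.org"]):
        print(shlex.join(argv))
```

To actually run the commands, each list can be passed to subprocess.run(argv, check=True) once the output has been reviewed.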

Data sources

The sa_train tool supports two kinds of data sources:

  • mbox files: sa_train expects a directory with two subdirectories, ham and spam, each containing one or more mailboxes. Mailbox files should be in mbox format and should have a .mbox extension. If the mboxes are encrypted, the extension should be .mbox.gpg.

  • IMAP folder: sa_train can read emails from a special IMAP mailbox used for user feedback reports. This mailbox will contain messages sent by the sa_report tool, each individually encrypted.
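The naming convention for mbox sources can be illustrated with a small directory scan. This is a sketch of the convention only, not the code sa_train uses; the function name find_mailboxes is hypothetical, and files with any other extension are simply skipped here.

```python
from pathlib import Path


def find_mailboxes(source_dir):
    """Yield (label, path, encrypted) for every mailbox under source_dir.

    label is "ham" or "spam", taken from the subdirectory name.
    encrypted is True for files ending in .mbox.gpg.
    """
    for label in ("ham", "spam"):
        folder = Path(source_dir) / label
        if not folder.is_dir():
            continue
        for path in sorted(folder.iterdir()):
            if path.name.endswith(".mbox.gpg"):
                yield label, path, True
            elif path.name.endswith(".mbox"):
                yield label, path, False
            # Any other file is ignored in this sketch.
```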

sa_train will read the sources containing training data and aggregate them. Since there's usually not much gain in training Spamassassin with more than a few thousand messages of each type, sa_train has a --sample option, which takes a value between 0 and 1, to randomly sample a subset of the available messages.
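To make the effect of a sampling fraction concrete, here is a minimal sketch in which each message is kept independently with the given probability. Whether sa_train samples this way or draws a fixed-size subset is not stated in this README, so treat the function (sample_messages, a hypothetical name) as an illustration of the idea rather than the tool's behavior.

```python
import random


def sample_messages(messages, fraction, seed=None):
    """Keep each message with probability `fraction` (between 0 and 1)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    # rng.random() is in [0, 1), so 1.0 keeps everything and 0.0 keeps nothing.
    return [m for m in messages if rng.random() < fraction]
```

With 50,000 available messages of one type, a fraction of 0.1 would keep roughly 5,000 of them.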

Training data selection

The training data set should contain comparable amounts of spam and non-spam messages, be representative of the traffic you expect to receive, and consist of relatively recent messages.
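A quick way to check the first of these criteria is to count the messages in each class before training. The sketch below uses the standard mailbox module on unencrypted mbox files; the function names are hypothetical, and it deliberately reports a ratio without deciding what counts as "comparable".

```python
import mailbox


def count_messages(mbox_paths):
    """Return the total number of messages in the given unencrypted mbox files."""
    total = 0
    for path in mbox_paths:
        # create=False raises an error instead of creating a missing file.
        box = mailbox.mbox(str(path), create=False)
        try:
            total += len(box)
        finally:
            box.close()
    return total


def imbalance(ham_count, spam_count):
    """Return how many times larger the bigger class is than the smaller one."""
    smaller, larger = sorted((ham_count, spam_count))
    if smaller == 0:
        return float("inf")
    return larger / smaller
```

A ratio close to 1 indicates comparable amounts; where to draw the line is a judgment left to the operator.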

More specific guidelines are available on the Spamassassin website.

User feedback

It is usually a good idea to have a channel for users to report misclassification of their email, so that the Bayes filter can be constantly tuned based on live traffic. This is especially true for multi-user setups, where collecting a representative sample of data might be difficult.

sa-train does not support real-time updates of the Bayes database, but it can collect user feedback reports (into an IMAP mailbox) and use them as a source on the next training run.

To do this, send messages to the sa_report tool, along with their desired classification (spam or non-spam). The tool will anonymize the email message, encrypt it with GPG, and send it to the feedback collection address.

For example, the following configuration snippet shows how to integrate sa_report with the dovecot-antispam plugin:

plugin {
    antispam_backend = pipe
    antispam_pipe_program = /usr/bin/sa_report
    antispam_pipe_program_spam_arg = --spam
    antispam_pipe_program_notspam_arg = --ham
}