NoSpam: An IP-agnostic blog spam filtering solution
===================================================

NoSpam is a service for real-time detection of blog comment spam.

It does not rely on the reputation of the submitter's IP address.
Most of the popular spam filtering services depend on it, which
rules out anonymous publication.

The filtering is based on a mix of Bayesian classification, analysis
of external URLs, and a simple rule engine. We feel that its results
are accurate enough, even without having the poster's IP address
available.

This software was heavily inspired by http://blogspam.net/ and it
actually offers a compatible XML-RPC API.

The service is easily extensible using a plugin architecture: just put
your code in the plugins/ directory and create a subclass of
nospam.plugin_base.BasePlugin.



API Description
---------------

The XML-RPC API is designed to be a functional subset of the one
offered by blogspam.net. Two methods are available:

* string testComment(struct data)

  The 'data' parameter can have the following attributes:

    'comment': the text of the comment (utf-8 encoded, MANDATORY)
    'site': the URL of your website (unused for now)
    'agent': the User-Agent supplied by the user, if any
    'email': the email address supplied by the user, if any
    'link': the homepage link supplied by the user, if any
    'name': the name supplied by the user, if any

  Any other attributes will be ignored.  The return value will be a
  string, formatted as follows:

    OK - comment is not spam
    SPAM:[msg] - comment is spam, msg will have details
    ERROR:[msg] - an error occurred while processing the comment

* string classifyComment(struct data)

  The 'data' parameter has only two attributes, both mandatory:

    'comment': the text of the comment (utf-8 encoded)
    'train': desired classification, either 'ok' or 'spam'

  The return value is a string, formatted as follows:

    OK - classification was successful
    ERROR:[msg] - an error occurred
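
As an illustration of the two methods above, here is a minimal client
sketch using Python's standard xmlrpc.client module. The endpoint URL
and the helper names (parse_reply, check_comment, train_comment,
connect) are assumptions made for this example and are not part of
NoSpam itself. The sketch also assumes that the reply is the bare
status word, optionally followed by a colon and a message.

```python
import xmlrpc.client


def parse_reply(reply):
    """Split a reply such as 'SPAM:some detail' into (status, message)."""
    status, _, message = reply.partition(":")
    return status.strip(), message.strip()


def check_comment(proxy, comment, **optional):
    """Call testComment(); 'comment' is the only mandatory attribute."""
    data = {"comment": comment}
    # Optional attributes ('agent', 'email', 'link', 'name', 'site')
    # are sent only when they are non-empty.
    data.update({key: value for key, value in optional.items() if value})
    return parse_reply(proxy.testComment(data))


def train_comment(proxy, comment, label):
    """Call classifyComment() with the desired classification."""
    if label not in ("ok", "spam"):
        raise ValueError("label must be 'ok' or 'spam'")
    reply = proxy.classifyComment({"comment": comment, "train": label})
    return parse_reply(reply)


def connect(url="http://localhost:9001/"):
    """Return a proxy for the service (the URL is a hypothetical example)."""
    # No network traffic happens until a method is called on the proxy.
    return xmlrpc.client.ServerProxy(url)
```

With a running service, connect() followed by
check_comment(proxy, "Nice post!", name="Alice") would return a
(status, message) pair, where status is one of OK, SPAM or ERROR.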



Training and testing
--------------------

The scripts in the train/ directory allow for training and testing the
service using pre-existing data. It is expected that this data will be
formatted as a stream of pickled Python dictionaries, each having the
attributes required by the testComment() method above, plus a 'train'
attribute (as in classifyComment()) which is required to train on the
dataset.
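
The data format described above can be sketched as follows: each
dictionary is pickled in turn onto the same stream, and is read back
until the end of the stream is reached. The function names and the
sample comments are made up for this example, and whether the scripts
in train/ expect a particular pickle protocol is not covered here.

```python
import io
import pickle


def write_samples(stream, samples):
    """Append each dictionary to the stream as a separate pickle."""
    for sample in samples:
        pickle.dump(sample, stream)


def read_samples(stream):
    """Yield dictionaries from the stream until it is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            # pickle.load() raises EOFError at the end of the stream.
            return


samples = [
    {"comment": "Thanks, this was helpful.", "name": "Alice", "train": "ok"},
    {"comment": "Cheap pills at example.com", "link": "http://example.com/",
     "train": "spam"},
]

# An in-memory buffer stands in for a file opened in binary mode.
buf = io.BytesIO()
write_samples(buf, samples)
buf.seek(0)
loaded = list(read_samples(buf))
```

Pickle data should only be loaded from trusted sources, since
unpickling can execute arbitrary code.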

A simple script to generate training data from a (multisite) WordPress
installation is provided in train/wpdump.py.