NoSpam: An IP-agnostic blog spam filtering solution
===================================================

NoSpam is a service for real-time detection of blog comment spam. It
works without relying on the reputation of the submitter's IP address.
Most popular spam filtering services depend on that reputation, which
rules out anonymous publication.

The filtering is based on a mix of Bayesian classification, analysis of
external URLs, and a simple rule engine. We feel that its results are
accurate enough, even without having the poster's IP address available.

This software was heavily inspired by http://blogspam.net/ and offers a
compatible XML-RPC API.

The service is easily extensible through a plugin architecture: put your
code in the plugins/ directory and have it define a subclass of
nospam.plugin_base.BasePlugin.


API Description
---------------

The XML-RPC API is designed to be a functional subset of the one offered
by blogspam.net. Two methods are available:

* string testComment(struct data)

  The 'data' parameter can have the following attributes:

    'comment': the text of the comment (utf-8 encoded, MANDATORY)
    'site':    the URL of your website (unused for now)
    'agent':   the User-Agent supplied by the user, if any
    'email':   the email address supplied by the user, if any
    'link':    the homepage link supplied by the user, if any
    'name':    the name supplied by the user, if any

  Any other attributes will be ignored.

  The return value will be a string, formatted as follows:

    OK          - comment is not spam
    SPAM:[msg]  - comment is spam, msg will have details
    ERROR:[msg] - an error occurred while processing the comment

* string classifyComment(struct data)

  The 'data' parameter has two mandatory attributes:

    'comment': the text of the comment (utf-8 encoded)
    'train':   desired classification, either 'ok' or 'spam'

  The return value is a string, formatted as follows:

    OK          - classification was successful
    ERROR:[msg] - an error occurred


Training and testing
--------------------

The scripts in the train/ directory allow for training and testing the
service using pre-existing data. The data is expected to be a stream of
pickled Python dictionaries, each having the attributes accepted by the
testComment() method above, plus a 'train' attribute (as in
classifyComment()), which is required to train on the dataset.

A simple script to generate training data from a (multisite) WordPress
installation is provided in train/wpdump.py.
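
As an illustration of the data format described above, here is a minimal
sketch of writing and reading such a stream. It assumes that "a stream of
pickled dictionaries" means one pickle.dump() call per dictionary on the
same binary stream; the helper names and the sample records are
hypothetical and are not part of NoSpam.

```python
import io
import pickle


def write_comments(stream, comments):
    """Write each comment dictionary to the stream as its own pickle."""
    for comment in comments:
        pickle.dump(comment, stream)


def read_comments(stream):
    """Yield dictionaries from the stream until it is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return


# Hypothetical sample records, using the attribute names documented above.
samples = [
    {"comment": "Thanks, this was a helpful post.",
     "name": "alice",
     "train": "ok"},
    {"comment": "Cheap pills at http://example.com/",
     "link": "http://example.com/",
     "train": "spam"},
]

buffer = io.BytesIO()  # stands in for a file opened in binary mode
write_comments(buffer, samples)
buffer.seek(0)
loaded = list(read_comments(buffer))
```

If the scripts in train/ run under an older Python version than the one
producing the data, passing an explicit protocol to pickle.dump() (for
example protocol=2) may be needed so that the stream stays readable.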
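
To make the testComment() call from the API Description concrete, here is
a minimal client sketch using Python's standard xmlrpc.client module. The
endpoint URL and the helper names are assumptions made for illustration.
The result parser assumes only that the status and the message are
separated by the first colon, as in the return formats listed in that
section.

```python
import xmlrpc.client


def parse_result(result):
    """Split a result string such as 'SPAM:reason' into (status, detail)."""
    status, _, detail = result.partition(":")
    return status.strip(), detail.strip()


def check_comment(server, comment, **optional):
    """Call testComment() on the server and return (status, detail).

    'optional' may carry the documented optional attributes
    (site, agent, email, link, name); empty values are left out.
    """
    data = {"comment": comment}
    data.update({key: value for key, value in optional.items() if value})
    return parse_result(server.testComment(data))


# Hypothetical endpoint: adjust to wherever the service is listening.
# Creating the proxy does not open a connection by itself.
server = xmlrpc.client.ServerProxy("http://localhost:8000/")

# Example call (needs a running service, so it is left commented out):
# status, detail = check_comment(server, "Nice article!", name="alice")
# if status == "SPAM":
#     print("rejected:", detail)
```

The same pattern should apply to classifyComment(), with a 'train'
attribute set to 'ok' or 'spam' added to the struct.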