Commit e4038ef4 by ale

added a README

1 parent 5d9cf5d7
Showing with 128 additions and 1 deletions
A suite of ML-based tools to detect accounts compromised by spammers,
based on mailserver log analysis.
The tools contained in this repository are:
- *loganalyzer*: parses mail.log syslog files, reconstructs mail flows,
and produces per-user aggregates of a number of interesting metrics
- *account_intelligence/*: a [Tensorflow](
neural network that is able to classify users in the above output as
having been compromised or not.
This particular implementation currently only works with Postfix logs.
## Implementation details
The *loganalyzer* binary computes a number of metrics (*features*) for
each user, which are then used to classify the user by the NN. These
features include, for the examined time interval:
- how many messages were sent
- how many messages were sent to "freemail" domains
- how many messages were sent unsuccessfully (bounced, due to spam
detection on the receiving end, or other causes)
- message counts for the top 10 destination domains
- how many times internal rate-limiting systems were triggered
The basic idea is that compromised accounts will try to send large
numbers of messages to a very generic set of domains (as their targets
are usually pulled out of huge lists of addresses), while normal users
will have a much more focused communication pattern.
The other signals are included because they empirically proved to be
good indicators of compromise.
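As a rough illustration of the metrics listed above, here is a minimal
Python sketch of the per-user aggregation. It is not the actual
*loganalyzer* code (which is written in Go): the `compute_features`
function, the `FREEMAIL_DOMAINS` set and every output key other than
`messages_count` and `top_domains` are hypothetical names chosen for
this example.

```python
from collections import Counter

# Hypothetical freemail list; the list used by loganalyzer is not shown here.
FREEMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def compute_features(messages, rate_limit_hits=0, top_n=10):
    """Aggregate metrics for a single user over one time interval.

    messages is a list of (recipient_address, delivered_ok) tuples.
    """
    domains = Counter()
    freemail = 0
    failed = 0
    for recipient, delivered_ok in messages:
        # Everything after the last "@" is treated as the destination domain.
        domain = recipient.rsplit("@", 1)[-1].lower()
        domains[domain] += 1
        if domain in FREEMAIL_DOMAINS:
            freemail += 1
        if not delivered_ok:
            failed += 1
    return {
        "messages_count": len(messages),
        "freemail_count": freemail,          # hypothetical key
        "failed_count": failed,              # hypothetical key
        "top_domains": dict(domains.most_common(top_n)),
        "rate_limit_hits": rate_limit_hits,  # hypothetical key
    }


sample = [
    ("a@gmail.com", True),
    ("b@example.org", False),
    ("c@gmail.com", True),
]
features = compute_features(sample)
```

A compromised account would typically show a large `messages_count`
spread thinly over many domains, whereas a normal user's
`top_domains` would account for most of their traffic.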
## Installation
You are going to need a few dependencies, including a Go language
environment, and [Tensorflow]( On a Debian
$ sudo apt install golang-go python-dev python-pip
$ pip install tensorflow
1) Build the *loganalyzer* binary:
$ go build -o loganalyzer loganalyzer.go
$ sudo cp loganalyzer /usr/local/bin/loganalyzer
2) Install the Python NN code:
$ sudo python install
## Usage
The analyzer requires its input to be in the standard Syslog format
with old-style (i.e. "broken") timestamps, for instance:
Apr 13 06:28:25 hostname program[pid]: foo bar
Unfortunately the input format is currently not flexible.
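To make the expected format concrete, the following Python sketch
parses one such line. It only illustrates the format and is not how
*loganalyzer* itself parses logs; `parse_syslog_line` and `LINE_RE`
are hypothetical names. Old-style timestamps carry no year, so this
sketch assumes the caller supplies one.

```python
import re
from datetime import datetime

# English month abbreviations, mapped by hand to avoid locale dependence.
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s*"
    r"(?P<message>.*)$"
)


def parse_syslog_line(line, year):
    """Return a dict describing the line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    month = MONTHS.get(match.group("month"))
    if month is None:
        return None
    hour, minute, second = (int(part) for part in match.group("time").split(":"))
    pid = match.group("pid")
    return {
        # The year is not in the log line; it is an assumption of the caller.
        "timestamp": datetime(year, month, int(match.group("day")),
                              hour, minute, second),
        "host": match.group("host"),
        "program": match.group("program"),
        "pid": int(pid) if pid is not None else None,
        "message": match.group("message"),
    }


record = parse_syslog_line(
    "Apr 13 06:28:25 mx1 postfix/smtpd[1234]: foo bar", year=2017
)
```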
### Training
To train the neural network you'll need a dataset that is relevant to
your situation, with the users manually classified as compromised or
not. We'll call this the *training* dataset.
Furthermore, in order to properly validate the accuracy of the
network, you will need a second dataset (the *test* dataset) unrelated
to the first (for example, logs from two separate, non-overlapping time
periods).
Let's see an example. Assume `mail.log.train` and `mail.log.test` are
the two datasets mentioned above. Firstly, we are going to invoke
*loganalyzer* to extract features from the logs:
$ loganalyzer mail.log.train > features.train
$ loganalyzer mail.log.test > features.test
The resulting files will consist of JSON-encoded records, one per
line, each corresponding to a different user. Something like:
{"user": "", "messages_count": 200,
"top_domains": {"": 100, "": 100}, ...}
Now you need to create separate files containing the expected
classification results for each user in the two datasets. We are going
to call them `labels.train` and `labels.test`. These files use a
simpler format, just a username and 0 or 1 to indicate a normal user
(0) or a compromised account (1). For instance: 1 0
Since usually there are going to be a lot more normal users than
compromised accounts, 0 is the default label if a user is not found in
this file, so you can just list the compromised accounts in there to
save typing.
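The way the labels combine with the feature records can be sketched
in Python as follows. This is an illustration only, not the code used
by `account-intelligence-nn`: `read_labels` and `labelled_records` are
hypothetical helpers, the user names are made up, and the sketch
assumes the username and the label are separated by whitespace.

```python
import json


def read_labels(lines):
    """Parse 'username label' lines into a dict of user -> 0 or 1."""
    labels = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        user, value = fields
        labels[user] = int(value)
    return labels


def labelled_records(feature_lines, labels):
    """Yield (record, label) pairs; users missing from labels default to 0."""
    for line in feature_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record, labels.get(record["user"], 0)


# Made-up sample data, in the two formats described above.
labels = read_labels(["mallory@example.com 1\n", "alice@example.com 0\n", "\n"])
feature_lines = [
    '{"user": "mallory@example.com", "messages_count": 200}',
    '{"user": "bob@example.com", "messages_count": 3}',
]
pairs = list(labelled_records(feature_lines, labels))
```

Here `bob@example.com` does not appear in the labels, so it is treated
as a normal user (0).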
Finally, train the NN:
$ account-intelligence-nn --train --labels=labels.train \
< features.train
and verify its accuracy on the test dataset:
$ account-intelligence-nn --labels=labels.test < features.test
Note that the NN will save its state in the local directory by
default. If the results are satisfying, you can run the analysis on
any other log file:
$ loganalyzer mail.log | account-intelligence-nn
## First results
Training the NN on a small set of A/I mail logs resulted in 92%
accuracy on the test dataset, which is quite good considering the
extremely small size of the training corpus (tens of users).
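For reference, accuracy here means the fraction of users whose
predicted class matches their label. A minimal Python sketch of that
calculation follows; the `accuracy` helper and the dict-based
representation of predictions are assumptions made for this example,
not the output format of `account-intelligence-nn`.

```python
def accuracy(predictions, labels):
    """Fraction of users whose predicted class matches the expected label.

    predictions maps user -> 0 or 1; users missing from labels count as 0,
    mirroring the default used for the labels files.
    """
    if not predictions:
        raise ValueError("no predictions to evaluate")
    correct = sum(
        1
        for user, predicted in predictions.items()
        if predicted == labels.get(user, 0)
    )
    return correct / len(predictions)


# Made-up example: three of the four users are classified correctly.
predictions = {"a": 1, "b": 0, "c": 0, "d": 1}
expected = {"a": 1}
score = accuracy(predictions, expected)
```

Since normal users usually far outnumber compromised accounts, it is
worth checking how the compromised accounts specifically were
classified in addition to the overall figure.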
// logreader
// loganalyzer
// Generate high-level events from mail.log files.