Commit 75b23617 authored by ale

Add a proper configuration guide
A data backup manager for distributed environments, modeling backups
of high-level *datasets* and maintaining the association with
low-level hosts and files.
The idea is to describe the data to be backed up in terms that make
sense to the application layer: for instance, a web hosting provider
The following scenarios / use cases for retrieval are supported:
in order to do this, tabacco must allow you to restore datasets based
on either high-level or low-level identifiers.
To explain what this means in practice, consider a distributed
services environment: we may not care which host specifically the
service *foo* might have been running on, or where exactly on the
filesystem its data was stored; what we want is to be able to say
"restore the data for service *foo*".
# Usage

The *tabacco* command has a number of sub-commands to invoke its
various functions.
## daemon
The *daemon* sub-command starts the backup agent. It is meant to run
in the background as a daemon (managed by init), and it will invoke
backup jobs periodically at their desired schedule.
The daemon will read its configuration from */etc/tabacco/config.yml*
and its subdirectories by default, though this can be changed with the
`--config` option.
The process will also start an HTTP listener on an address you specify
with the `--http-addr` option, which is used to export monitoring and
debugging endpoints.
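Putting it together, a minimal invocation might look like the
following (the listen address is illustrative; `--config` and
`--http-addr` are the options described above):

```shell
tabacco daemon \
    --config /etc/tabacco/config.yml \
    --http-addr 127.0.0.1:8000
```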
## metadb
The *metadb* sub-command starts the metadata server, the central
database (with an HTTP API) that stores data about backups. This is a
critical component: without it, you can't perform backups or restores.
The process is meant to run in the background (managed by init).
The daemon will read its configuration from */etc/tabacco/metadb.yml*
by default (change using the `--config` option).
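For instance, assuming the default configuration path, the metadata
server could be started with:

```shell
tabacco metadb --config /etc/tabacco/metadb.yml
```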
# User's Guide
Let's look at some of the fundamental concepts: think of the backup
manager as a gateway between *data sources* and the destination
*storage* layer. Each high-level unit of data is known as an *atom*.
There is often a trade-off to be made when backing up multi-tenant
services: do we invoke a backup handler once per tenant, or do we dump
everything once and just say it's made of multiple atoms? You can pick
the best approach on a case-by-case basis, by grouping atoms into
*datasets*. A few examples might help:
## The *file* handler
Every dataset has an associated *handler*, which is responsible for
actually taking the backup or performing the restore. The most
straightforward handler is built in and is called *file*: it simply
backs up and restores files on the filesystem. It is configured with a
single *path* attribute pointing at the location to back up or restore.
Builtin handlers (such as *pipe* described below) are usually used as
templates for customized handlers. This is not the case with the
*file* handler, which is so simple it can be used directly.
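For instance, a minimal dataset using the *file* handler directly
might look like this (the name and path here are just examples):

```yaml
- name: etc
  handler: file
  path: /etc
```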
## Datasets and atoms
Imagine a hosting provider with two FTP accounts on the local
host. The first possibility is to treat each as its own dataset, in
which case the backup command will be invoked twice:
```yaml
- name: users/account1
  handler: file
  path: /users/account1
- name: users/account2
  handler: file
  path: /users/account2
```
For datasets without atoms, the system will create a single atom with
the same name as the dataset.
In the same scenario as above, it may be easier to simply dump all
of /users, and just say that it contains *account1* and *account2*:
```yaml
- name: users
  handler: file
  path: /users
  atoms:
    - name: account1
    - name: account2
```
For datasets with one or more atoms explicitly defined, the final atom
name is the concatenation of the dataset name and the atom name. So in
this example we end up with exactly the same atoms as above,
*users/account1* and *users/account2*.
## Dynamic data sources
It would be convenient to generate the list of atoms dynamically, and
in fact it is possible to do so using an *atoms\_command* (the command
shown here is just an example):

```yaml
- name: users
  handler: file
  path: /users
  atoms_command: "ls /users"
```
The script will be called on each backup, and it should print atom
names to its standard output, one per line.
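As a sketch, an *atoms\_command* for the FTP-accounts scenario above
could simply list the account directories under a root path (the
`list_atoms` helper name is hypothetical, not part of tabacco):

```shell
#!/bin/sh
# Hypothetical atoms_command script: print one atom name per line,
# one for each account directory under the given root.
list_atoms() {
    for d in "$1"/*/; do
        # only emit names for directories that actually exist
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

list_atoms /users
```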
## Pre and post scripts
Suppose the data to back up isn't just a file on the filesystem, but
rather data in a service that must be extracted somehow using tools.
It's possible to run arbitrary commands before or after a backup.
Regardless of the handler selected, all sources can define commands to
be run before and after backup or restore operations, either on the
whole dataset or individual atoms. These attributes are:
* *pre_backup_command* is invoked before a backup of the dataset
* *post_backup_command* is invoked after a backup of the dataset
* *pre_restore_command* is invoked before a restore of a dataset
* *post_restore_command* is invoked after a restore of a dataset
* *pre_atom_backup_command* is invoked before the backup of each atom
* *post_atom_backup_command* is invoked after the backup of each atom
* *pre_atom_restore_command* is invoked before the restore of each atom
* *post_atom_restore_command* is invoked after the restore of each atom
The scripts are run through a shell so they support environment
variable substitution and other shell syntax. The following special
environment variables are defined:
* `BACKUP_ID` - unique backup ID
* `DATASET_NAME` - name of the dataset
* `ATOM_NAMES` - names of all atoms, space-separated (only
available for dataset-level scripts)
* `ATOM_NAME` - atom name (only available for atom-level scripts)
So, for instance, this would be a way to make a backup of a MySQL
database instance:
```yaml
- name: sql
  handler: file
  pre_backup_command: "mysqldump --all-databases > /var/backups/sql/dump.sql"
  post_restore_command: "mysql < /var/backups/sql/dump.sql"
  path: /var/backups/sql
```
or, if you have a clever MySQL dump tool that saves each database into
a separate directory, named after the database itself, you could do
something a bit better like:
```yaml
- name: sql
  handler: file
  pre_backup_command: "cd /var/backups/sql && clever_mysql_dump $ATOM_NAMES"
  post_restore_command: "cd /var/backups/sql && clever_mysql_restore $ATOM_NAMES"
  path: /var/backups/sql
  atoms:
    - name: db1
    - name: db2
```
This has the advantage of having the appropriate atom metadata, so we
can restore individual databases.
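The document doesn't define *clever_mysql_dump*; as a rough sketch,
such a tool could write each database named on the command line into
its own directory (the `DUMP_CMD` variable is an assumption of this
sketch, added only so the dump tool can be swapped out; mysqldump is
assumed to be the underlying tool):

```shell
#!/bin/sh
# Hypothetical sketch of clever_mysql_dump: one directory per
# database, so each atom maps to a distinct path on disk.
clever_mysql_dump() {
    for db in "$@"; do
        mkdir -p "$db"
        # dump the named database into its own directory
        ${DUMP_CMD:-mysqldump --databases} "$db" > "$db/dump.sql"
    done
}
```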
## The *pipe* handler
The MySQL example just above has a major disadvantage: it requires
writing the entire database to the local disk in /var/backups/sql,
only so that the backup tool can read it and send it to the
repository. This step can be avoided by having a command simply pipe
its output to the backup tool, using the *pipe* handler.

Contrary to the *file* handler seen before, the *pipe* handler can't
be used directly: it must first be configured appropriately in a
user-defined handler.
Since it's impractical to access individual items within a single data
stream, pipe handlers operate on individual atoms: datasets containing
multiple atoms are automatically converted into a list of datasets
with one atom each. This is an internal mechanism with almost no
practical consequences, except in reports and logs, which will show
the multiple data sources.
Configuration is performed by setting two parameters:
* *backup_command* is the command to generate a backup of an atom on
standard output
* *restore_command* is the command used to restore an atom, reading
  the backup data from its standard input.
So, for instance, the MySQL example could be rewritten as this handler
definition:

```yaml
- name: mysql-pipe
  backup_command: "mysqldump --databases $ATOM_NAME"
  restore_command: "mysql"
```
and this dataset source:

```yaml
- name: sql
  handler: mysql-pipe
  atoms:
    - name: db1
    - name: db2
```