tabacco
A data backup manager for distributed environments, with a special focus on multi-tenant services: it backs up high-level datasets and maintains their association with low-level hosts and files.
Overview
The idea is to describe the data to be backed up in terms that make sense to the application layer: for instance, a web hosting provider may have datasets corresponding to the data and SQL databases of each individual user (e.g. data/user1, data/user2, sql/user1, etc.). The software then maps these dataset names to hosts and files, backs up the data, and allows you to retrieve it in those same terms.
The following scenarios / use cases for retrieval are supported:
- retrieve specific datasets, identified by their name
- restore an entire host or dataset group, for administrative or maintenance-related reasons
In order to do this, tabacco must allow you to restore datasets based on both high-level and low-level identifiers.
To explain what this means in practice, consider a distributed services environment: we may not care which specific host the service foo was running on, or where exactly on the filesystem its data was stored; what we want is to be able to say "restore the data for service foo".
The tabacco system works using agents, running on all hosts you have data on, and a centralized metadata database (metadb) that stores information about each backup globally.
Usage
The tabacco command has a number of sub-commands to invoke various functions:
agent
The agent sub-command starts the backup agent. It is meant to run in the background as a daemon (managed by init), and it will invoke backup jobs periodically according to their desired schedule.
The daemon will read its configuration from /etc/tabacco/agent.yml and its subdirectories by default, though this can be changed with the --config option.
The process will also start an HTTP listener on an address you specify with the --http-addr option, which is used to export monitoring and debugging endpoints.
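For example, a minimal invocation might look like this (the listen address shown is arbitrary and only for illustration):

# Start the agent, reading the default configuration and exposing the
# monitoring / debugging endpoints on localhost port 8080.
tabacco agent --config /etc/tabacco/agent.yml --http-addr 127.0.0.1:8080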
metadb
The metadb sub-command starts the metadata server, the central database (with an HTTP API) that stores data about backups. This is a critical component: without it, you can't perform backups or restores. This process is meant to run in the background (managed by init).
The daemon will read its configuration from /etc/tabacco/metadb.yml by default (change using the --config option).
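Again just as a sketch, starting the metadata server by hand could look like this (in practice init would manage the process):

# Start the metadata server with an explicit configuration file.
tabacco metadb --config /etc/tabacco/metadb.yml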
User's Guide
Let's look at some of the fundamental concepts: consider the backup manager as a gateway between data sources and the destination storage layer. Each high-level piece of data to be backed up is known as an atom.
There is often a trade-off to be made when backing up multi-tenant services: do we invoke a backup handler once per tenant, or do we dump everything once and just say it's made of multiple atoms? You can pick the best approach on a case-by-case basis, by grouping atoms into datasets. We'll look at examples later to clarify what this means.
Repository
The first thing your backup needs is a destination repository, that is, a way to archive data long-term. The current implementation uses restic, an encrypted, deduplicating backup tool that supports a large number of remote storage options.
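To illustrate what such a repository is, here is a sketch of initializing a local restic repository by hand; the path and password handling are only assumptions for the example, and how tabacco itself is pointed at the repository is part of the agent configuration (not shown here):

# Create an encrypted restic repository on a local path; any of the
# remote backends restic supports would work the same way.
export RESTIC_PASSWORD='example-password'
restic -r /srv/tabacco-repo init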
The file handler
Every dataset has an associated handler, which is responsible for actually taking the backup or performing the restore. The most straightforward handler is built in and is called file: it simply backs up and restores files on the filesystem. It is configured with a single path attribute pointing at the location to back up or restore.
Built-in handlers (such as the pipe handler described below) are usually used as templates for customized handlers. This is not the case with the file handler, which is so simple it can be used directly.
Datasets and atoms
Imagine a hosting provider with two FTP accounts on the local host. The first possibility is to treat each as its own dataset, in which case the backup command will be invoked twice:
- name: users/account1
  handler: file
  params:
    path: /users/account1
- name: users/account2
  handler: file
  params:
    path: /users/account2
Datasets that do not explicitly list atoms will implicitly be treated as if they contained a single, anonymous atom.
In the same scenario as above, it may be easier to simply dump all of /users, and just say that it contains account1 and account2:
- name: users
  handler: file
  params:
    path: /users
  atoms:
    - name: account1
    - name: account2
For datasets with one or more atoms explicitly defined, the final atom name is the concatenation of the dataset name and the atom name. So in this example we end up with exactly the same atoms as above, users/account1 and users/account2.
Dynamic data sources
It would be convenient to generate the list of atoms dynamically, and in fact it is possible to do so using an atoms_command:
- name: users
  handler: file
  params:
    path: /users
  atoms_command: dump_accounts.sh
The script will be called on each backup, and it should print atom names to its standard output, one per line.
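As a sketch, the dump_accounts.sh for the example above could be as simple as listing the per-account directories (a real script might query an account database instead):

#!/bin/sh
# Print one atom name per line: here, one name per directory under /users.
ls -1 /users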
Pre and post scripts
Suppose the data to back up isn't just a file on the filesystem, but rather data in a service that must be extracted somehow using tools. It's possible to run arbitrary commands before or after a backup.
Regardless of the handler selected, all sources can define commands to be run before and after backup or restore operations on the whole dataset. These attributes are:
- pre_backup_command is invoked before a backup of the dataset
- post_backup_command is invoked after a backup of the dataset
- pre_restore_command is invoked before a restore of a dataset
- post_restore_command is invoked after a restore of a dataset
The scripts are run through a shell so they support environment variable substitution and other shell syntax. The following special environment variables are defined:
- BACKUP_ID - unique backup ID
- DATASET_NAME - name of the dataset
- ATOM_NAMES - names of all atoms, space-separated (only available for dataset-level scripts)
- ATOM_NAME - atom name (only available for atom-level scripts)
So, for instance, this would be a way to make a backup of a MySQL database instance:
- name: sql
  handler: file
  pre_backup_command: "mysqldump --all-databases > /var/backups/sql/dump.sql"
  post_restore_command: "mysql < /var/backups/sql/dump.sql"
  params:
    path: /var/backups/sql
or, if you have a clever MySQL dump tool that saves each database into a separate directory, named after the database itself, you could do something a bit better like:
- name: sql
  handler: file
  pre_backup_command: "cd /var/backups/sql && clever_mysql_dump $ATOM_NAMES"
  post_restore_command: "cd /var/backups/sql && clever_mysql_restore $ATOM_NAMES"
  params:
    path: /var/backups/sql
  atoms:
    - name: db1
    - name: db2
This approach has the advantage of recording the appropriate atom metadata, so individual databases can be restored.
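The "clever" dump tool is hypothetical; a minimal sketch of what it could look like, writing each database named on the command line into its own directory under the current working directory:

#!/bin/sh
# clever_mysql_dump (hypothetical): dump each database given as an
# argument into a separate directory named after the database itself.
set -e
for db in "$@"; do
    mkdir -p "$db"
    mysqldump --databases "$db" > "$db/dump.sql"
done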
The pipe handler
The MySQL example just above has a major disadvantage, in that it requires writing the entire database to the local disk in /var/backups/sql, only so that the backup tool can read it and send it to the repository. This process can be optimized away by having a command simply pipe its output to the backup tool, using the pipe handler.
Unlike the file handler seen before, the pipe handler can't be used directly: it must first be configured appropriately, by creating a user-defined handler.
Since it's impractical to access individual items within a single data stream, pipe handlers operate on individual atoms: datasets containing multiple atoms are automatically converted into a list of datasets with one atom each. This is an internal mechanism and has almost no practical consequences, except in reports and logs, which will show the multiple data sources.
Configuration is performed by setting two parameters:
- backup_command is the command to generate a backup of an atom on standard output
- restore_command is the command used to restore an atom; it receives the backup data on its standard input.
So, for instance, the MySQL example could be rewritten as this handler definition:
- name: mysql-pipe
  params:
    backup_command: "mysqldump --databases ${atom.name}"
    restore_command: "mysql"
and this dataset source:
- name: sql
  handler: mysql-pipe
  atoms:
    - name: db1
    - name: db2
Runtime signals
The agent will reload its configuration on SIGHUP, and it will immediately trigger all backup jobs upon receiving SIGUSR1.
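For instance, assuming the daemon runs under the process name tabacco, the signals could be delivered with pkill (or your init system's equivalent):

# Reload the configuration, then trigger all backup jobs immediately.
pkill -HUP -x tabacco
pkill -USR1 -x tabacco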
TODO
Things still to do:
- The agent can currently do both backups and restores, but there is no way to trigger a restore. Some sort of authenticated API is needed for this.
Things not to do:
- Global (cluster-wide) scheduling - that's the job of a global cron scheduler, which could then easily trigger backups.