tabacco
===

A data backup manager for distributed environments, with a special
focus on multi-tenant services: it backs up high-level *datasets* and
maintains their association with the low-level hosts and files where
the data lives.

# Overview

The idea is to describe the data to be backed up in terms that make
sense to the application layer: for instance, a web hosting provider
may have datasets corresponding to the data and SQL databases of each
individual user (e.g. *data/user1*, *data/user2*, *sql/user1*, etc.).
The software then maps these dataset names to hosts and files, backs
up the data, and lets you retrieve it in those same terms.

The following scenarios / use cases for retrieval are supported:

* retrieve specific datasets, identified by their name
* restore an entire host or dataset group, for administrative or
  maintenance-related reasons

To support both scenarios, tabacco lets you restore datasets based on
either high-level or low-level identifiers.

To explain what this means in practice, consider a distributed
services environment: we may not care which specific host the service
*foo* was running on, or where exactly on the filesystem its data was
stored; what we want is to be able to say "restore the data for
service *foo*".

The tabacco system works using *agents*, running on all hosts you have
data on, and a centralized metadata database (*metadb*) that stores
information about each backup globally.

# Usage

The *tabacco* command has a number of sub-commands to invoke various
functions:

## agent

The *agent* sub-command starts the backup agent. This is meant to run
in the background as a daemon (managed by init), and it will invoke
backup jobs periodically according to their configured schedule.

The daemon will read its configuration from */etc/tabacco/agent.yml*
and its subdirectories by default, though this can be changed with the
`--config` option.

The process will also start an HTTP listener on an address you
specify with the `--http-addr` option, which is used to export
monitoring and debugging endpoints.
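
For example, a typical invocation might look like the following (the
listen address is just an illustration, pick whatever suits your
monitoring setup):

```
# Start the backup agent with an explicit configuration path and a
# local HTTP listener for monitoring and debugging.
tabacco agent --config /etc/tabacco/agent.yml --http-addr 127.0.0.1:8083
```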

## metadb

The *metadb* sub-command starts the metadata server, the central
database (with an HTTP API) that stores information about backups.
This is a critical component: without it, backups and restores are
not possible. This process is meant to run in the background (managed
by init).

The daemon will read its configuration from */etc/tabacco/metadb.yml*
by default (change using the `--config` option).
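
As a sketch, a typical invocation just points the server at its
configuration file:

```
# Start the metadata server (the path shown is the default).
tabacco metadb --config /etc/tabacco/metadb.yml
```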


# User's Guide

Let's look at some of the fundamental concepts: consider the backup
manager as a gateway between *data sources* and the destination
*storage* layer. Each high-level dataset is known as an *atom*.

There is often a trade-off to be made when backing up multi-tenant
services: do we invoke a backup handler once per tenant, or do we dump
everything once and just say it's made of multiple atoms? You can pick
the best approach on a case-by-case basis, by grouping atoms into
*datasets*. We'll look at examples later to clarify what this means.

## Repository

The first thing your backup needs is a destination *repository*, that
is, a way to archive data long-term. The current implementation
uses [restic](https://restic.net), an encrypted, deduplicating backup
tool that supports a
[large number of remote storage options](https://restic.readthedocs.io/en/stable/030_preparing_a_new_repo.html).

## The *file* handler

Every dataset has an associated *handler*, which is responsible for
actually taking the backup or performing the restore. The most
straightforward handler is builtin and is called *file*: it simply
backs up and restores files on the filesystem. It is configured with
a single *path* attribute pointing at the location to back up or
restore.

Builtin handlers (such as *pipe*, described below) are usually used
as templates for customized handlers. This is not the case with the
*file* handler, which is so simple it can be used directly.
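
As a minimal sketch (the dataset name and path here are purely
illustrative), a dataset using the *file* handler directly looks like
this:

```
- name: etc
  handler: file
  params:
    path: /etc
```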

## Datasets and atoms

Imagine a hosting provider with two FTP accounts on the local
host. The first possibility is to treat each as its own dataset, in
which case the backup command will be invoked twice:

```
- name: users/account1
  handler: file
  params:
    path: /users/account1
- name: users/account2
  handler: file
  params:
    path: /users/account2
```

Datasets that do not explicitly list atoms will implicitly be treated
as if they contained a single, anonymous atom.

In the same scenario as above, it may be easier to simply dump all
of /users, and just say that it contains *account1* and *account2*:

```
- name: users
  handler: file
  params:
    path: /users
  atoms:
    - name: account1
    - name: account2
```

For datasets with one or more atoms explicitly defined, the final atom
name is the concatenation of the dataset name and the atom name. So in
this example we end up with exactly the same atoms as above,
*users/account1* and *users/account2*.

## Dynamic data sources

It would be convenient to generate the list of atoms dynamically, and
in fact it is possible to do so using an *atoms\_command*:

```
- name: users
  handler: file
  params:
    path: /users
  atoms_command: dump_accounts.sh
```

The script will be called on each backup, and it should print atom
names to its standard output, one per line.
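
A hypothetical *dump_accounts.sh*, assuming (as in the examples above)
that every account corresponds to a directory under /users, could be
as simple as:

```
#!/bin/sh
# Print one atom name per line: the account directories under /users.
ls /users
```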

## Pre and post scripts

Suppose the data to back up isn't just a file on the filesystem, but
rather data in a service that must be extracted using some external
tool. It's possible to run arbitrary commands before or after a backup.

Regardless of the handler selected, all sources can define commands to
be run before and after backup or restore operations on the whole
dataset. These attributes are:

* *pre_backup_command* is invoked before a backup of the dataset
* *post_backup_command* is invoked after a backup of the dataset
* *pre_restore_command* is invoked before a restore of a dataset
* *post_restore_command* is invoked after a restore of a dataset

The scripts are run through a shell so they support environment
variable substitution and other shell syntax. The following special
environment variables are defined:

* `BACKUP_ID` - unique backup ID
* `DATASET_NAME` - name of the dataset
* `ATOM_NAMES` - names of all atoms, space-separated (only
  available for dataset-level scripts)
* `ATOM_NAME` - atom name (only available for atom-level scripts)

So, for instance, this would be a way to make a backup of a MySQL
database instance:

```
- name: sql
  handler: file
  pre_backup_command: "mysqldump --all-databases > /var/backups/sql/dump.sql"
  post_restore_command: "mysql < /var/backups/sql/dump.sql"
  params:
    path: /var/backups/sql
```

Or, if you have a clever MySQL dump tool that saves each database
into a separate directory named after the database itself, you could
do something a bit better:

```
- name: sql
  handler: file
  pre_backup_command: "cd /var/backups/sql && clever_mysql_dump $ATOM_NAMES"
  post_restore_command: "cd /var/backups/sql && clever_mysql_restore $ATOM_NAMES"
  params:
    path: /var/backups/sql
  atoms:
    - name: db1
    - name: db2
```

This has the advantage of recording the appropriate atom metadata, so
we can restore individual databases.

## The *pipe* handler

The MySQL example just above has a major disadvantage, in that it
requires writing the entire database to the local disk in
/var/backups/sql, only so that the backup tool can read it and send it
to the repository. This process can be optimized away by having a
command simply pipe its output to the backup tool, using the *pipe*
handler.

Unlike the *file* handler seen before, the *pipe* handler can't
be used unless it is configured appropriately, by creating a
user-defined handler.

Since it's impractical to access individual items within a single data
stream, pipe handlers operate on individual atoms: datasets containing
multiple atoms are automatically converted into a list of datasets
with one atom each. This is an internal mechanism with almost no
practical consequences, except in reports and logs, which will show
the resulting multiple data sources.

Configuration is performed by setting two parameters:

* *backup_command* is the command to generate a backup of an atom on
  standard output
* *restore_command* is the command used to restore an atom.

So, for instance, the MySQL example could be rewritten as this handler
definition:

```
- name: mysql-pipe
  params:
    backup_command: "mysqldump --databases ${atom.name}"
    restore_command: "mysql"
```

and this dataset source:

```
- name: sql
  handler: mysql-pipe
  atoms:
    - name: db1
    - name: db2
```

## Runtime signals

The agent will reload its configuration on SIGHUP, and it will
immediately trigger all backup jobs upon receiving SIGUSR1.
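
For instance, the signals can be delivered with pkill, matching on the
command line (this assumes the agent was started as `tabacco agent`):

```
# Reload the agent configuration.
pkill -HUP -f 'tabacco agent'

# Immediately trigger all backup jobs.
pkill -USR1 -f 'tabacco agent'
```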

# TODO

Things still to do:

* The agent can currently do both backups and restores, but there is
  no way to trigger a restore. Some sort of authenticated API is
  needed for this.

Things not to do:

* Global (cluster-wide) scheduling - that's the job of a global cron
  scheduler, which could then easily trigger backups.