replds
Maintains a (small) set of files, replicated across multiple servers. It is targeted at small datasets that are managed by automation workflows and need to be propagated to machines at runtime.
Data replication is eventually consistent, and conflict resolution applies last-write-wins semantics. Writes are immediately forwarded to all peers, but only one copy needs to succeed in order for the write to be acknowledged. The last written data will appear on all nodes as soon as network partitions are resolved.
Given the replication model, this is not safe to use with multiple writers on overlapping key spaces. For read-modify-update workflows, it is best to implement a separate locking mechanism so that only a single workflow accesses the data at any given time: since there is no locking in the service itself, this is necessary to prevent unexpected out-of-order updates.
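For example, if all writers run on a single coordinating host, flock(1) is enough to serialize them. A minimal sketch, where the lock file and the workflow command are placeholders:

```sh
# Run the update workflow under an exclusive lock; -n makes concurrent
# invocations fail immediately instead of queueing up behind the lock.
flock -n /run/lock/replds-workflow.lock /usr/local/bin/update-workflow
```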
There is no dynamic cluster control: the full list of peers must be provided to each daemon, which suggests using a configuration management system to generate the daemon configuration.
Configuration
The replds tool requires a YAML-encoded configuration file (which you can specify using the --config command-line option). This file should contain the following attributes:
- client - configuration for the replds client commands
  - url - service URL (the hostname can resolve to multiple IP addresses)
  - tls - TLS configuration for the client
    - cert - path to the certificate
    - key - path to the private key
    - ca - path to the CA file
- server - configuration for the replds server command
  - path - path of the locally managed repository
  - peers - list of URLs of cluster peers
  - tls_client - TLS configuration for the peer-to-peer client
    - cert - path to the certificate
    - key - path to the private key
    - ca - path to the CA file
- http_server - configuration for the HTTP server
  - tls - server-side TLS configuration
    - cert - path to the server certificate
    - key - path to the server's private key
    - ca - path to the CA used to validate clients
  - acl - TLS-based access controls, a list of entries with the following attributes:
    - path - regular expression to match the request URL path
    - cn - regular expression that must match the CommonName part of the subject of the client certificate
  - max_inflight_requests - maximum number of in-flight requests to allow before server-side throttling kicks in
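Putting it together, a complete configuration might look like the sketch below; the hostnames, port numbers and filesystem paths are placeholders, not defaults:

```yaml
client:
  url: https://replds.example.com:4444
  tls:
    cert: /etc/replds/client.pem
    key: /etc/replds/client.key
    ca: /etc/replds/ca.pem
server:
  path: /var/lib/replds
  peers:
    - https://host1.example.com:4444
    - https://host2.example.com:4444
  tls_client:
    cert: /etc/replds/peer.pem
    key: /etc/replds/peer.key
    ca: /etc/replds/ca.pem
http_server:
  tls:
    cert: /etc/replds/server.pem
    key: /etc/replds/server.key
    ca: /etc/replds/ca.pem
  max_inflight_requests: 100
```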
TLS Setup
For safe usage, you will want to secure both peer-to-peer and client-to-server communication with TLS, using separate credentials. You can then set ACLs that allow only peers to access the /api/internal/ URL prefix, and allow all clients to access everything else under /api/.
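One possible shape for such a setup, where the CommonName patterns are hypothetical and we assume ACL entries are checked in order with the first matching path winning (verify against the actual matching semantics):

```yaml
http_server:
  acl:
    # Peer certificates (hypothetical CN pattern) may use the
    # internal replication API.
    - path: "^/api/internal/"
      cn: "^replds-peer-"
    # All other client certificates (again a hypothetical CN
    # pattern) may use the public API.
    - path: "^/api/"
      cn: "^replds-client-"
```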
Service integration
The replication strategy adopted by replds puts severe limits on how it can be used; however, there are at least two use cases that we'd like to examine in more detail. In both cases, there is a single master server that controls the workflow (i.e. the key space is not partitioned).
Letsencrypt automation
In this scenario, SSL certificates are automatically generated at runtime with Letsencrypt (from a cron job), and we need to propagate them to front-end servers.
This scenario is relatively simple because the timeouts and delays involved in the workflow are so much greater than propagation delays and expected fault durations that data convergence is not an issue: when we refresh an SSL certificate 30 days before its expiration, it's fine if it gets picked up by application servers within a day or more.
The workflow is going to look like this:
- A cron job (on a single node) examines the local repository to find certificates that are about to expire, and renews them using the ACME API. We are ignoring the details of the challenge/response validation process as they are not relevant to data propagation issues.
- The cron job stores the results in replds.
- Periodically, the application servers are reloaded to pick up the new certificates, possibly via another cron job.
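To make this concrete, the first two steps could be implemented with a small script run daily from cron on the master node. Here certbot stands in for any ACME client, and the replds invocation is hypothetical beyond the documented --config option and sync command, so check the client documentation for the real syntax:

```sh
#!/bin/sh
# Daily cron job on the single master node (steps 1 and 2 above).
set -e

# Renew certificates close to expiration; certbot only touches those
# within its renewal window, so this is cheap to run often.
certbot renew --quiet

# Store the renewed certificates in replds so they propagate to the
# front-end servers (arguments are hypothetical, see above).
replds --config /etc/replds.yml sync /etc/letsencrypt/live
```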
Because the data reload cycle is independent of data propagation, it is possible to end up in a situation where the application is reloaded while the certificate and the private key do not (yet) match. One possible strategy for handling this situation is for the service to crash, relying on an automatic service restart policy to keep trying to start it again until the data is up to date: not optimal perhaps, but simple and guaranteed to converge.
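As an illustration of this crash-and-retry approach, a startup wrapper can compare the public keys of the certificate and the private key with standard openssl commands and refuse to start on a mismatch; the paths and the wrapped service below are examples:

```sh
#!/bin/sh
# Refuse to start (and let the service manager restart us later)
# while the certificate and the private key don't match.
CRT=/etc/ssl/example.com/fullchain.pem   # placeholder paths
KEY=/etc/ssl/example.com/privkey.pem

# A certificate and a private key match when they share the same
# public key; both openssl invocations print it in PEM form.
if [ "$(openssl x509 -in "$CRT" -noout -pubkey)" != \
     "$(openssl pkey -in "$KEY" -pubout)" ]; then
    echo "cert/key mismatch, refusing to start" >&2
    exit 1
fi
exec /usr/sbin/nginx -g 'daemon off;'    # example service
```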
Package repository
Here, we need to propagate a Debian package repository across multiple servers for redundancy. The incoming packages are sent to the master repository server (in our case, over SSH), where some processing takes place that results in a bunch of files being updated (the new packages, and the repository metadata). This processing stage needs to access the entire repository.
We're wrapping external functionality and tools, which may be complex enough that we can't simply make them use the replds API, so we let the tools use the local filesystem as they normally would. At the same time, we can't just run the repository tools on the filesystem copy managed by replds itself, because in that case we would not be able to detect changes. So we use a separate staging directory to run the repository tools on, and the final workflow (sketched in full after the list) is:
- rsync data from the replds-managed dir to the staging dir;
- run the metadata-generation tools on the staging dir;
- synchronize the data back to replds using the sync command.
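Put together, the workflow might look like the following sketch. The directory names are placeholders, reprepro stands in for whatever metadata-generation tool the repository uses, and the replds arguments are hypothetical beyond the documented --config option and sync command:

```sh
#!/bin/sh
set -e
SRC=/var/lib/replds/repo     # dir managed by replds (the server "path")
STAGING=/srv/repo-staging

# 1. rsync data from the replds-managed dir to the staging dir.
rsync -a --delete "$SRC/" "$STAGING/"

# 2. Run the metadata-generation tools on the staging dir
#    (reprepro is just an example of such a tool).
reprepro -b "$STAGING" includedeb stable /srv/incoming/*.deb

# 3. Synchronize the data back to replds with the sync command
#    (exact arguments are hypothetical, check the client docs).
replds --config /etc/replds.yml sync "$STAGING"
```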