From 08f924d4acff0b3f305bd9861a77a00d03dc7d17 Mon Sep 17 00:00:00 2001
From: ale <ale@incal.net>
Date: Tue, 22 Aug 2017 08:29:10 +0100
Subject: [PATCH] Immediately run a slave-cmd when losing mastership

Also updated the README with a bit more information.
---
 README.md     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 connection.go | 10 ++++++++++
 2 files changed, 61 insertions(+)

diff --git a/README.md b/README.md
index 67737f4..52d6834 100644
--- a/README.md
+++ b/README.md
@@ -5,3 +5,54 @@ A simple utility that runs a master-election protocol on top of
 [etcd](https://github.com/coreos/etcd), and that can run
 user-specified commands on state transitions. It is meant as a
 building block for highly available services.
+
+# Usage
+
+The tool will attempt to acquire an etcd lock (currently
+`/me/`*service_name*`/lock`). If it succeeds, it will run the command
+specified by *--master-cmd*, and it will consider itself to be the
+master until one of the following conditions is true:
+
+* the *masterelection* tool itself is terminated
+* the connection to *etcd* becomes unavailable
+
+If the tool fails to acquire the lock, it will run the command
+specified by *--slave-cmd*, and it will start monitoring the lock
+for changes (like TTL expiry), waiting for an opportunity to acquire
+it again. Whenever some other node acquires the lock, the tool will
+run *--slave-cmd* again with the new master's address.
+
+Commands started by *masterelection* can be long-lived (like spawning
+a daemon) or short-lived (sending an IPC message). In either case, on
+every state change event the tool will kill the previously running
+command with SIGTERM (if it is still running) and immediately spawn
+the new one.
+
+State is passed to commands via environment variables:
+
+* `IS_MASTER` will be either `1` or `0`
+* `MASTER_ADDR` will contain the address of the current master (empty if unknown)
+
+## Failure modes
+
+As long as the connection to etcd is active, the state seen by the
+tool will be consistent. Issues arise when the connection to etcd is
+lost: in this case, the tool favors stability and will not issue
+state changes if it had a slave role. That behavior would be
+problematic for a master, though: if a master gets isolated by a
+network partition, it would keep believing it is the master, making
+later reconciliation difficult if the remaining nodes still form a
+quorum. So, when the etcd connection is lost, the tool issues a
+*--slave-cmd* with an empty MASTER_ADDR.
+
+## Examples
+
+A simple (and somewhat naive) example that controls the replication
+setup of an already-running MySQL instance, assuming you are using
+Global Transaction Identifiers (GTIDs):
+
+    $ masterelection --name=$MYHOSTNAME --service-addr=$MYADDR:3306 \
+        --master-cmd="mysql -e 'STOP SLAVE; RESET MASTER'" \
+        --slave-cmd="mysql -e 'CHANGE MASTER TO MASTER_HOST=\'\$MASTER_ADDR\''"
+
+Ok, I may have gotten the quoting wrong, but you get the idea :)
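+
+To sidestep shell quoting issues like the above, both commands can
+also point at a small wrapper script that reads the environment
+variables described earlier. A minimal sketch (the actual actions
+will depend on the service being managed):
+
+    #!/bin/sh
+    # Invoked by masterelection on every state change; replace the
+    # echo lines with real promotion/demotion logic.
+    if [ "$IS_MASTER" = "1" ]; then
+        echo "promoting to master"
+    elif [ -n "$MASTER_ADDR" ]; then
+        echo "following master at $MASTER_ADDR"
+    else
+        echo "etcd unreachable, no known master"
+    fi
+
+The same script could then be passed to both flags, e.g.
+`--master-cmd=/path/to/on-state-change --slave-cmd=/path/to/on-state-change`.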
diff --git a/connection.go b/connection.go
index 464c80c..4b867e2 100644
--- a/connection.go
+++ b/connection.go
@@ -227,6 +227,16 @@ func runMasterElection(ctx context.Context, api etcdclient.KeysAPI, lockPath, se
 	} else {
 		// Success, we are now the master.
 		err = runMaster(ctx, api, lockPath, self, stateFn)
+		// Once we are not the master anymore, there's
+		// the possibility that we have lost access to
+		// etcd. Issue a state change to slave with
+		// unknown master, for safety.
+		//
+		// TODO: it would be better to wait for a
+		// little while, just in case we can
+		// successfully reconnect right away and do a
+		// single master -> slave transition.
+		stateFn(ctx, stateChangeMsg{isMaster: false})
 	}
 	if err == context.Canceled {
--
GitLab