Commit 08f924d4 authored by ale

Immediately run a slave-cmd when losing mastership

Also updated the README with a bit more information.
parent b3473177
@@ -5,3 +5,54 @@ A simple utility that runs a master-election protocol on top of
[etcd](https://github.com/coreos/etcd), and that can run
user-specified commands on state transitions. It is meant as a
building block for highly available services.
# Usage
The tool will attempt to acquire an etcd lock (currently
`/me/`*service_name*`/lock`). If it succeeds, it will run the command
specified by *--master-cmd*, and it will consider itself to be the
master until one of the following conditions is true:
* the *masterelection* tool itself is terminated
* the connection to *etcd* becomes unavailable
If the tool fails to acquire the lock, it will run the command
specified by *--slave-cmd*, and it will start monitoring the lock
for changes (such as TTL expiry), waiting for the opportunity to
acquire it again. Whenever another node acquires the lock, it will
run *--slave-cmd* again with the new master address.
Commands started by *masterelection* can be long-lived (like spawning
a daemon) or short-lived (sending an IPC message). In either case, on
every state change event the tool will kill the previously running
command with SIGTERM, if it is still running, and immediately spawn
the new one.
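For a long-lived hook, something like the sketch below can work (the
daemon name and path are hypothetical placeholders): running the
daemon in the foreground and `exec`ing it from the script means the
SIGTERM sent on the next state change reaches the daemon itself rather
than an intermediate shell, assuming the signal is delivered to the
spawned process.

    #!/bin/sh
    # Hypothetical long-lived --master-cmd hook. exec replaces the shell,
    # so the SIGTERM sent on the next state change hits the daemon
    # directly instead of a wrapper process.
    exec /usr/sbin/mydaemon --foreground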
State is passed to commands via environment variables:
* `IS_MASTER` will be either 1 or 0
* `MASTER_ADDR` will contain the address of the current master
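A hook can read these variables directly from its environment. Below
is a minimal sketch of such a hook (the script and log file paths are
hypothetical); it could be passed as either *--master-cmd* or
*--slave-cmd* and simply logs the transition:

    #!/bin/sh
    # Hypothetical state-change hook: masterelection exports IS_MASTER and
    # MASTER_ADDR in the environment of the command it runs.
    if [ "$IS_MASTER" = "1" ]; then
        echo "$(date): promoted to master" >> /var/log/masterelection.log
    else
        echo "$(date): demoted to slave, master is '$MASTER_ADDR'" >> /var/log/masterelection.log
    fi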
## Failure modes
As long as the connection to etcd is active, the state seen by the
tool is consistent. Issues arise when the connection to etcd is lost:
in that case the tool favors stability and will not issue a state
change if it held the slave role. This behavior would be problematic
for a master, though: a master isolated by a network partition would
keep believing it is the master, making later reconciliation difficult
if the remaining nodes form a quorum and elect a new one. So, when the
etcd connection is lost while holding mastership, the tool runs
*--slave-cmd* with an empty MASTER_ADDR.
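A *--slave-cmd* hook can therefore tell a normal demotion apart from a
lost etcd connection by checking whether MASTER_ADDR is empty. A
minimal sketch (the actions shown are placeholders):

    #!/bin/sh
    # Hypothetical --slave-cmd hook: an empty MASTER_ADDR means the etcd
    # connection was lost and the current master is unknown.
    if [ -z "$MASTER_ADDR" ]; then
        echo "etcd unreachable, master unknown; stopping writes as a precaution"
    else
        echo "following the new master at $MASTER_ADDR"
    fi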
## Examples
A simple (and somewhat naive) example that controls the replication
setup of an already-running MySQL instance, assuming you are using
Global Transaction Identifiers:
    $ masterelection --name=$MYHOSTNAME --service-addr=$MYADDR:3306 \
        --master-cmd="mysql -e 'STOP SLAVE; RESET MASTER'" \
        --slave-cmd="mysql -e 'CHANGE MASTER TO MASTER_HOST=\'\$MASTER_ADDR\''"
Ok, I may have gotten the quoting wrong, but you get the idea :)
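If the quoting gets out of hand, one way to sidestep it is to move the
logic into a small wrapper script and pass that as the hook instead.
A hypothetical sketch (the script path and SQL are illustrative, and
just as naive as the example above):

    #!/bin/sh
    # /usr/local/bin/mysql-follow (hypothetical): point replication at the
    # master address exported by masterelection.
    mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_HOST='$MASTER_ADDR'; START SLAVE;"

which can then be invoked without any escaping:

    $ masterelection --name=$MYHOSTNAME --service-addr=$MYADDR:3306 \
        --master-cmd="mysql -e 'STOP SLAVE; RESET MASTER'" \
        --slave-cmd=/usr/local/bin/mysql-follow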
@@ -227,6 +227,16 @@ func runMasterElection(ctx context.Context, api etcdclient.KeysAPI, lockPath, se
	} else {
		// Success, we are now the master.
		err = runMaster(ctx, api, lockPath, self, stateFn)
		// Once we are not the master anymore, there's
		// the possibility that we have lost access to
		// etcd. Issue a state change to slave with
		// unknown master, for safety.
		//
		// TODO: it would be better to wait for a
		// little while, just in case we can
		// successfully reconnect right away and do a
		// single master -> slave transition.
		stateFn(ctx, stateChangeMsg{isMaster: false})
	}
	if err == context.Canceled {