From d6024274684db0ec48a239b4731a4cfe7b15d8ff Mon Sep 17 00:00:00 2001
From: ale <ale@incal.net>
Date: Sat, 3 Jan 2015 11:07:26 +0000
Subject: [PATCH] documentation on tunable parameters

---
 TUNING.rst | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 TUNING.rst

diff --git a/TUNING.rst b/TUNING.rst
new file mode 100644
index 00000000..d488d0df
--- /dev/null
+++ b/TUNING.rst
@@ -0,0 +1,100 @@

======================
autoradio Tuning Guide
======================

This document attempts to provide a high-level overview of the
trade-offs involved in tuning the free parameters of an autoradio
cluster. While autoradio works out of the box with the default
settings in testing environments, most real-world deployments will
require some tuning.


Etcd
----

The default settings for etcd are tuned for a local (LAN) network
environment. In a geographically distributed cluster, the default
timeouts are so low that it is unlikely that consensus will ever be
reached. You'll want to set both the peer heartbeat interval and the
election timeout to higher values: a reasonable heartbeat interval is
5x to 10x the maximum inter-node latency in your cluster, while the
election timeout should be at least 3 times the heartbeat interval.

With our etcd package, you can set these values in
``/etc/default/etcd`` (values are in milliseconds)::

    DAEMON_OPTS="--peer-heartbeat-interval=1000 --peer-election-timeout=3000"

Increasing the etcd timeouts causes a corresponding increase in the
time required to reach consensus and elect a new etcd master in case
of node failure. It is advisable to set the radiod master election
TTL to a value greater than the etcd peer election timeout.


Radiod timeouts
---------------

Similar latency considerations apply to the presence and
master-election protocols run by autoradio itself. These are
controlled by radiod's ``--heartbeat`` and ``--master-election-ttl``
command-line flags. For these time values, though, there are further
trade-offs to consider:

Presence
~~~~~~~~

The node presence heartbeat sets the lower bound on the time it takes
peers to discover that a node is down and stop sending client
requests to it.

It also determines how often node utilization is propagated to the
peers. This is less of a concern when query cost estimators are used
in the load-balancing policy (as they are by default).

Setting this value too low will, depending on the number of nodes in
the cluster, cause excessive churn on etcd, leading to unnecessary
intra-cluster network traffic. As a side effect of the churn, watches
on etcd data will expire more often (as the log position advances
beyond the allowed horizon), triggering more frequent reloads of the
full configuration and thus even more unnecessary network traffic and
additional load on etcd.


Master Election
~~~~~~~~~~~~~~~

The node master election timeout determines how quickly a source
(assuming it retries continuously on error) will be able to reconnect
to the cluster if the node that is currently the master becomes
unavailable.
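As a rough sketch of how these two values combine with the etcd
settings shown earlier: with an etcd peer election timeout of 3000 ms,
the radiod master election TTL should sit comfortably above 3
seconds, and the presence heartbeat should be high enough not to
cause etcd churn. The snippet below assumes that the radiod package
reads its options from ``/etc/default/radiod`` in the same way the
etcd package does, and that both flags take plain values in seconds;
neither assumption is guaranteed, so adapt the file name, units and
values to your installation::

    # Hypothetical /etc/default/radiod; values are illustrative only.
    # Presence heartbeat: low enough that dead nodes are noticed within
    # a few seconds, high enough to keep etcd churn down.
    # Master election TTL: comfortably above the etcd peer election
    # timeout (3 seconds in the example above).
    DAEMON_OPTS="--heartbeat=5 --master-election-ttl=10"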

Capacity
--------

One of the nice properties of the autoradio traffic control logic is
its ability to reject incoming traffic when the cluster reaches its
maximum capacity, to prevent overload and ensure that existing
connections are served reliably. This is of course only possible if
the capacity limits are set to match reality. Since these values
usually can't be guessed by autoradio, they must be set using
command-line arguments.

Autoradio models capacity along two separate dimensions: (outbound)
bandwidth and number of connected listeners. CPU and memory are not
included because their incremental cost per request is negligible.
Limits can be set separately for each node in the cluster by passing
the ``--bwlimit`` and ``--max-clients`` command-line flags to
``radiod``.

The traffic control logic is then able to use utilization metrics to
decide where to send traffic. For details on how this is done, and
how to control it, check the Go source documentation for the
``fe/lbv2`` package.

The default traffic control policy only checks the number of
listeners: it usually makes the most sense to express the global
cluster capacity in those terms (bandwidth, for instance, is hardly a
good metric in the presence of variable-bitrate streams). The
disadvantage is that finding the "real" maximum capacity numbers for
a given node might take some experimentation.
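As an illustration, suppose experimentation shows that a node with a
1 Gbit/s uplink can serve about 1000 concurrent listeners while
leaving roughly 20% of the bandwidth as headroom. A hypothetical
invocation might then look like the following (the flag names are the
ones documented above, but the unit syntax accepted by ``--bwlimit``
is an assumption; check the radiod documentation for the exact
format)::

    # Illustrative only: limits derived from measured node capacity.
    radiod --max-clients=1000 --bwlimit=800M

Since the default traffic control policy only looks at the number of
listeners, ``--max-clients`` is usually the value worth calibrating
first.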