From d6024274684db0ec48a239b4731a4cfe7b15d8ff Mon Sep 17 00:00:00 2001
From: ale <ale@incal.net>
Date: Sat, 3 Jan 2015 11:07:26 +0000
Subject: [PATCH] documentation on tunable parameters

---
 TUNING.rst | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 TUNING.rst

diff --git a/TUNING.rst b/TUNING.rst
new file mode 100644
index 00000000..d488d0df
--- /dev/null
+++ b/TUNING.rst
@@ -0,0 +1,100 @@

======================
autoradio Tuning Guide
======================

This document attempts to provide a high-level overview of the
trade-offs involved in tuning the free parameters of an autoradio
cluster. While autoradio works out of the box with the default
settings in testing environments, most real-world deployments will
require some tuning.


Etcd
----

The default settings for etcd are tuned for a local (LAN) network
environment. In a geographically distributed cluster, the default
timeouts are so low that it is unlikely that consensus will ever be
reached. You'll want to set both the peer heartbeat interval and the
election timeout to higher values: a reasonable heartbeat interval is
5x to 10x the maximum inter-node latency in your cluster, while the
election timeout should be at least 3 times the heartbeat interval.

With our etcd package, you can set these values in
``/etc/default/etcd`` (values are in milliseconds)::

    DAEMON_OPTS="--peer-heartbeat-interval=1000 --peer-election-timeout=3000"

Increasing the etcd timeouts causes a corresponding increase in the
time required to reach consensus and elect a new etcd master in case
of node failure. It is advisable to set the radiod master election
TTL to a value greater than the etcd peer election timeout.


Radiod timeouts
---------------

Similar latency considerations apply to the presence and
master-election protocols run by autoradio itself. These are
controlled by radiod's ``--heartbeat`` and ``--master-election-ttl``
command-line flags. For these time values, though, there are further
trade-offs to consider:

Presence
~~~~~~~~

The node presence heartbeat sets the lower bound on the time it takes
peers to discover that a node is down and stop sending client
requests to it.

It also determines how often node utilization is propagated to the
peers. This is less of a concern when query cost estimators are used
in the load-balancing policy (as they are by default).

Setting this value too low will, depending on the number of nodes in
the cluster, cause excessive churn on etcd, leading to unnecessary
intra-cluster network traffic. As a side effect of the churn, watches
on etcd data will expire more often (as the log position advances
beyond the allowed horizon), triggering more frequent reloads of the
full configuration and thus even more unnecessary network traffic and
additional load on etcd.


Master Election
~~~~~~~~~~~~~~~

The node master election timeout determines how quickly a source
(assuming it retries continuously on error) will be able to reconnect
to the cluster if the node that is currently the master becomes
unavailable.
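As a rough sketch of how these two values combine with the etcd
settings shown earlier: with an etcd peer election timeout of 3000 ms,
the radiod master election TTL should sit comfortably above 3
seconds, and the presence heartbeat should be high enough not to
cause etcd churn. The snippet below assumes that the radiod package
reads its options from ``/etc/default/radiod`` in the same way the
etcd package does, and that both flags take plain values in seconds;
neither assumption is guaranteed, so adapt the file name, units and
values to your installation::

    # Hypothetical /etc/default/radiod; values are illustrative only.
    # Presence heartbeat: low enough that dead nodes are noticed within
    # a few seconds, high enough to keep etcd churn down.
    # Master election TTL: comfortably above the etcd peer election
    # timeout (3 seconds in the example above).
    DAEMON_OPTS="--heartbeat=5 --master-election-ttl=10"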

Capacity
--------

One of the nice properties of the autoradio traffic control logic is
its ability to reject incoming traffic when the cluster reaches its
maximum capacity, to prevent overload and ensure that existing
connections are served reliably. This is of course only possible if
the capacity limits are set to match reality. Since these values
usually can't be guessed by autoradio, they must be set using
command-line arguments.

Autoradio models capacity along two separate dimensions: (outbound)
bandwidth and number of connected listeners. CPU and memory are not
included because their incremental cost per request is negligible.
Limits can be set separately for each node in the cluster by passing
the ``--bwlimit`` and ``--max-clients`` command-line flags to
``radiod``.

The traffic control logic is then able to use utilization metrics to
decide where to send traffic. For details on how this is done, and
how to control it, check the Go source documentation for the
``fe/lbv2`` package.

The default traffic control policy only checks the number of
listeners: it usually makes the most sense to express the global
cluster capacity in those terms (bandwidth, for instance, is hardly a
good metric in the presence of variable-bitrate streams). The
disadvantage is that finding the "real" maximum capacity numbers for
a given node might take some experimentation.
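As an illustration, suppose experimentation shows that a node with a
1 Gbit/s uplink can serve about 1000 concurrent listeners while
leaving roughly 20% of the bandwidth as headroom. A hypothetical
invocation might then look like the following (the flag names are the
ones documented above, but the unit syntax accepted by ``--bwlimit``
is an assumption; check the radiod documentation for the exact
format)::

    # Illustrative only: limits derived from measured node capacity.
    radiod --max-clients=1000 --bwlimit=800M

Since the default traffic control policy only looks at the number of
listeners, ``--max-clients`` is usually the value worth calibrating
first.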