Prometheus optimizations

Some rules take a long time to evaluate, most often because of the high cardinality of the source metrics.

We can try precomputing expensive expressions (recording rules) to reduce alert evaluation time, or collecting less data at the source where possible. We'll use this issue to track the specific efforts. So far:

node_systemd_unit_state is huge, which makes the SystemdUnitFailed alert slow to evaluate
node_systemd_unit_presence and node_systemd_unit_ok are huge for the same reason
systemd_failed is a huge computed rule derived from node_systemd_unit_state, and it seems to be unused
all the rules in rules_cpu.conf are expensive
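
As a rough sketch of the precomputation idea for the first item, a recording rule could collapse node_systemd_unit_state to one series per host, and the alert could then read that small series instead of the raw metric. The group name, the recorded series name, the `for` duration and the alert expression below are all assumptions made for illustration; the actual SystemdUnitFailed definition is not shown in this issue and may differ. Note that aggregating by instance drops the unit name, so this only fits if the alert does not need to say which unit failed.

```yaml
groups:
  - name: systemd_precomputed          # hypothetical group name
    rules:
      # Precompute the number of failed units per host, so that
      # consumers read one series per instance rather than one
      # series per (unit, state) pair.
      - record: instance:node_systemd_unit_failed:count   # hypothetical series name
        expr: count by (instance) (node_systemd_unit_state{state="failed"} == 1)

      # Assumed shape of the alert; the real SystemdUnitFailed rule may differ.
      - alert: SystemdUnitFailed
        expr: instance:node_systemd_unit_failed:count > 0
        for: 10m                       # illustrative duration
```

The recording rule itself still has to scan the raw metric once per evaluation interval, so the gain comes from doing that scan in a single place; reducing the number of series collected at the source would cut that cost as well.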

Edited Jun 11, 2019 by ale