Prometheus optimizations
Some rules take a long time to evaluate, most often because of the high cardinality of their source metrics.
We can try precomputing results to reduce alert evaluation time, or stripping down data collection at the source where possible. We'll use this issue to track the specific efforts. So far:
- node_systemd_unit_state is huge, which leads to high evaluation time of the SystemdUnitFailed alert
- node_systemd_unit_presence and node_systemd_unit_ok are huge for similar reasons as the previous one
- systemd_failed is a huge computed rule derived from node_systemd_unit_state that seems unused
- all the rules in rules_cpu.conf are expensive
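
As a rough illustration of the precomputation idea for the SystemdUnitFailed case, here is a minimal sketch. It assumes the Prometheus 2.x YAML rule format (the .conf file names suggest our rules may still use the older syntax, in which case this would need translating). The group name, the recorded series name, the thresholds, and the alert expression are all hypothetical; the real SystemdUnitFailed rule may look different.

```yaml
groups:
  - name: systemd_precomputed            # hypothetical group name
    interval: 1m                         # assumed evaluation interval
    rules:
      # Collapse the per-unit, per-state series into one series per instance.
      # The recorded series name is an assumption, not an existing rule.
      - record: instance:node_systemd_unit_state_failed:count
        expr: count by (instance) (node_systemd_unit_state{state="failed"} == 1)

      # The alert then reads the small precomputed series instead of the raw metric.
      - alert: SystemdUnitFailed
        expr: instance:node_systemd_unit_state_failed:count > 0
        for: 5m                          # assumed hold time
        labels:
          severity: warning              # assumed severity
        annotations:
          summary: "{{ $value }} systemd unit(s) failed on {{ $labels.instance }}"
```

Two caveats on this sketch: the recording rule still has to read the full node_systemd_unit_state series once per interval, so the gain comes mainly from the alert and any dashboards reusing the aggregated result; and aggregating by instance drops the unit name from the alert, which may or may not be acceptable.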