Need a way to customize ProbeFailure alert timeouts for custom probers
It's useful to have a generic ProbeFailure alert that will automatically cover additional blackbox probes, set up via our Prometheus customization hooks. Unfortunately not all probes are equal, and it makes sense to have different probe alerting intervals for different probe "types".
In fact, we already do this in ai3/config by "cheating": the ProbeFailure alert rule explicitly excludes the probeset=service metrics, and ai3/config provides its own ProbeFailure alert rule for probeset=service. This works only because we know that our custom service-prober exports metrics with the probeset=service label. The mechanism is very custom and not easily extended, for two reasons:
- "service" is not a special value, it just happens to be what the service-prober is using
- the service-prober "probeset" label is actually hard-coded in the binary itself, so for instance there's no way to set a custom probeset label for our other custom prober (ai3-prober)
We also probably don't want to model the entire prober service configuration in float, so we'll have to find the simplest possible way, compatible with how the prometheus prober extensions are handled now, to let users define their own custom probe alert timeouts. Maybe it makes sense to re-use the "probeset" label here, in which case we'll need:
- the ability to set custom labels on custom probers (probeset)
- the ability to set the list of probesets that should be excluded from the default ProbeFailure alert
- (maybe, though it might be stretching it a bit) a list of probeset -> timeout pairs, used to automatically generate the different ProbeFailure alert rules, avoiding the need to provide a copy&pasted version
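
To make the last two points concrete, here is a rough sketch of what the generation step could look like. It is only an illustration, not how float works today: the function name, the 5m default, and the use of the blackbox exporter's `probe_success` metric in the expression are all assumptions, and the real ProbeFailure rule may well be built on a different (e.g. aggregated) metric.

```python
import json
import re

# Hypothetical default "for" duration; the real default may differ.
DEFAULT_TIMEOUT = "5m"

# Restrict probeset names to characters that need no escaping in a
# PromQL regex matcher, so they can be joined with "|" safely.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def probe_failure_rules(probeset_timeouts, default_timeout=DEFAULT_TIMEOUT):
    """Build ProbeFailure alert rules from a probeset -> timeout mapping.

    Returns a list of rule dicts: first the default rule, which excludes
    every probeset that has a custom timeout, then one rule per probeset.
    The "probe_success" metric is an assumption made for this sketch.
    """
    for name in probeset_timeouts:
        if not _SAFE_NAME.match(name):
            raise ValueError("unsupported probeset name: %r" % name)

    names = sorted(probeset_timeouts)

    # Default rule: match everything except the customized probesets.
    selector = '{probeset!~"%s"}' % "|".join(names) if names else ""
    rules = [{
        "alert": "ProbeFailure",
        "expr": "probe_success%s < 1" % selector,
        "for": default_timeout,
    }]

    # One dedicated rule per probeset, each with its own timeout.
    for name in names:
        rules.append({
            "alert": "ProbeFailure",
            "expr": 'probe_success{probeset="%s"} < 1' % name,
            "for": probeset_timeouts[name],
        })
    return rules


if __name__ == "__main__":
    # JSON is valid YAML, so this could be dropped into a rules file.
    print(json.dumps(probe_failure_rules({"service": "10m"}), indent=2))
```

With something like this, users would only provide the probeset -> timeout pairs, and the exclusion list for the default rule would be derived from the same data rather than maintained separately.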