float issues
https://git.autistici.org/ai3/float/-/issues
Updated: 2024-03-14T09:36:11Z

X509 PKI CA renewal is broken
https://git.autistici.org/ai3/float/-/issues/148
Updated: 2024-03-14T09:36:11Z | Author: ale
When the PKI X509 CA (used for internal mTLS) expires, float will *not* re-generate all mTLS certificates.
This can be currently mitigated by running float with "-e force_renew_credentials=true" manually, which will forcefully regenerate all mTLS certificates (and restart the associated services/containers).

When multiple services on the same host use the same container image, only one gets restarted on update
https://git.autistici.org/ai3/float/-/issues/146
Updated: 2023-11-13T13:53:33Z | Author: ale
Likely a deduping issue with the Ansible task that calls float-pull-image?

Replace Elasticsearch with Clickhouse
https://git.autistici.org/ai3/float/-/issues/144
Updated: 2023-08-22T07:29:16Z | Author: ale
Clickhouse might be more suited to the low-resource use case and might generally scale better to the high-resources one - we'd lose Kibana, but there is not much there that can't be replaced by a simpler dashboarding / query UI.

Model data control flow in logs
https://git.autistici.org/ai3/float/-/issues/143
Updated: 2023-08-22T07:25:55Z | Author: ale
We're using syslog as the generalized transport for asynchronous messages, at least those that are expected to end up in a searchable database somewhere -- so it would be nice to be able to model these data flows explicitly (switching on *log_type* attribute, for instance?) and describe them in a way that float would understand, and configure the system accordingly.
In line with this thinking, it would be nice to be able to set up *log consumers* that are not searchable databases, for example for the purpose of *log watching* (for periodic / real-time analysis, or alerting)...

Tinc does not delete old host keys
https://git.autistici.org/ai3/float/-/issues/142
Updated: 2023-05-27T16:58:32Z | Author: ale
If a host is removed from the inventory, float will not remove its tinc host configuration file, which might cause conflicts in case of IP re-use etc.

Replace "zonetool" with "dnscontrol"
https://git.autistici.org/ai3/float/-/issues/141
Updated: 2023-03-07T09:10:52Z | Author: ale
https://github.com/StackExchange/dnscontrol

Add Crowdsec support
https://git.autistici.org/ai3/float/-/issues/139
Updated: 2023-03-07T09:11:02Z | Author: ale
The functionality of [crowdsec](https://www.crowdsec.net/) seems very interesting for the float reverse proxy, in particular the possibility to implement "milder" ban actions for rate limiting such as requiring a captcha (better than outright IP blocks).

Fix ordering of volume and dataset creation
https://git.autistici.org/ai3/float/-/issues/138
Updated: 2022-05-10T12:09:42Z | Author: ale
It seems that right now datasets run before volumes, so if you have them nested, the dataset directories will disappear on the first run (to be re-created on the second, inside the newly created volume).

Document correct usage of become_method
https://git.autistici.org/ai3/float/-/issues/135
Updated: 2022-02-09T10:04:48Z | Author: ale
So float actually needs to support two authentication scenarios:
* in test environments, log in as unprivileged user ("vagrant" usually), use "sudo" to escalate
* in production environments, log in as root, and we would like not to depend on "sudo" because it is not actually useful in our user-less model
So far we've relied on "sudo" being present by default in the test VM images (as it is in Vagrant's images), but production installations are different and we should have minimal, and explicit, dependencies there. A side effect of the current situation is that "become" won't actually work in production by default, because the default become_method is "sudo".

Use histograms in the mtail nginx program
https://git.autistici.org/ai3/float/-/issues/132
Updated: 2022-02-02T16:59:21Z | Author: ale
We're using the old-fashioned manual way to export latency histograms of HTTP requests, but the mtail version in Debian stable (3.0-rc43) probably now supports the "histogram" syntax (https://github.com/google/mtail/blob/main/docs/Programming-Guide.md#histograms) which is better.

docker-cleanup might remove images which are in use
https://git.autistici.org/ai3/float/-/issues/129
Updated: 2022-10-25T14:00:46Z | Author: ale
The docker-cleanup script (which just calls "docker system prune") can mistakenly remove images that are referenced by float systemd units, because of timing issues: if the container is restarted (or it is crash-looping for some external reason), docker-cleanup can "catch it" when it is not running, and will then proceed to remove its image. The situation becomes unrecoverable until ansible is run manually again.
I am afraid that we're going to need to write a custom replacement for "docker system prune" that incorporates knowledge about float containers, and will avoid cleaning them up by mistake.

Switch to --log-driver=journald
https://git.autistici.org/ai3/float/-/issues/126
Updated: 2022-01-31T17:54:32Z | Author: ale
This would allow us to get rid of the extra "podman" process in our systemd units (by running it with the -d option). The Podman version in Debian stable (3.0.1) unfortunately does not handle the --log-opt=tag=*foo* option, so it doesn't set the SYSLOG_IDENTIFIER tag properly and everything shows up in syslog as "conmon".
An alternative would be to use --log-driver=passthrough, which was implemented in late 2021 in https://github.com/containers/podman/pull/11390. The difference is that the journald driver adds additional metadata (container ID, etc.), which would be useful if we eventually found a way to get it to rsyslog...

Replace replds with replds2
https://git.autistici.org/ai3/float/-/issues/122
Updated: 2023-08-22T07:30:50Z | Author: ale
Provides a better model for ACL-controlled distribution of selected credentials.

Implement a sensible "service turndown" workflow
https://git.autistici.org/ai3/float/-/issues/113
Updated: 2021-09-29T15:57:26Z | Author: ale
Right now there's no "clean" way to turn down a service: if it is removed from services.yml, many artifacts (including running containers) will be left over and not cleaned up by float, because float doesn't touch things it does not "own".
A possible solution from the user's perspective could be: set enabled=false on the service metadata, and do a push to remove all associated artifacts. To implement this, we're going to need to modify the float.py plugin to not filter out 'enabled=false' services but to instead keep including them without any hosts assignment.

Simplify the HTTP log pipeline
https://git.autistici.org/ai3/float/-/issues/111
Updated: 2021-05-29T17:01:55Z | Author: ale
Now we have NGINX write traditional common-log-style logs, and then use mmlognorm in the log-collector to parse them into structured data. There is actually no need for this: we could have NGINX generate lumberjack-style logs (with the `@cee` tag) directly, since the log_format directive supports JSON escaping of variables. An example here: https://ahelpme.com/software/rsyslog/send-access-logs-in-json-to-elasticsearch-using-rsyslog/

Consider adding a "configuration file" abstraction
https://git.autistici.org/ai3/float/-/issues/105
Updated: 2021-04-26T12:18:25Z | Author: ale
While it is nice to offer the ability to configure containerized services via Ansible (because it allows arbitrary customization, besides being necessary for non-containerized services), it is true that the best practice envisions service-specific Ansible roles as only being responsible for generating some configuration files, possibly using templates.
It is then worth considering, with the intent of "hiding" Ansible as much as possible unless strictly necessary, if we could add a "configuration file" abstraction to the float service metadata, which would set up configuration files on the filesystem using Ansible templates. This would cover a lot of use cases, which would then no longer require an associated trivial Ansible role for configuration.
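A rough sketch of how such an abstraction might be expanded, written in Python for illustration only. The `config_files` key and its fields (`template`, `dest`, `owner`, `group`, `mode`) are hypothetical and not part of float's service metadata; the only real names are the `ansible.builtin.template` module and its `src`, `dest`, `owner`, `group` and `mode` parameters.

```python
def config_file_tasks(service_name, service):
    """Expand hypothetical 'config_files' metadata into Ansible template tasks.

    Each returned dict has the shape of a task that uses the
    ansible.builtin.template module; nothing is executed here.
    """
    tasks = []
    for entry in service.get("config_files", []):
        tasks.append({
            "name": f"Install {entry['dest']} for {service_name}",
            "ansible.builtin.template": {
                "src": entry["template"],
                "dest": entry["dest"],
                # Assumed defaults, chosen for the example only.
                "owner": entry.get("owner", "root"),
                "group": entry.get("group", "root"),
                "mode": entry.get("mode", "0644"),
            },
        })
    return tasks


# Hypothetical service metadata, as it could appear after loading services.yml.
example_service = {
    "config_files": [
        {"template": "myservice.conf.j2", "dest": "/etc/myservice/myservice.conf"},
    ],
}
example_tasks = config_file_tasks("myservice", example_service)
```

A service with no `config_files` entry simply yields an empty task list, so existing services would be unaffected under this assumption.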
One of the obvious downsides is that it makes for a lot of ugly YAML, but this can be partially mitigated by using includes (eventually splitting down service metadata to one-service-per-file or such).

Live dataset migration
https://git.autistici.org/ai3/float/-/issues/104
Updated: 2021-11-25T09:04:28Z | Author: ale
Currently the mechanism for migrating datasets is to restore the latest backup on the new host, which introduces a worst-case 1-day data loss. While this is more or less fine for most of the services currently in float (that can easily tolerate data loss), and it's the right thing to do when the original host has failed, it's kind of an ugly constraint to have if the original data is still "right there", and it would be much better to have the capability for live dataset migration.
This could easily be implemented as a global rsync service, though it would introduce an avenue for lateral data movement between hosts (once there is local root compromise). On the other hand, this is already possible via the backup system since we have automated transparent restores on different hosts by design.

Redesign DNS and service discovery integration
https://git.autistici.org/ai3/float/-/issues/102
Updated: 2021-09-29T14:27:00Z | Author: ale
There are a few issues to touch around DNS and how service discovery interacts with it:
1) getting rid of the static /etc/hosts
While very convenient, relying on /etc/hosts for service discovery has the extremely annoying consequence that the service containers do not see updates to it: to maintain correctness, we are forced to restart all containers whenever there is a change, which is disruptive.
Moving service discovery to use "real" DNS would solve this issue.
2) authoritative vs recursive DNS request flows
Right now the DNS server installed as part of the *frontend* role also operates as a caching nameserver. This is a practical optimization that ends up being a bit confusing: the purpose of the service, after all, is exclusively to serve authoritative DNS zones to the public.
DNS caching, and generally handling resolv.conf in a more structured way, is probably best handled as a separate, orthogonal configuration space: it should be possible to control its deployment and usage (from the "local caching everywhere" model to "dedicated service"), keeping an eye on the eventual possible integration of service discovery.
3) testing requirements
An argument against caching/authoritative separation is that the test environments rely on the *public* zones being reachable from inside the infrastructure (where the tests run), so we have to find a solution for that too.
Assignee: shammash

Unify volumes and datasets
https://git.autistici.org/ai3/float/-/issues/92
Updated: 2021-04-26T10:40:47Z | Author: ale
Volumes and datasets are two different concepts in float ("LVs" and "backups") but they both claim ownership to a directory associated with a service. We should at this point save the effort of creating this directory via Ansible, and introduce some different organization of service-associated concepts that suits us better.
One obstacle is that datasets can be something else other than paths, and it's nice to have a single view of the backup-able data. Perhaps we can reduce volumes to be simply LVs (no associated ownership, just a size), and make the "path"-type dataset the primary owner of the directory, with associated ownership and permissions (which we need in any case for restores).

Refactor tinc key management
https://git.autistici.org/ai3/float/-/issues/86
Updated: 2020-11-27T17:56:11Z | Author: ale
Currently the "tinc" Ansible role is possibly the last remaining case where we're using facts from other hosts: this can represent a problem if those other hosts are unreachable, and it's generally undesirable as it increases the complexity of the Ansible side of things.
Instead, we should treat it as another PKI (like the internal x509 one), and store the public keys on the controlling host, in the credentials repository.
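A minimal sketch of what that could look like, assuming the credentials repository holds one `<host>.pub` file per host. The directory layout, the function name and the shape of the `hosts` mapping are all hypothetical; the output follows the usual tinc 1.0 host file convention of an `Address` line followed by the public key.

```python
from pathlib import Path


def render_tinc_host_files(hosts, keys_dir):
    """Build tinc host file contents from public keys stored locally.

    hosts:    mapping of host name -> public address (hypothetical shape)
    keys_dir: directory in the credentials repository with <host>.pub files

    Returns (rendered, missing): host file contents keyed by host name,
    and the names of hosts that have no stored public key yet.
    """
    rendered, missing = {}, []
    for name, address in sorted(hosts.items()):
        key_path = Path(keys_dir) / f"{name}.pub"
        if not key_path.is_file():
            # No fact gathering from the remote host: just report the gap.
            missing.append(name)
            continue
        public_key = key_path.read_text().strip()
        rendered[name] = f"Address = {address}\n\n{public_key}\n"
    return rendered, missing
```

Since everything is read from the controlling host, an unreachable peer would no longer block rendering the configuration for the others; hosts listed in `missing` would need a key generated first.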