float Reference
Table of contents
- float Reference
- Services
- Infrastructure Part 1: Base Layer
- Infrastructure Part 2: Cluster Services
- Configuration
- Operations
- Example scenarios
Services
The fundamental concept in float is the service, a loose articulation of compute elements (containers, system-level daemons) and data, with standardized metadata properties that float uses to integrate it with its infrastructure.
A description of the service (its specification) is provided to float as part of its configuration. Float can manage an arbitrary number of services on a single infrastructure. The mapping between a "float service" and a "high-level, user-visible service" is quite often not one-to-one: float services are meant to describe the high-level service's internal architecture, which can be made up of multiple components.
In float, a service is also a schedulable unit: the service specification is in fact a template for specific service instances, of which there can be again more than one; consider for example the case of service replication for reliability / high-availability purposes. Float will never schedule more than one instance of a service on each host.
The decision to assign an instance of a service to a specific host is called scheduling, and it is completely controlled by float based on parameters of the service specification: it is possible to control the desired number of instances, and to restrict the choice of possible hosts by using host groups (leveraging the Ansible host group concept). The operator has no control on the specific assignments beyond that, and they may change at any time.
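As a concrete sketch (the service and group names are made up; num_instances and scheduling_group are the scheduling attributes referred to here and later in this document), a replicated service restricted to a host group might be described like this:
services.yml
myservice:
  num_instances: 2
  scheduling_group: backend
  containers:
    - name: http
      image: registry.example.com/myservice:stable
Float would then pick two hosts from the backend Ansible group and create one instance of myservice on each of them.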
Float's scheduler is not particularly smart. It does not perform any kind of bin-packing (it only looks at instance counts), and, most importantly, it is offline-only: scheduling is only performed when Ansible runs, there is no online component to rebalance instances when they fail or to react to changes in load etc.
The float scheduler produces stable and reproducible results, which is very important given how it is meant to be used. The randomness used by the scheduling algorithm is seeded on the configuration itself, so two people running float on the same configuration on two different machines will obtain identical assignments.
Compute units
Compute units in float are long-running processes controlled via systemd. Float creates a systemd unit for every service instance on the hosts where it is supposed to run.
It is possible to specify more than one compute unit for the same service: in this case, they will all be scheduled together as part of the same service instance, and they will be reachable under the same service name (presumably using different network ports for communication).
Structuring services in terms of compute units
To illustrate one possible approach to subdividing a service into fundamental compute units, let's walk through an example scenario, the reasoning behind some possible choices, and how they relate to concepts in float.
Let's consider as an example a fairly typical two-layer service that uses Apache + PHP to serve a website, ignoring a possible MySQL layer for now. This is a request-based service, so some of the considerations we're going to make will be specific to this perspective.
There are two major possibilities for representing such a service within float (see the sketch after this list):
- a float service "apache" and another float service "php", which may potentially be scheduled on different hosts, and that talk to each other over the service boundary: apache finds php endpoints using float's service discovery mechanism (i.e. DNS);
- a single float service "web", with apache and php bundled together
as a single unit of deployment, where apache talks to php inside
the service boundary (i.e. it connects to localhost). This scenario
can be further split into two:
- the float service "web" consists of an apache container and a separate php container, each runs independently of the other, they talk to each other either over the network (on localhost), or via an explicitly shared mechanism on the host (for instance a shared /run/web/sockets directory);
- the float service "web" consists of a single container that bundles together apache and php, maybe here they talk to each other via a /run/web/sockets directory that is completely internal to the container itself.
Obviously the first problem to solve is that abstractions must make sense to you and to the specific problem you're solving. Here the "apache" and "php" components were pretty obvious choices for the two-layer service we were considering.
The second thing to consider in terms of float architecture is what we want the request flow to be: how, specifically, each component in our service stack is supposed to talk to the following one as the request flows downstream through the layers. Including float's reverse proxy layer in the picture, the conceptual flow is quite simple:
reverse proxy
|
V
apache
|
V
php
These components may be scheduled on different hosts (or not), so one thing to consider is what the latency at each step will be. Generally, as you move down the service stack, there is also a fan-out factor to consider: think, for instance, of a PHP script making multiple MySQL requests.
The choice of representation depends on a number of different criteria and decisions, of which we'll name a few:
- A good question to consider is "what kind of actions do you want to take in order to scale your service?" Maybe you run a datacenter, servers are just bare compute capacity for you, and you can just add new ones when apache or php look busy, independently, in which case you'd go towards scenario #1. Or perhaps your service is data-partitioned, and adding a new server means moving some of your data to it, in which case it would make sense to co-locate apache and php with the data, which makes scenario #2 look more suitable.
- If your service is distributed among hosts in different locations, you might like scenario #2 more, as it confines the inter-host latency to the reverse proxy -> apache hop.
- For scenario #2, to decide amongst its two variants, another good question is "how do you like to build your containers?" This is a release engineering topic that depends on your CI, on what your upstreams look like, etc.
In terms of container bundling (#2.1 vs #2.2 above), we like our containers to do "one thing", for whatever definition of "thing" you find useful (provide a service, for example), so we run an init daemon inside our containers to differentiate between important processes, which control the lifecycle of the container itself, and non-important ones that can simply be restarted whenever they fail.
For instance, let's consider a hypothetical mailing list service: this has at least two major inbound APIs, a SMTP entry point for message submission, and an HTTP API for mailing list management. These are implemented by separate processes. We also want to run, say, a Prometheus exporter, yet another separate process: but we don't particularly care about its fate as long as it is running, and anyway monitoring will tell us if it's not running, so this process is "less important" than the first two. We would have these three processes in a single container, with the first two marked as "important" (i.e. the container will terminate when they exit, signaling a failure to float through systemd and monitoring), while the exporter would be marked as not-important and simply silently restarted whenever it fails.
Containers
The primary form of compute unit in float is a container. Float will automatically download the images and run the containers specified in the service description.
Though it is possible to run all kinds of container images, float is explicitly tuned towards a very particular kind of image, one that:
- logs to standard output/error
- can run as an arbitrary (not known in advance) user ID, and does not require running as uid 0 inside the container
- can run from a read-only root filesystem image (except for the usual /tmp, /run, etc.)
- can be configured via environment variables
Such containers will result in the least amount of additional configuration in the service description.
Float can use either Docker or Podman for running containers, though on modern systems (Debian buster and up) it will choose Podman.
System-level daemons
Since float controls services by representing them as systemd units, it is also possible to create services that are made of system-level processes, i.e. "normal" systemd units.
This is convenient, for instance, when you are migrating an existing infrastructure to float, and you want to control the pace of containerization of your services: if you can describe your service in terms float understands, you can continue to configure it at the system level using Ansible and at the same time take advantage of float's infrastructural services.
Networking
Compute units in float share the host network namespace. Network daemons are normally expected to bind to 0.0.0.0, while float manages the firewall to prevent unauthorized access.
Since the service discovery mechanism provides no way to do port resolution, all instances of a service are expected to use the same ports.
Float supports automatic provisioning of the firewall for TCP ports, in the following way (see the sketch after this list):
- ports specified in a container description can be reached by other containers running on the same host, specifically other containers that are part of the same service, over localhost;
- ports specified in the service description can be reached by other hosts (over the float infrastructure internal network overlay);
- ports that are part of the service's public endpoints can be reached by the hosts running float's frontend reverse proxies, over the internal network overlay;
- ports that are part of the service's monitoring endpoints can be reached by the hosts where the monitoring scrapers are running, over the internal network overlay.
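A hedged sketch of how these different scopes could appear in a service specification; the exact attribute names for the container-level and service-level port lists and for the monitoring endpoints are assumptions based on the terminology used in this document, and should be checked against the actual float schema:
services.yml
myservice:
  containers:
    - name: http
      image: registry.example.com/myservice:stable
      ports:
        - 8080          # container port: other containers of this service, on localhost
  ports:
    - 8081              # service port: other hosts, over the internal network overlay
  public_endpoints:
    - name: myservice   # routed through the frontend reverse proxies
      port: 8080
  monitoring_endpoints:
    - port: 8081        # scraped by the monitoring collectors, over the overlay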
In no case will float allow public access to a service. If this is desired, or if a service requires unsupported networking configuration (UDP, other protocols, etc.), it has to be achieved by adding the relevant firewall configuration snippet manually via the service's Ansible role.
Users and permissions
For isolation purposes, float will create a dedicated UNIX user and group for every service on the hosts where its instances run. For historical reasons, this user will be called docker-<servicename> (e.g. docker-myservice).
The containers will be run as this user, unless explicitly configured otherwise.
If you need to share data with the container, for instance by mounting data volumes, use this user (or group) for filesystem permissions.
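For example, the service's Ansible role might create a data directory owned by that user (path and service name are hypothetical):
roles/myservice/tasks/main.yml
- name: Create the myservice data directory
  file:
    path: /var/lib/myservice
    state: directory
    owner: docker-myservice
    group: docker-myservice
    mode: 0750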
If the service has any service_credentials, a dedicated UNIX group will be created for each of them, named <credentials-name>-credentials, and the service user will be a member of it.
Data
Datasets allow you to describe data that is attached to a service: this information will be used to automatically configure the backup system. A dataset is either a local filesystem path, or something that can be dumped/loaded via a pipe. It is associated with every instance of the service, so it usually identifies local data; this effectively assumes a partitioned service by default. Master-elected services can instead use the on_master_only option to back up global, replicated datasets only once (on the service master host).
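A sketch of a dataset declaration, using the on_master_only option mentioned above; the names of the other attributes (path, backup_command, restore_command) are assumptions based on the description here and should be verified against the float service schema:
services.yml
mydb:
  ...
  datasets:
    - name: data
      path: /var/lib/mydb            # a local filesystem path, backed up per instance
    - name: dump
      backup_command: mydb-dump      # a pipe-based dataset, dumped/loaded via commands
      restore_command: mydb-restore
      on_master_only: true           # back up only once, on the elected master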
Backups
If provided with credentials for an external data repository, float will automatically make backups of your configured datasets. Float runs its own backup management system (tabacco) on top of Restic, which adds metadata to Restic snapshots to map them back to float datasets.
When a service is scheduled on a new host, for instance as a result of a re-scheduling, float will attempt to restore the associated datasets from their backups. While this is not a practical failover solution for complex services, we've found it works pretty well for a category of services with "important, but small - can afford to lose one day of changes" datasets that is quite common and useful in itself. For these services, running with num_instances=1 and counting on the backup/restore data move mechanism might provide sufficient availability and reliability.
Restores can of course also be triggered manually whenever necessary.
Volumes
Volumes represent LVM logical volumes (LVs) that are associated with a service. They are managed by float, which makes sure that they are present on the hosts. Volumes are currently never removed, because we're scared of deleting data.
SSL Credentials
In the spirit of separation between internal and user-facing concerns, float offers both an internal X509 PKI for mutual service authentication, and an integration with ACME services such as Letsencrypt for user-facing SSL certificates.
Internal mTLS PKI
Service communication should be encrypted, and communicating services should authenticate each other. One of the ways to do this is with TLS as the transport layer. Float provides its own service PKI to automatically generate X509 credentials for all services.
The X509 certificates are deployed on the host filesystem, and access to them is controlled via UNIX permissions (using a dedicated group, which the service user is a member of). This provides an attestation of UNIX identity across the whole infrastructure.
Each service, in services.yml, can define multiple credentials, each with its own name and attributes: this can be useful for complex services with multiple processes, but in most cases there will be just a single credential, with the same name as the service. When multiple credentials are used, all server certificates will have the same DNS names (those associated with the service), so it's unusual to have multiple server credentials in a service specification.
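In the common case this reduces to a single credential named after the service, along these lines (a minimal sketch; service_credentials is the attribute mentioned earlier in this document):
services.yml
myservice:
  ...
  service_credentials:
    - name: myservice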
Credentials are saved below /etc/credentials/x509/<name>, with the following structure:
/etc/credentials/x509/<name>/
+-- ca.pem CA certificate for the service PKI
+-- client/
| +-- cert.pem Client certificate
| \-- private_key.pem Client private key
\-- server/
+-- cert.pem Server certificate
\-- private_key.pem Server private key
Private keys are stored unencrypted, and are only readable by the <name>-credentials group. The user that the service runs as must be a member of this group.
Server certificates will include all the names and IP addresses that service backends are reachable as. This includes:
- service_name.domain
- service_name
- hostname.service_name.domain
- hostname.service_name
- shard.service_name.domain (if present)
- fqdn
- localhost
- all public IP addresses of the host
- all IP addresses of the host on its network overlays
The purpose is to pass server name validation on the largest number of clients possible, without forcing a specific implementation.
Client certificates have the following names (note that they use the credential name, not the service name):
- name.domain
- name
Using multiple client credentials for a single service might allow ACL separation in complex services.
Most legacy services should be able to implement CA-based client certificate validation, which at least protects the transport from unprivileged observers. Some services can additionally validate the client certificate CN, which implements a form of distributed UNIX permission check (proof that the client had access to a specific certificate), and is therefore preferable.
Public credentials
Float runs an ACME client to generate valid SSL certificates for public-facing HTTPS domains associated with a service.
Since these SSL certificates are relatively short-lived, the ACME mechanics run online on the target infrastructure: certificates are continuously renewed, not only when you run Ansible.
SSL certificates are normally only consumed by the frontend float service, where incoming traffic is SSL-terminated by the traffic routers; internal services run with certificates from the internal PKI for mutual authentication with the traffic routers. However, this is only the case for HTTP-based services: float does not currently offer SSL termination for other protocols, so those SSL connections are forwarded directly to the backend service, which then needs access to the public SSL certificates. A dedicated mechanism is provided so that a service can "request" a local copy of the certificates, and be reloaded when they are updated.
Configuration
Most services won't be configurable just with environment variables, and are going to require some sort of configuration file. Float has no facilities for specifying configuration file contents in the service description metadata itself: this responsibility is delegated to Ansible. An Ansible role, associated with the service, should be used to create the necessary configuration files and other required system-level setup for the service.
services.yml
myservice:
containers:
- name: http
image: myservice:stable
volumes:
- /etc/myservice.conf: /etc/myservice.conf
roles/myservice/tasks/main.yml
- template:
src: myservice.conf.j2
dest: /etc/myservice.conf
group: docker-myservice
mode: 0640
roles/myservice/templates/myservice.conf.j2
# Just an example of an Ansible template, with no particular meaning.
domain={{ domain }}
The Ansible role then needs to be explicitly associated to the hosts running the service instances via the Ansible playbook (unfortunately float can't automatically generate this association itself):
- hosts: myservice
roles:
- myservice
This takes advantage of the fact that float defines an Ansible group for each service (with the same name as the service itself), which includes the hosts that the service instances have been scheduled on. Note that since Ansible 2.9, the group names will be "normalized" according to the rules for Python identifiers, i.e. dashes will be turned into underscores.
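For example, a service declared as my-service in services.yml is addressed as the my_service group in the playbook on those Ansible versions (the role name here is hypothetical):
playbook.yml
- hosts: my_service
  roles:
    - my-service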
On the Ansible requirement
Does the above mean you have to learn Ansible in order to use float? Should you be concerned about investing effort into writing a configuration for your service in yet another configuration management system's language? The answer is yes, but to a very limited extent:
- You do need knowledge of how to set up an Ansible environment: the role of ansible.cfg, how to structure group_vars, etc. Writing a dedicated configuration push system for float was surely an option, but we preferred relying on a popular existing ecosystem for this, both for convenience of implementation and also to allow a migration path of co-existence for legacy systems. To counter-balance, float tries to keep its usage of Ansible as limited as possible, to allow eventual replacement.
- Most services will only need an extremely simple Ansible role to generate the service configuration, normally a mix of template and copy tasks, which are possibly the most basic functionality of any configuration management system. This should guarantee a certain ease of portability to other mechanisms, should one decide to migrate away from float. Besides, it is a good sanity check: if your service requires complicated setup steps, perhaps it might be possible to move some of that complexity inside the service containers.
To emphasize portability, it might be wise to adhere to the following rules when writing Ansible roles:
- Try to use only copy, file and template tasks, rather than complex Ansible modules;
- avoid using complex conditional logic or loops in your Ansible tasks;
- keep the configuration "local" to the service: do not reference other services except using the proper service discovery APIs (DNS), do not try to look up configuration attributes for other services (instead make those into global configuration variables);
- do not use facts from other hosts that need to be discovered (these break if you are not using a fact cache when doing partial runs): instead, define whatever host attributes you need, explicitly, in the inventory;
More generally, the integration with Ansible as the underlying configuration management engine is the "escape hatch" that allows the implementation of setups that are not explicitly modeled by float itself.
Infrastructure Part 1: Base Layer
We can subdivide what float does into two separate parts: operations and services affecting every host, the so-called "base" layer of infrastructure, and the fundamental services that make up the "cluster-level" infrastructure (logging, monitoring, authentication, etc.): the latter are part of float but run on top of the base layer as proper services, with their own descriptions and Ansible roles to configure them.
Note that, in its default setup, float assumes a two-tier service topology, with "frontend" hosts handling traffic routing in a stateless fashion, and "backend" hosts running the actual services. The services.yml.default service description file expects the frontend and backend Ansible groups to be defined in your inventory. However, this is just the default, and there is nothing inherent in float that limits you to this kind of topology.
Service Discovery
"How do services find and talk to each other" is a fundamental aspect of any infrastructural platform. Float offers the following features:
- The ability to set up overlay networks to isolate service-to-service traffic from the public Internet.
- Services find each other with DNS A / AAAA lookups, so the client must know the target port. As a consequence, each service must use a globally unique port. This also implies that it's impossible to schedule more than one instance of a service on each host.
- DNS views are used to provide topology-aware service resolution, so that hosts sharing a network overlay will route service requests over that network.
- Connections between services are direct, not mediated by proxies, so there is no global load balancing and clients are expected to keep track of the state of backends and implement retry policies.
- Services can securely authenticate each other by using credentials automatically provided by the service mesh.
Float's implementation of this mechanism is very simple: it is based on writing static entries to /etc/hosts, and it is fundamentally limited in the number of services and hosts it can support.
Naming
Services are identified by their name, an alphanumeric string (it can also include dash '-' characters).
All DNS entries are served under an internal domain (the domain configuration variable).
Every host has its own view of the DNS map. The specific IP addresses associated with a target service instance will depend on whether the source and target host share any network overlays, which will be used in preference to the public IP address of the backend host.
Locating service backends
The access patterns to backends of a distributed service vary depending on the service itself: for instance, with services that are replicated for high-availability, the client usually does not care which backend it talks to. In other cases, such as with partitioned services, clients need to identify individual backends.
We provide three ways of discovering the IP address of service backends. The port must be known and fixed at the application level.
Note that in all cases, the DNS map returns the configured state of the services, regardless of their health. It is up to the client to keep track of the availability status of the individual backends.
All backends
The DNS name for service.domain results in a response containing the IP addresses of all configured backends for service.
$ getent hosts myservice.mydomain
1.2.3.4
2.3.4.5
3.4.5.6
Note that due to limitations of the DNS protocol, not all backends may be discovered this way. It is however expected that a sufficient number of them will be returned in the DNS response to make high availability applications possible. If you need the full list of instances, it is best to obtain it at configuration time via Ansible.
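For instance, a configuration template in the service's Ansible role can enumerate all scheduled backends through the per-service Ansible group described in the Configuration section (the template name and the output format here are made up for illustration):
roles/myservice/templates/backends.conf.j2
{# Illustrative only: one entry per host the "myservice" instances are scheduled on. #}
{% for host in groups['myservice'] %}
backend {{ host }}
{% endfor %}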
Individual backends
Each service instance has a name that identifies it specifically, obtained by prepending the (short) host name to the service name:
$ getent hosts host1.myservice.mydomain
1.2.3.4
This is the hostname that the instance should use to advertise itself to its peers, if the service requires it.
Shards
Backends can also have permanent shard identifiers, which identify a specific backend host and do not change on reschedules. These are useful when a service is partitioned across multiple backends and the hosts have state or data associated with them. A shard identifier is an alphanumeric literal, specific to the host.
$ getent hosts shard1.myservice.mydomain
1.2.3.4
Master-elected services
When a service uses master election, an instance is automatically picked at configuration time to be the master of the service. This instance will be discoverable along with the other instances when resolving the service name. In addition, the special DNS name service-master.domain will point at it:
$ getent hosts myservice-master.mydomain
2.3.4.5
Network Overlay
It is possible to define internal networks that span multiple hosts, called overlays, which can then be used for service-to-service traffic, ignoring the details of the actual underlying public network topology.
For now, only a single IPv4 address can be assigned to a host on each private network. In the future, it should be possible to assign an entire subnet, so that individual IPs will be available to services.
The list of network overlays is part of the global float configuration; to make a host participate in a network, simply define an ip_<network-name> attribute for that host in the Ansible inventory, whose value should be the desired IP address.
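For example, assuming an overlay named vpn0 (a made-up name), a YAML inventory could assign addresses like this; an INI-style inventory would use the equivalent "host1 ip_vpn0=10.10.0.1" form:
hosts.yml
all:
  hosts:
    host1:
      ip_vpn0: 10.10.0.1
    host2:
      ip_vpn0: 10.10.0.2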
The current implementation of private networking uses tinc and sets up a fully connected mesh between participating hosts. The result is robust and has limited performance overhead.
When the client and server hosts are on the same private network, the DNS-based service discovery will return the server's address on that private network, ensuring that service-to-service communication goes over the VPN.
Traffic Routing
While it's possible to configure it to do otherwise, float assumes that your services will run on its isolated, internal private networks, and it provides a mechanism to expose them publicly and route external traffic to the correct backend processes.
In the expected setup, one or more hosts should be dedicated to running the built-in frontend service (usually by setting up a host group and setting the service scheduling_group accordingly). Such hosts will have their public IP addresses advertised to the world via DNS. The frontend service runs a set of request routers, or reverse proxies (NGINX and HAproxy), to route requests to the correct service backends.
High-level traffic flow
Float uses a basic two-tier model for serving requests, with a reverse proxy layer between users and the backend applications. Traffic to the reverse proxies themselves (hosts running the frontend service) is controlled via DNS: float automatically creates low-TTL DNS records for its public endpoints. This has all the usual caveats of using DNS for this purpose, and it isn't really meant as a precise load-balancing layer.
Reliability is then provided by having multiple backends for the application itself: the reverse proxies will find one that works. It is important to note that, at the moment, float provides no accurate load-balancing whatsoever, just basic round-robin or random-selection: in NGINX, proper load balancing mechanisms are a paid feature.
HTTP
The infrastructure provides a way for HTTP-based services to expose themselves to the public Internet by defining public endpoints. The public HTTP router (NGINX) will be automatically configured based on such service metadata.
The clients of this service are users (or generically, external clients), not other services, which should instead talk directly to each other.
The public HTTP router will force all incoming requests to HTTPS.
For implementation details, see the nginx Ansible role README.
SSL Certificates
Float will automatically generate SSL certificates for the required public domain names. However, on first install, to ensure that NGINX can start while the ACME automation acquires the valid certificates, it will set up self-signed certificates, and switch to the ACME ones when they are available.
HTTP Cache
A global HTTP cache is available for services that require it.
NGINX will set the X-Cache-Status header on responses, so you can check if the response was cached or not.
The cache TTL is low (10 minutes), and there is currently no mechanism to explicitly purge the cache.
Controlling incoming HTTP traffic
The public HTTP router offers the possibility to block incoming requests based on their User-Agent (to ban bots, etc), or based on the URL they are trying to access. The latter is often required for regulatory compliance.
There is documentation of this functionality in the README files below the roles/float-infra-nginx/templates/config/block/ directory.
Non-HTTP
It is also possible to route arbitrary TCP traffic from the frontend hosts to the service backends. In this case, the proxy will not terminate SSL traffic or otherwise manipulate the request. The original client IP address will be unavailable to the service.
Define public_tcp_endpoints for a service to enable this feature.
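A sketch of such a declaration (the per-entry fields are an assumption, modeled on the public_endpoints examples elsewhere in this document; service name and port are illustrative):
services.yml
smtp:
  ...
  public_tcp_endpoints:
    - name: smtp
      port: 25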
Note that there is no functionality for reverse proxying UDP services: in this scenario you are probably better off scheduling your UDP service directly on the frontend group (or use a different group altogether and take care of DNS manually).
Public DNS
Float offers an authoritative DNS service; it is part of the frontend service, so it runs on the same hosts as the HTTP reverse proxies.
DNS entries are automatically generated for all known public_endpoints, as well as for the "public" domains in domain_public.
The DNS server is currently Bind, itself configured via zonetool, an intermediate YAML-based language that supports templates and inheritance.
There is the option of configuring DNSSEC (TODO: add docs).
Customizing DNS
If you want to set up a custom DNS zone, one way to do so is with a dedicated Ansible role (to be run on hosts in the frontend group) that installs your desired zonetool configuration.
Let's walk through a complete example: suppose we have a service myservice that should serve HTTP requests for the myservice.org domain. This doesn't match the service_name.domain scheme that is expected for services described in services.yml, so float won't automatically generate its DNS configuration.
What we need to do is set up the myservice.org DNS zone ourselves, and then tell float to associate that domain to the myservice service.
First, we create a new Ansible role that we are going to call myservice-dns, so in the root of your Ansible config:
$ mkdir -p roles/myservice-dns/{handlers,tasks,templates}
The only task in the role should install a zonetool DNS configuration file into /etc/dns/manual, so in roles/myservice-dns/tasks/main.yml we'll have:
---
- name: Install myservice DNS configuration
template:
src: myservice.yml.j2
dest: /etc/dns/manual/myservice.yml
notify: reload DNS
The DNS configuration in our case is very simple and just points "www" and the top-level domain at the frontends. We do so by extending the @base zone template defined by float. The contents of roles/myservice-dns/templates/myservice.yml.j2 should be:
---
myservice.org:
EXTENDS: "@base"
www: CNAME www.{{ domain_public[0] }}.
This points the www domain at the frontends via a CNAME (all the domain_public DNS zones are already autogenerated by float). We could have just as easily used A records but this is simpler and works with both IPv4 and IPv6.
Finally, we need a handler to reload the updated DNS configuration, which goes in roles/myservice-dns/handlers/main.yml and runs a shell command to update zonetool:
---
- listen: reload DNS
shell: "/usr/sbin/update-dns && rndc reload"
With the above we have a complete Ansible role that configures DNS for the myservice.org domain. We need to tell Ansible that this role needs to run on the hosts in the frontend group, so in your playbook you should have:
- hosts: frontend
roles:
- myservice-dns
And to complete our configuration, the service description for myservice should have a public_endpoint directive including the domain, so that the float HTTP router knows where to send the requests:
myservice:
...
public_endpoints:
- name: myservice
domains:
- www.myservice.org
- myservice.org
port: ...
SSL
The internal ACME service continuously monitors the configured list of public domains and attempts to create or renew valid SSL certificates for them using Letsencrypt. It is integrated with the HTTP reverse proxy, so it will use the http-01 ACME validation protocol, meaning that it is only able to create certificates for domains that have an A record pointing to float's frontend hosts.
To prevent issues with starting up daemons and missing certificates, float will at first generate placeholder self-signed certificates, so that services can use them even before the ACME automation has had a chance to create valid ones.
The certificates created by the ACME service are then replicated to all frontend hosts via the replds daemon, eventually ending up in the /etc/credentials/public directory.
If a service that is not running on the frontend hosts needs access to the certificates, it can do so by depending on the float-infra-acme-storage role, e.g.:
roles/myservice/meta/main.yml
---
dependencies:
- {role: float-infra-acme-storage}
which will again ensure that the SSL certificates are present on the local host's /etc/credentials/public directory.
Access to the SSL certificates is controlled by membership in the public-credentials UNIX group.
If a service needs to be reloaded when its certificates change, it should install a shell script hook in the /etc/acme-storage/reload-hooks directory. This script will be invoked every time any certificate changes, which is why the script should inspect whether the specific certificate it cares about has changed or not (possibly using something like the if-changed tool), to avoid excessive reloads:
#!/bin/sh
if-changed /etc/credentials/public/my.dom.ain/ \
&& systemctl restart myservice
exit 0
Generating additional SSL certificates
To customize the ACME server configuration, use a dedicated Ansible role that runs on the same group as the acme service, and dump a configuration file in /etc/acme/certs.d:
roles/myservice-acme/tasks/main.yml
- name: Configure ACME for my custom domain
copy:
dest: /etc/acme/certs.d/mydomain.yml
content: |
- names:
- "my.dom.ain"
playbook.yml
- hosts: acme
roles:
- myservice-acme
SSH
Float can take over the SSH configuration of the managed hosts, and perform the following tasks:
- create a SSH Certification Authority
- sign the SSH host keys of all hosts with that CA
- add all the admin users' ssh_keys to the authorized_keys list for the root user on all hosts.
The underlying access model is very simple and expects admins to log in as root in order to run Ansible, so you'll most likely want to set ansible_user=root and ansible_become=false in your config as well.
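In practice this means setting the variables just mentioned, for instance in group_vars/all.yml:
group_vars/all.yml
ansible_user: root
ansible_become: false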
Keys used for login will be logged in the audit log, so you can still tell admins apart.
SSH Client Setup
You will find the public key for this CA in the credentials/ssh/key.pub file; it will be created the first time you run the init-credentials playbook.
Assuming that all your target hosts share the same domain (so you can use a wildcard), you should add the following entry to ~/.ssh/known_hosts:
@cert_authority *.example.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAA....
Since all logins happen as root, it may be convenient to also add a section to your ~/.ssh/config file like the following:
Host *.example.com
User root
Integrating base services with other automation
Most float services whose configuration is automatically generated from the float configuration also support integration with other, presumably service-driven, types of automation.
Consider, for example, the case of a platform for user web hosting: the main HTTP traffic routing infrastructure has to be extended with the configuration for all the user domains, which presumably comes out of a database somewhere.
In order to support such integration, services such as the HTTP router, DNS, ACME, and others, will also read their configurations from an auto directory (such as /etc/nginx/sites-auto for example), which is not managed by float at all and that can be delegated to the external automation.
Infrastructure Part 2: Cluster Services
Authentication and Identity
The float infrastructure provides a full AAA solution that is used by all the built-in services, and that can be easily integrated with your services (or at least that would be the intention). It aims to implement modern solutions, and support moderately complex scenarios, while keeping things as simple as possible -- an area that could still see some improvement. It offers the following features:
- supports users and groups (mostly admins, and possibly end users)
- supports multiple backends (file, LDAP, SQL, ...)
- mechanisms for account recovery (currently poor, via secondary password, other mechanisms should be implemented)
- transparent upgrade of password hashing mechanisms (for future-proofing) (somewhat TODO)
- single sign-on for HTTP services
- TOTP and U2F authentication mechanisms for HTTP services
- supports passwords tied to specific services (wrongly called application-specific) for non-HTTP services
- manages secrets (encryption keys) encrypted with the user password, in a way that works even over single sign-on
- supports partitioned services
- configurable rate limits and blacklists for brute-force protection
- tracks logins and user devices without storing PII
- it is modular, and can be adapted to the desired scale / shape
However it is important to note that it comes with a very long list of caveats as well:
- the single sign-on system is implemented with bearer tokens (signed
HTTP cookies), which have numerous weaknesses, even if one ignores
the possible implementation failures:
- bearer tokens are vulnerable to exfiltration (in logs, in user browser histories, caches, etc.), which can be partially mitigated by short token lifetimes
- logout is a somewhat ill-defined operation (the current implementation relies on borderline-adtech techniques in order to delete cookies on other services' domains)
- they rely on a complex chain of HTTP redirects and HTTP headers being set in the right place
Most of these features do not have immediate use in the basic services built into the infrastructure; they are meant instead for the primary use case for float: the implementation of a large-ish email and hosting provider.
It should therefore be clear that the chosen design involves numerous trade-offs, some of which we have tried to document here, that are tailored to the above use case, and might very well not be suitable to your particular scenario.
In float, the primary user authentication database is provided via a global variable in your Ansible configuration and controls access to the internal web-based services that are behind single sign-on.
Authentication
All credentials-based authentication (passwords, OTP, U2F) goes through the main authentication daemon auth-server. It translates authentication requests, containing service name, user name, password, and other authentication parameters, into database requests to retrieve the authentication primaries and verify them.
An authentication response has one of three possible states: failure, success, and the request for further authentication with a second factor (OTP or U2F, in which case the response will also contain U2F challenge parameters). On a successful response, the auth-server might return additional data such as an email address. The auth-server listens on a UNIX socket, so it usually runs on all machines, and speaks a simple line-based protocol. There is also a PAM module available to help integrate your services.
Database lookup queries can be configured separately for each supported service, along with a number of other parameters.
The default setup in float uses a file-based backend for administrator accounts (in the admin group), and optionally an LDAP database for user accounts (LDAP was a requirement of the main float use case; SQL support should be added instead).
The auth-server can log authentication events and the associated client and device information to long-term storage, and it can detect anomalies and take action (the standard use case is "send an email when you see a login from a new device").
Why not PAM? PAM is not exactly a nice interface, and it isn't easy to pass arbitrary information through its conversation API (required for OTP/U2F). Furthermore, there are many advantages in using a standalone authentication server: centralization of rate limits across different services, a single point for logging, auditing and monitoring, and a single ownership of database authorization credentials.
References