docs(adrs): move decision folder into docs to make them publicly visible within docs

Johannes Kirschbauer
2025-05-14 10:06:24 +02:00
parent 5a7c49ad82
commit a1d2948914
7 changed files with 37 additions and 13 deletions


@@ -0,0 +1,548 @@
# Clan service modules
## Status
Accepted
## Context
To define a service in Clan, you need to define two things:
- `clanModule` - defined by module authors
- `inventory` - defined by users
The `clanModule` is currently a plain NixOS module. It is conditionally imported into each machine depending on the `service` and `role`.
A `role` is a function of a machine within a service. For example, in the `backup` service there are `client` and `server` roles.
The `inventory` contains the settings for the user/consumer of the module. It describes what `services` run on each machine and with which `roles`.
Additionally, any `service` can be instantiated multiple times.
This ADR proposes that we change how to write a `clanModule`. The `inventory` should get a new attribute called `instances` that allows for configuration of these modules.
### Status Quo
In this example, the user configures two instances of the `networking` service:
The *user* defines
```nix
{
inventory.services = {
# anything inside an instance is instance specific
networking."instance1" = {
roles.client.tags = [ "all" ];
machines.foo.config = { ... /* machine specific settings */ };
# this will not apply to `clients` outside of `instance1`
roles.client.config = { ... /* client specific settings */ };
};
networking."instance2" = {
roles.server.tags = [ "all" ];
config = { ... /* applies to every machine that runs this instance */ };
};
};
}
```
The *module author* defines:
```nix
# networking/roles/client.nix
{ config, ... }:
let
instances = config.clan.inventory.services.networking or { };
serviceConfig = config.clan.networking;
in {
## Set some nixos options
}
```
### Problems
Problems with the current way of writing clanModules:
1. No way to retrieve the config of a single service instance, together with its name.
2. Directly exporting a single, anonymous nixosModule without any intermediary attribute layers doesn't leave room for exporting other inventory resources such as potentially `vars` or `homeManagerConfig`.
3. Can't access multiple config instances individually.
Example:
```nix
inventory = {
services = {
network.c-base = {
instanceConfig.ips = {
mors = "172.139.0.2";
};
};
network.gg23 = {
instanceConfig.ips = {
mors = "10.23.0.2";
};
};
};
};
```
This doesn't work because all instance configs are applied to the same namespace, which currently results in a conflict.
Resolving this problem means that new inventory modules can no longer be plain NixOS modules. If they are configured via `instances` / `instanceConfig`, they cannot be configured without using the inventory. (There might be ways to inject `instanceConfig`, but that requires knowledge of inventory internals.)
4. Writing modules for multiple instances is cumbersome. Currently the clanModule author has to write one or more `fold` operations for potentially every NixOS option to define how multiple service instances merge into each single option. The idea behind this ADR is to pull that common fold function into the outer context and provide it as a shared helper. (See the example below; `perInstance` is analogous to the well-known `perSystem` of flake-parts.)
5. Each role has a different interface. We need to render that interface into a JSON schema, which currently requires creating an unnecessary test machine. Defining the interface at a higher level (outside of any machine context) allows faster evaluation and isolation by design from any machine.
This allows rendering the UI (options tree) of a service by just knowing the service and the corresponding roles without creating a dummy machine.
6. The interface for defining config is wrong. It is possible to define config that applies to multiple machines at once, and config that applies to
a machine as a whole. This is wrong behavior because the options exist at the role level, so config must also always exist at the role level (see the sketch below).
Currently we merge options and config together, but that may produce conflicts. Those module system conflicts are very hard to foresee since they depend on which roles exist at runtime.
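A hedged illustration of that conflict, using the status-quo inventory syntax from above and a hypothetical `port` option:
```nix
{
  inventory.services = {
    networking."instance1" = {
      roles.client.tags = [ "all" ];
      # config for machine "foo" as a whole (hypothetical option `port`)
      machines.foo.config.port = 1;
      # config for every machine with the "client" role, including "foo"
      roles.client.config.port = 2;
      # both definitions end up in the same option namespace on "foo",
      # producing a module system conflict that is hard to foresee
    };
  };
}
```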
## Proposed Change
We will create a new module class which is defined by `_class = "clan.service"` ([documented here](https://nixos.org/manual/nixpkgs/stable/#module-system-lib-evalModules-param-class)).
Existing clan modules will still work by continuing to be plain NixOS modules. All new modules can set `_class = "clan.service";` to use the proposed features.
In short, the change introduces a new module class that turns the currently necessary folding over a `clan.service`'s `instances` and `roles` into a common operation. The module author defines the inner function of those fold operations, which is called a `clan.service` module.
Such a module has the following attributes:
### `roles.<roleName>.interface`
Each role can have a different interface for its configuration.
For example, a `client` role might have different options than a `server` role.
This attribute should be used to define `options` (not `config`!).
The end-user defines the corresponding `config`.
This submodule is evaluated for each `instance`/`role` combination and passed as an argument into `perInstance`.
The submodule's `options` are also evaluated to build the UI for that module dynamically.
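A minimal sketch of such an interface; the role and option names are hypothetical:
```nix
{
  _class = "clan.service";

  roles.client.interface =
    { lib, ... }:
    {
      # Only `options` are declared here; the end-user provides the corresponding `config`.
      options.port = lib.mkOption {
        type = lib.types.port;
        default = 8080;
        description = "Port the client connects to.";
      };
    };
}
```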
### **Result attributes**
Modules of this proposal produce some common result attributes. They are referenced later in this document and are commonly defined as follows (a rough shape sketch follows this list):
- `nixosModule` A single NixOS module. (`{ config, ... }: { environment.systemPackages = [ ]; }`)
- `services.<serviceName>` An attribute set of modules with `_class = "clan.service"`, each containing the same structure this whole ADR proposes.
- `vars` To be defined. Reserved for now.
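As a rough sketch, a single step of the fold (for example one `perInstance` call) would return result attributes of this shape; the contents are illustrative:
```nix
{
  # A single NixOS module contributed by this step.
  nixosModule = { config, ... }: { environment.systemPackages = [ ]; };

  # Nested services; each value has the same shape this ADR proposes.
  services = { };

  # `vars` is reserved and therefore omitted here.
}
```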
### `roles.<roleName>.perInstance`
This acts like a function that maps over all `service instances` of a given `role`.
It produces the previously defined **result attributes**.
This allows producing multiple `nixosModule`s, one for every instance of the service,
making multiple `service instances` convenient by leveraging the module system's merge behavior.
### `perMachine`
This acts like a function that maps over all `machines` of a given `service`.
It produces the previously defined **result attributes**.
This allows producing exactly one `nixosModule` per `service`,
making it easy to set NixOS options only once when they have a one-to-one relation to a service being enabled.
Note: `lib.mkIf` can be used on, for example, a role name to make the scope more specific, as sketched below.
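A minimal sketch of that scoping, assuming the `machine` argument shown in the example further below:
```nix
{
  perMachine =
    { machine, ... }:
    {
      nixosModule =
        { lib, ... }:
        {
          # Only takes effect on machines that carry the "server" role.
          config = lib.mkIf (builtins.elem "server" machine.roles) {
            networking.firewall.allowedTCPPorts = [ 22 ];
          };
        };
    };
}
```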
### `services.<serviceName>`
This allows defining nested services.
For example, the *service* `backup` might define a nested *service* `ssh` which sets up an SSH connection.
Nested services can be defined in both `perMachine` and `perInstance`:
- For every `instance`, a given `service` may add multiple nested `services`.
- A given `service` may add a static set of nested `services`, even if there are multiple instances of that service.
Q: Why is this not a top-level attribute?
A: Because nested service definitions may also depend on a `role`, which must be resolved depending on `machine` and `instance`. The top-level module doesn't know anything about machines. Keeping the service layer machine-agnostic allows us to build the UI for a module without adding any machines (one of the problems with the current system).
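As a hedged sketch (service and attribute names are hypothetical), a `backup` service could emit a nested `ssh` service from within `perInstance`:
```nix
{
  _class = "clan.service";

  roles.client.perInstance =
    { instanceName, ... }:
    {
      # One nested service per instance; merged like any other result attribute.
      services."ssh-${instanceName}" = {
        _class = "clan.service";
        # roles, perInstance and perMachine of the nested service would go here
      };
    };
}
```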
```
zerotier/default.nix
```
```nix
# Some example module
{
_class = "clan.service";
# Analogous to flake-parts' 'perSystem', except that it maps over instances
# The exact arguments will be specified and documented along with the actual implementation.
roles.client.perInstance = {
# attrs : settings of that instance
settings,
# string : name of the instance
instanceName,
# { name :: string , roles :: listOf string; }
machine,
# { {roleName} :: { machines :: listOf string; } }
roles,
...
}:
{
# Return a nixos module for every instance.
# The module author must be aware that this may return multiple modules (one for every instance) which are merged natively
nixosModule = {
config.debug."${instanceName}-client" = settings;
};
};
# Function that is called once for every machine with the role "client"
# Receives at least the following parameters:
#
# machine :: { name :: String, roles :: listOf string; }
# Name of the machine
#
# instances :: { instanceName :: { roleName :: { machines :: [ string ]; }}}
# Resolved roles
# Same type as currently in `clan.inventory.services.<ServiceName>.<InstanceName>.roles`
#
# The exact arguments will be specified and documented along with the actual implementation.
perMachine = {machine, instances, ... }: {
nixosModule =
{ lib, ... }:
{
# Some shared code should be put into a shared file
# Which is then imported into all/some roles
imports = [
../shared.nix
] ++
(lib.optional (builtins.elem "client" machine.roles)
{
options.debug = lib.mkOption {
type = lib.types.attrsOf lib.types.raw;
};
});
};
};
}
```
## Inventory.instances
This document also proposes adding a new attribute to the inventory that allows for exclusive configuration of the new modules.
This better separates the new and the old ways of writing and configuring modules, keeping the new implementation focused and keeping existing technical debt out from the beginning.
The following thoughts went into this:
- Getting rid of `<serviceName>`: Using only the attribute name (plain string) is not sufficient for defining the source of the service module. Encoding meta information into it would also require some extensible format specification and parser.
- Removing `instanceConfig` and `machineConfig`: there is no such config. Service configuration must always be role-specific, because the options are defined on the role.
- Renaming `config` to `settings` (or similar), since `config` is a module-system-internal name.
- Tags and machines should be attribute sets to allow setting `settings` at that level instead.
```nix
{
inventory.instances = {
"instance1" = {
# Allows to define where the module should be imported from.
module = {
input = "clan-core";
name = "borgbackup";
};
# settings that apply to all client machines
roles.client.settings = {};
# settings that apply to the client service of machine with name <machineName>
# There might be a server service that takes different settings on the same machine!
roles.client.machines.<machineName>.settings = {};
# settings that apply to all client-instances with tag <tagName>
roles.client.tags.<tagName>.settings = {};
};
"instance2" = {
# ...
};
};
}
```
## Iteration note
We want to implement the system as described. Once we have sufficient data on real-world use cases and modules, we might revisit this document along with the updated implementation.
## Real world example
The following module demonstrates the idea in the example of *borgbackup*.
```nix
{
_class = "clan.service";
# Define the 'options' of 'settings'; see the corresponding argument of perInstance
roles.server.interface =
{ lib, ... }:
{
options.directory = lib.mkOption {
type = lib.types.str;
default = "/var/lib/borgbackup";
description = ''
The directory where the borgbackup repositories are stored.
'';
};
};
roles.server.perInstance =
{
instanceName,
settings,
roles,
...
}:
{
nixosModule =
{ config, lib, ... }:
let
dir = config.clan.core.settings.directory;
machineDir = dir + "/vars/per-machine/";
allClients = roles.client.machines;
in
{
# services.borgbackup is a native nixos option
config.services.borgbackup.repos =
let
borgbackupIpMachinePath = machine: machineDir + machine + "/borgbackup/borgbackup.ssh.pub/value";
machinesMaybeKey = builtins.map (
machine:
let
fullPath = borgbackupIpMachinePath machine;
in
if builtins.pathExists fullPath then
machine
else
lib.warn ''
Machine ${machine} does not have a borgbackup key at ${fullPath},
run `clan var generate ${machine}` to generate it.
'' null
) allClients;
machinesWithKey = lib.filter (x: x != null) machinesMaybeKey;
hosts = builtins.map (machine: {
name = instanceName + machine;
value = {
path = "${settings.directory}/${machine}";
authorizedKeys = [ (builtins.readFile (borgbackupIpMachinePath machine)) ];
};
}) machinesWithKey;
in
builtins.listToAttrs hosts;
};
};
roles.client.interface =
{ lib, ... }:
{
# There might be a better interface now. This is just how clan borgbackup was configured in the 'old' way
options.destinations = lib.mkOption {
type = lib.types.attrsOf (
lib.types.submodule (
{ name, ... }:
{
options = {
name = lib.mkOption {
type = lib.types.strMatching "^[a-zA-Z0-9._-]+$";
default = name;
description = "the name of the backup job";
};
repo = lib.mkOption {
type = lib.types.str;
description = "the borgbackup repository to backup to";
};
rsh = lib.mkOption {
type = lib.types.nullOr lib.types.str;
default = null;
defaultText = "ssh -i \${config.clan.core.vars.generators.borgbackup.files.\"borgbackup.ssh\".path} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null";
description = "the rsh to use for the backup";
};
};
}
)
);
default = { };
description = ''
destinations where the machine should be backed up to
'';
};
options.exclude = lib.mkOption {
type = lib.types.listOf lib.types.str;
example = [ "*.pyc" ];
default = [ ];
description = ''
Directories/Files to exclude from the backup.
Use * as a wildcard.
'';
};
};
roles.client.perInstance =
{
instanceName,
roles,
machine,
settings,
...
}:
{
nixosModule =
{
config,
lib,
pkgs,
...
}:
let
allServers = roles.server.machines;
# machineName = config.clan.core.settings.machine.name;
# cfg = config.clan.borgbackup;
preBackupScript = ''
declare -A preCommandErrors
${lib.concatMapStringsSep "\n" (
state:
lib.optionalString (state.preBackupCommand != null) ''
echo "Running pre-backup command for ${state.name}"
if ! /run/current-system/sw/bin/${state.preBackupCommand}; then
preCommandErrors["${state.name}"]=1
fi
''
) (lib.attrValues config.clan.core.state)}
if [[ ''${#preCommandErrors[@]} -gt 0 ]]; then
echo "pre-backup commands failed for the following services:"
for state in "''${!preCommandErrors[@]}"; do
echo " $state"
done
exit 1
fi
'';
destinations =
let
destList = builtins.map (serverName: {
name = "${instanceName}-${serverName}";
value = {
repo = "borg@${serverName}:/var/lib/borgbackup/${machine.name}";
rsh = "ssh -i ${
config.clan.core.vars.generators."borgbackup-${instanceName}".files."borgbackup.ssh".path
} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=Yes";
} // settings.destinations.${serverName};
}) allServers;
in
(builtins.listToAttrs destList);
in
{
config = {
# Derived from the destinations
systemd.services = lib.mapAttrs' (
_: dest:
lib.nameValuePair "borgbackup-job-${instanceName}-${dest.name}" {
# Since borgbackup mounts the system read-only, we need to run an ExecStartPre script so we can generate additional files.
serviceConfig.ExecStartPre = [
''+${pkgs.writeShellScript "borgbackup-job-${dest.name}-pre-backup-commands" preBackupScript}''
];
}
) destinations;
services.borgbackup.jobs = lib.mapAttrs (_destinationName: dest: {
paths = lib.unique (
lib.flatten (map (state: state.folders) (lib.attrValues config.clan.core.state))
);
exclude = settings.exclude;
repo = dest.repo;
environment.BORG_RSH = dest.rsh;
compression = "auto,zstd";
startAt = "*-*-* 01:00:00";
persistentTimer = true;
encryption = {
mode = "repokey";
passCommand = "cat ${config.clan.core.vars.generators."borgbackup-${instanceName}".files."borgbackup.repokey".path}";
};
prune.keep = {
within = "1d"; # Keep all archives from the last day
daily = 7;
weekly = 4;
monthly = 0;
};
}) destinations;
environment.systemPackages = [
(pkgs.writeShellApplication {
name = "borgbackup-create";
runtimeInputs = [ config.systemd.package ];
text = ''
${lib.concatMapStringsSep "\n" (dest: ''
systemctl start borgbackup-job-${dest.name}
'') (lib.attrValues destinations)}
'';
})
(pkgs.writeShellApplication {
name = "borgbackup-list";
runtimeInputs = [ pkgs.jq ];
text = ''
(${
lib.concatMapStringsSep "\n" (
dest:
# we need yes here to skip the changed url verification
''echo y | /run/current-system/sw/bin/borg-job-${dest.name} list --json | jq '[.archives[] | {"name": ("${dest.name}::${dest.repo}::" + .name)}]' ''
) (lib.attrValues destinations)
}) | jq -s 'add // []'
'';
})
(pkgs.writeShellApplication {
name = "borgbackup-restore";
runtimeInputs = [ pkgs.gawk ];
text = ''
cd /
IFS=':' read -ra FOLDER <<< "''${FOLDERS-}"
job_name=$(echo "$NAME" | awk -F'::' '{print $1}')
backup_name=''${NAME#"$job_name"::}
if [[ ! -x /run/current-system/sw/bin/borg-job-"$job_name" ]]; then
echo "borg-job-$job_name not found: Backup name is invalid" >&2
exit 1
fi
echo y | /run/current-system/sw/bin/borg-job-"$job_name" extract "$backup_name" "''${FOLDER[@]}"
'';
})
];
# every borgbackup instance adds its own vars
clan.core.vars.generators."borgbackup-${instanceName}" = {
files."borgbackup.ssh.pub".secret = false;
files."borgbackup.ssh" = { };
files."borgbackup.repokey" = { };
migrateFact = "borgbackup";
runtimeInputs = [
pkgs.coreutils
pkgs.openssh
pkgs.xkcdpass
];
script = ''
ssh-keygen -t ed25519 -N "" -f $out/borgbackup.ssh
xkcdpass -n 4 -d - > $out/borgbackup.repokey
'';
};
};
};
};
perMachine = {
nixosModule =
{ ... }:
{
clan.core.backups.providers.borgbackup = {
list = "borgbackup-list";
create = "borgbackup-create";
restore = "borgbackup-restore";
};
};
};
}
```
## Prior-art
- https://github.com/NixOS/nixops
- https://github.com/infinisil/nixus


@@ -0,0 +1,114 @@
# Clan as library
## Status
Accepted
## Context
In the long term, we envision the clan application will consist of the following user-facing tools:
- `CLI`
- `TUI`
- `Desktop Application`
- `REST-API`
- `Mobile Application`
We are not sure whether all of those will exist, but the architecture should be generic enough that they are possible without major changes to the underlying system.
## Decision
This leads to the conclusion that we should do library-centric development,
with the current `clan` Python code being a library that can be imported to create various tools on top of it.
All **CLI** or **UI** related parts should be moved out of the main library.
Imagine roughly the following architecture:
``` mermaid
graph TD
%% Define styles
classDef frontend fill:#f9f,stroke:#333,stroke-width:2px;
classDef backend fill:#bbf,stroke:#333,stroke-width:2px;
classDef storage fill:#ff9,stroke:#333,stroke-width:2px;
classDef testing fill:#cfc,stroke:#333,stroke-width:2px;
%% Define nodes
user(["User"]) -->|Interacts with| Frontends
subgraph "Frontends"
CLI["CLI"]:::frontend
APP["Desktop App"]:::frontend
TUI["TUI"]:::frontend
REST["REST API"]:::frontend
end
subgraph "Python"
API["Library <br>for interacting with clan"]:::backend
BusinessLogic["Business Logic<br>Implements actions like 'machine create'"]:::backend
STORAGE[("Persistence")]:::storage
NIX["Nix Eval & Build"]:::backend
end
subgraph "CI/CD & Tests"
TEST["Feature Testing"]:::testing
end
%% Define connections
CLI --> API
APP --> API
TUI --> API
REST --> API
TEST --> API
API --> BusinessLogic
BusinessLogic --> STORAGE
BusinessLogic --> NIX
```
This simple design ensures that all the basic features remain stable across all frontends.
It is then straightforward to call the Python library functions from a testing framework to ensure that kind of stability.
Integration tests and smaller unit-tests should both be utilized to ensure the stability of the library.
Note: Library functions don't have to be JSON-serializable in general.
Persistence includes but is not limited to: creating git commits, writing to inventory.json, reading and writing vars, and interacting with persisted data in general.
## Benefits / Drawbacks
- (+) Less tight coupling of frontend- / backend-teams
- (+) Consistency and inherent behavior
- (+) Performance & Scalability
- (+) Different frontends for different user groups
- (+) Documentation per library function makes it convenient to interact with the clan resources.
- (+) Testing the library ensures stability of the underlying functionality for all layers above.
- (-) Complexity overhead
- (-) The library needs to be designed / documented
    - (+) The library can be well documented since it is a finite set of functions.
- (-) Error handling might be harder.
    - (+) Common error reporting
- (-) Different frontends need different features. The library must include them all.
    - (+) All those core features must be implemented anyway.
- (+) VPN benchmarking already uses the existing library and works relatively well.
## Implementation considerations
Not all details that will need to change over time can be pointed out ahead of time.
The goal of this document is to create a common understanding for how we like our project to be structured.
Any future commits should contribute to this goal.
Some ideas of what might need to change:
- Having separate locations or packages for the library and the CLI.
- Rename the `clan_cli` package to `clan` and move the `cli` frontend into a subfolder or a separate package.
- Python argparse or other CLI-related code should not exist in the `clan` Python library.
- `__init__.py` should be very minimal. Only init the business-logic models and resources. Note that all `__init__.py` files all the way up in the module tree are always executed as part of the Python module import logic and thus should be as small as possible.
  e.g. `from clan_cli.vars.generators import ...` executes both `clan_cli/__init__.py` and `clan_cli/vars/__init__.py` if they exist.
- The `api` folder doesn't make sense since the Python library `clan` is the API.
- Logic needed for the web UI that performs JSON serialization and deserialization will live in some `json-adapter` folder or package.
- Code for serializing dataclasses and typed dictionaries is needed for the persistence layer (e.g. for reading and writing inventory.json).
- The inventory JSON is an internal backend resource. Its logic includes merging, unmerging, and partial updates while considering Nix values and their priorities. Nobody should read or write it directly.
  Instead there will be library methods, e.g. to add a `service` or to update/read/delete information from it.
- Library functions should be carefully designed with suitable conventions for writing good APIs in mind (e.g. https://swagger.io/resources/articles/best-practices-in-api-design/).


@@ -0,0 +1,47 @@
# ADR Numbering process
## Status
Proposed after some conversation between @lassulus, @Mic92, & @lopter.
## Context
It can be useful to refer to ADRs by their numbers, rather than their full title. To that end, short and sequential numbers are useful.
The issue is that an ADR number is effectively assigned only when the ADR is merged; before being merged, its number is provisional. Because multiple ADRs can be written at the same time, you end up with multiple provisional ADRs sharing the same number. For example, this is the third ADR-3:
1. ADR-3-clan-compat: see [#3212];
2. ADR-3-fetching-nix-from-python: see [#3452];
3. ADR-3-numbering-process: this ADR.
This situation makes it impossible to refer to an ADR by its number, and is why I (@lopter) went with the arbitrary number 7 in [#3196].
We could solve this problem by using the PR number as the ADR number (@lassulus). The issue is that PR numbers are getting big in clan-core, which does not make them easy to remember or to use in conversation and code (@lopter).
Another approach would be to move the ADRs to a different repository; this would reset the counter back to 1 and make it straightforward to keep ADR and PR numbers in sync (@lopter). The issue then is that ADRs are not in context with their changes, which makes them more difficult to review (@Mic92).
## Decision
A third approach would be to:
1. Commit ADRs before they are approved, so that the next ADR number gets assigned;
1. Open a PR for the proposed ADR;
1. Update the ADR file committed in step 1, so that its markdown contents point to the PR that tracks it.
## Consequences
### ADRs have unique and memorable numbers through their entire life cycle
This makes it easier to refer to them in conversation or in code.
### You need to have commit access to get an ADR number assigned
This makes it more difficult for someone external to the project to contribute an ADR.
### Creating a new ADR requires multiple commits
Maybe a script or CI flow could help with that if it becomes painful.
[#3212]: https://git.clan.lol/clan/clan-core/pulls/3212/
[#3452]: https://git.clan.lol/clan/clan-core/pulls/3452/
[#3196]: https://git.clan.lol/clan/clan-core/pulls/3196/


@@ -0,0 +1,16 @@
# Architecture Decision Records
This section contains the architecture decisions that have been reviewed and generally agreed upon.
## What is an ADR?
> An architecture decision record (ADR) is a document that captures an important architecture decision made along with its context and consequences.
!!! Note
    For further reading about ADRs we recommend [architecture-decision-record](https://github.com/joelparkerhenderson/architecture-decision-record)
## Crafting a new ADR
1. Use the [template](./_template.md)
2. Create the pull request and gather feedback
3. Retrieve your ADR number (see [numbering](./03-adr-numbering-process.md))


@@ -0,0 +1,24 @@
# Decision record template by Michael Nygard
This is the template in [Documenting architecture decisions - Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
You can use [adr-tools](https://github.com/npryce/adr-tools) for managing the ADR files.
In each ADR file, write these sections:
# Title
## Status
What is the status, such as proposed, accepted, rejected, deprecated, superseded, etc.?
## Context
What is the issue that we're seeing that is motivating this decision or change?
## Decision
What is the change that we're proposing and/or doing?
## Consequences
What becomes easier or more difficult to do because of this change?