Safe service upgrades using system.stateVersion
By Maximilian Bosch | Thu, 28 Jan 2021
One of the most important features for system administrators who operate NixOS systems is
atomic upgrades, which means that a deployment won’t reach
an inconsistent state: if building a new system’s configuration succeeds, it will be activated in
a single step by replacing the /run/current-system symlink. If
a build fails, e.g. due to broken packages, the configuration won’t be activated.
This also means that downgrades are fairly simple since a previous configuration can be reactivated
in a so-called rollback by pointing the /run/current-system symlink
back to the previous store path. This is
helpful if e.g. the configuration for a service is technically valid, but doesn’t do what it’s
supposed to and thus has to be removed again. Or as my colleague Robin has said,
“with NixOS you don’t have to be afraid of deploys anymore” (de).
While this model is revolutionary for configurations, it doesn’t really tackle modifications in
the application’s state, such as the structure of a data-directory in e.g. /var/lib.
A well-known example for this is postgresql, where manual intervention can become necessary
when upgrading to a newer version.
In order to make sure that a newer configuration doesn’t contain potentially breaking changes
without an explicit opt-in from an administrator, NixOS uses a mechanism named stateVersion.
Semantics of system.stateVersion
The stateVersion is a config-option that contains the version of NixOS that was used when
the machine was initially provisioned:
{
  system.stateVersion = "20.09";
}
Now, let’s say that a new version of postgresql was released and manual intervention
is needed in order to get it running on existing setups. If the package itself gets updated in
NixOS and is deployed at some point to this example machine, the database process will break.
In the worst case, existing data will be corrupted when the service gets restarted after the deploy.
To avoid that, traditional config management systems would check e.g. for existing directory
structures on the target system. For instance, the exec resource
in Puppet has an onlyif option that ensures that the command will only be executed if
a given condition evaluates to true.
However, this is not possible with NixOS where building a system’s config doesn’t happen at the
time it’s activated. Instead, stateVersion could be used in the database module in nixpkgs:
{ pkgs, lib, config, ... }:
let
  inherit (config.system) stateVersion;
  package = if lib.versionOlder stateVersion "21.05"
    then pkgs.postgresql_old
    else pkgs.postgresql_new;
in {
  environment.systemPackages = [ package ];
}
This code decides whether to install postgresql_old or postgresql_new in the system
depending on the stateVersion. So why does this make sense?
- If your stateVersion is at 20.09, it can be assumed that NixOS 20.09 is the first version installed.
- The package postgresql_new is added to nixos-unstable which will turn into the next release (in this case 21.05) at some point.
- This means that users of 21.05 and newer don’t have an existing, old version of postgresql. For every machine that ran on 20.09 at some point, postgresql_old will be used to make sure that no existing database will reach a broken state.
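To see how the comparison resolves for concrete values, the selection logic can be evaluated in isolation. The following is a minimal sketch that uses plain strings in place of real packages. As assumptions for this example, versionOlder is reimplemented on top of builtins.compareVersions so that the expression evaluates without nixpkgs, and pick is a hypothetical helper name:

```nix
let
  # Stand-in for lib.versionOlder: true if version `a` is older than `b`.
  versionOlder = a: b: builtins.compareVersions a b < 0;

  # Hypothetical helper: name of the package the module above would choose.
  pick = stateVersion:
    if versionOlder stateVersion "21.05"
    then "postgresql_old"
    else "postgresql_new";
in
  # One entry per example stateVersion.
  map (v: { stateVersion = v; package = pick v; }) [ "20.03" "20.09" "21.05" ]
```

Evaluating this file, for example with nix-instantiate --eval --strict, selects postgresql_old for 20.03 and 20.09 and postgresql_new for 21.05.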
While this is a fairly useful concept, it has the problem that it’s not necessarily transparent
to a sysadmin what’s happening inside and which configs will be selected. Also, it’s non-trivial
for 20.09 systems to manually upgrade the database and use the new version after that
without modifying system.stateVersion.
A more complex example, where this is actually needed and reading release notes is not necessarily sufficient, is Nextcloud.
Designing an upgrade path for Nextcloud
A fairly notable example where this approach becomes necessary is the self-hostable cloud-platform
Nextcloud. This application uses a database, stores files in /var/lib
and has a fairly stateful way of managing itself with configuration files in /var/lib/nextcloud.
Challenges for packagers
Packaging the Nextcloud service on NixOS turned out to be a non-trivial job. The issues can be summarized into the following two aspects:
- Nextcloud isn’t really designed to be configured declaratively: instead, their own tool called occ is supposed to be used in order to generate PHP code which specifies the configuration. It will be written into the data-directory which is /var/lib/nextcloud on NixOS.

  The NixOS module uses occ at the first install and writes declarative config into a second PHP file in /var/lib/nextcloud for some degree of declarativity. However, Nextcloud’s configuration is still heavily tied to the stateful occ command.

  For instance, the maintenance mode can only be activated by occ.

- Nextcloud doesn’t support upgrades across multiple major releases. For instance, if Nextcloud 18 is installed, it’s impossible to directly go to Nextcloud 20. Instead, an upgrade to Nextcloud 19 is needed first.
It becomes somewhat clear that one should be careful with Nextcloud updates. An accidental update
can result in having to restore a backup and potentially losing data that was written
after the latest backup. Hence, it’s important to carefully integrate
system.stateVersion into the module to provide safe upgrades.
Let the user decide
First of all I’d like to thank my colleague fpletz who helped me work this out.
To make the latest version available without forcing users to upgrade, a package is available
for each supported major release. At the time of writing these
are nextcloud18, nextcloud19 and nextcloud20.
As a first measure, selecting the effective package can be done by the administrator with
the option services.nextcloud.package.
Because of that, the module doesn’t use the stateVersion to force a
certain package version on the administrator.
But how does stateVersion come into play here? When services.nextcloud.package
is set, the choice is up to the administrator. But if it isn’t, a default version will be determined
and set using mkDefault according to the following scheme:
- If a Nextcloud version was released before e.g. 20.09, it will be selected by default for every stateVersion below 20.09.
- If a new Nextcloud version is supposed to be added to nixpkgs, it will be default on NixOS unstable and the next upcoming NixOS release.
So, a simplified version of the expression in the Nextcloud module would look like this:
{ config, lib, pkgs, ... }: with lib;

let inherit (config.system) stateVersion; in
{
  services.nextcloud.package = with pkgs;
    mkDefault (
      if versionOlder stateVersion "20.03" then nextcloud17
      else if versionOlder stateVersion "20.09" then nextcloud18
      /* ... */
      else nextcloud20
    );
}
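Because the default is wrapped in mkDefault, it has a lower priority than a regular definition, so an explicit assignment in the machine’s configuration takes precedence without needing mkForce. As a minimal sketch, assuming the nextcloud19 attribute exists in the nixpkgs revision in use, an administrator could pin the version like this:

```nix
{ pkgs, ... }:
{
  # A plain definition overrides the module's mkDefault value,
  # regardless of what system.stateVersion would have selected.
  services.nextcloud.package = pkgs.nextcloud19;
}
```

Removing this line again hands the decision back to the default derived from stateVersion.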
If older versions are in use, a warning with additional information
will be displayed during evaluation. So if a Nextcloud server on NixOS has 19.09 as stateVersion
and Nextcloud 17 installed, the following things will happen when updating NixOS to 20.03
where Nextcloud 19 is the latest version:
- The module assumes that a Nextcloud from 19.09 (i.e. v17) is installed and yields an evaluation warning which tells about the ongoing upgrade to v18 that will happen during the deploy and that v19 is the latest and recommended version. It also warns that upgrades across multiple majors are not possible and the ongoing upgrade to v18 should be finished first.
- The administrator can now specify nextcloud19 in services.nextcloud.package.
- After a second deploy, Nextcloud is at version 19.
So the module guides the administrator through the upgrade by checking
system.stateVersion and services.nextcloud.package
and issuing relevant warnings as appropriate.
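Such warning logic could be sketched roughly as follows. This is an illustration and not the code from the nixpkgs module: the upgradeWarning helper, the message text and the hard-coded version thresholds are assumptions made for this example. Only the warnings option, lib.optional, lib.versionOlder and lib.versionAtLeast are existing NixOS and nixpkgs facilities.

```nix
{ config, lib, ... }:

let
  cfg = config.services.nextcloud;

  # Hypothetical helper that builds the message for an outdated major release.
  upgradeWarning = major: ''
    Nextcloud ${toString major} is selected via services.nextcloud.package.
    Upgrades across multiple major releases are not supported, so finish the
    upgrade to this version before setting the package to
    nextcloud${toString (major + 1)}.
  '';
in
{
  # `warnings` is a list of strings that NixOS prints during evaluation.
  warnings =
    lib.optional (lib.versionOlder cfg.package.version "18") (upgradeWarning 17)
    ++ lib.optional
      (lib.versionAtLeast cfg.package.version "18"
        && lib.versionOlder cfg.package.version "19")
      (upgradeWarning 18);
}
```

The checks look at the version of the package that is actually selected, so they apply both when the default derived from stateVersion is in effect and when the administrator has set the option explicitly.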
If you’re interested in the origins of the concept, it’s recommended to read the conversation in NixOS/nixpkgs#82353.
A new backport policy
Due to the approach mentioned above it’s not necessary anymore to pin an arbitrary major release
of Nextcloud to a stable NixOS branch. Instead, every new major release will be ported to each
active stable NixOS. In contrast to nixos-unstable, the default for services.nextcloud.package
won’t be touched.
Because of that, administrators can completely decide on their own when to upgrade to a new major version. If a major release on a stable NixOS reaches the end of its upstream support period, it will be marked as insecure and will therefore not evaluate anymore by default. However, it’s possible to force that with an expression like this:
{
  nixpkgs.config.permittedInsecurePackages = [
    "nextcloud-X.Y.Z"
  ];
}
Thoughts for the future
The solution to the Nextcloud upgrade problems also addresses the other
issue with stateVersion described earlier: if the decision which
package/structure/etc. to use is solely based on stateVersion, older systems can be tied
to old software. With the approach via mkDefault described here, this is not the case
anymore.
While it’s most convenient to test Nextcloud on existing instances (with a solid backup strategy), it is sometimes helpful to automate upgrade testing if the changes to be covered don’t exceed a reasonable level of complexity. An example for automated upgrade tests would be the Hydra test on 20.09 where a similar approach was necessary after Graham did an incredible job at optimizing Hydra’s DB schema.