Safe service upgrades using system.stateVersion

By Maximilian Bosch | Thu, 28 Jan 2021

One of the most important features for system administrators who operate NixOS systems is atomic upgrades, which means that a deployment won’t reach an inconsistent state: if building a new system’s configuration succeeds, it will be activated in a single step by replacing the /run/current-system symlink. If a build fails, e.g. due to broken packages, the configuration won’t be activated.

This also means that downgrades are fairly simple since a previous configuration can be reactivated in a so-called rollback by changing the symlink to /run/current-system back to the previous store-path. This is helpful if e.g. the configuration for a service is technically valid, but doesn’t do what it’s supposed to and thus has to be removed again. Or as my colleague Robin has said, “with NixOS you don’t have to be afraid of deploys anymore” (de).

While this model is revolutionary for configurations, it doesn’t really tackle modifications to an application’s state, such as the structure of a data-directory in e.g. /var/lib. A well-known example of this is postgresql, where manual intervention can become necessary when upgrading to a newer version.

In order to make sure that a newer configuration doesn’t contain potentially breaking changes without an explicit opt-in from an administrator, NixOS uses a mechanism named stateVersion.

Semantics of system.stateVersion

The stateVersion is a config-option that contains the version of NixOS that was used when the machine was initially provisioned:

{
  system.stateVersion = "20.09";
}

Now, let’s say that a new version of postgresql was released and manual intervention is needed in order to get it running on existing setups. If the package itself gets updated in NixOS and is deployed to this example machine at some point, the database process will break.

In the worst case, existing data will be corrupted when the service gets restarted after the deploy.

To avoid that, traditional config management systems would check e.g. for existing directory structures on the target system. For instance, the exec-resource in Puppet has an onlyif-option that ensures that the command will only be executed if a given condition evaluates to true.

However, this is not possible with NixOS, where a system’s config is built ahead of time rather than at the moment it’s activated. Instead, stateVersion could be used in the database module in nixpkgs:

{ pkgs, lib, config, ... }:
let
  inherit (config.system) stateVersion;
  package = if lib.versionOlder stateVersion "21.05"
    then pkgs.postgresql_old
    else pkgs.postgresql_new;
in {
  environment.systemPackages = [ package ];
}

This code decides whether to install postgresql_old or postgresql_new in the system depending on the stateVersion. So why does this make sense?

  • If your stateVersion is at 20.09, it can be assumed that NixOS 20.09 is the first version installed.
  • The package postgresql_new is added to nixos-unstable which will turn into the next release (in this case 21.05) at some point.
  • This means that users of 21.05 and newer don’t have an existing, old version of postgresql. For every machine that ran on 20.09 at some point, postgresql_old will be used to make sure that no existing database will reach a broken state.
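To make the comparison concrete, here is a minimal sketch of how lib.versionOlder behaves for a few stateVersion values. It assumes that nixpkgs is reachable as <nixpkgs> on the NIX_PATH; the file name and the attribute names in the result are purely illustrative and not part of any module.

```nix
# Evaluate with, for example: nix-instantiate --eval --strict versions.nix
# (the file name is arbitrary; assumes <nixpkgs> is on NIX_PATH)
let
  lib = import <nixpkgs/lib>;
in {
  # 20.09 is older than 21.05, so such a machine would keep postgresql_old.
  provisionedOn2009 = lib.versionOlder "20.09" "21.05";
  # A machine provisioned on 21.05 is not older, so it would get postgresql_new.
  provisionedOn2105 = lib.versionOlder "21.05" "21.05";
  # Later releases also select the new package.
  provisionedOn2111 = lib.versionOlder "21.11" "21.05";
}
```

The first attribute evaluates to true, the other two to false. Since stateVersion is not changed when the system is upgraded, the result of this comparison stays the same for the lifetime of a machine.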

While this is a fairly useful concept, it has the problem that it’s not necessarily transparent to a sysadmin what’s happening inside and which configs will be selected. Also, it’s non-trivial for 20.09 systems to manually upgrade the database and use the new version after that without modifying system.stateVersion.

A more complex example, where this is actually needed and reading release notes is not necessarily sufficient, is Nextcloud.

Designing an upgrade path for Nextcloud

A fairly notable example where this approach becomes necessary is the self-hostable cloud-platform Nextcloud. This application uses a database, stores files in /var/lib and has a fairly stateful way of managing itself with configuration files in /var/lib/nextcloud.

Challenges for packagers

Packaging the Nextcloud service on NixOS turned out to be a non-trivial job. The issues can be summarized into the following two aspects:

  • Nextcloud isn’t really designed to be configured declaratively: instead, their own tool called occ is supposed to be used in order to generate PHP code which specifies the configuration. It will be written into the data-directory which is /var/lib/nextcloud on NixOS.

    The NixOS module uses occ at the first install and writes declarative config into a second PHP file in /var/lib/nextcloud for some degree of declarativity. However, Nextcloud’s configuration is still heavily tied to the stateful occ command.

    For instance, the maintenance mode can only be activated by occ.

  • Nextcloud doesn’t support upgrades across multiple major releases. For instance, if Nextcloud 18 is installed, it’s impossible to directly go to Nextcloud 20. Instead, an upgrade to Nextcloud 19 is needed first.

    Also, downgrades to earlier releases are not possible.

It becomes somewhat clear that one should be careful with Nextcloud updates. An accidental update can result in having to restore a backup and potentially losing data that was written after the latest backup. Hence, it’s important to carefully integrate system.stateVersion into the module to provide safe upgrades.

Let the user decide

First of all, I’d like to thank my colleague fpletz who helped me work this out.

To make the latest version available without forcing users to upgrade, a package is available for each supported major release. At the time of writing these are nextcloud18, nextcloud19 and nextcloud20.

As a first measure, the administrator can select the effective package with the option services.nextcloud.package. As a result, the module never uses stateVersion to force a certain package version onto the administrator.

But how does stateVersion come into play here? When services.nextcloud.package is set, the choice is up to the administrator. If it isn’t set, a default version will be determined and applied using mkDefault according to the following scheme:

  • If a Nextcloud version was released before e.g. 20.09, it will be selected by default for every stateVersion below 20.09.
  • If a new Nextcloud version is supposed to be added to nixpkgs, it will be default on NixOS unstable and the next upcoming NixOS release.

So, a simplified version of the expression in the Nextcloud module would look like this:

{ config, lib, pkgs, ... }: with lib;

let inherit (config.system) stateVersion; in
{
  services.nextcloud.package = with pkgs;
    mkDefault (
      if versionOlder stateVersion "20.03" then nextcloud17
      else if versionOlder stateVersion "20.09" then nextcloud18
      /* ... */
      else nextcloud20
    );
}

If older versions are in use, a warning during evaluation with additional information will be displayed. So if a Nextcloud server on NixOS has 19.09 as stateVersion and Nextcloud 17 installed, the following things will happen when updating NixOS to 20.03 where Nextcloud 19 is the latest version:

  • The module assumes that a Nextcloud from 19.09 (i.e. v17) is installed and yields an evaluation warning. The warning announces the upgrade to v18 that will happen during the deploy and points out that v19 is the latest and recommended version. It also warns that upgrades across multiple majors are not possible and that the ongoing upgrade to v18 should be finished first.

  • The administrator can now specify nextcloud19 in services.nextcloud.package.

  • After a second deploy, Nextcloud is at version 19.
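From the administrator’s side, the second step could look like the following sketch. It assumes the machine from the example above (provisioned on 19.09, upgrade to v18 already finished) and leaves out all other Nextcloud options that a real configuration needs.

```nix
{ pkgs, ... }:
{
  # Stays at the release the machine was initially provisioned with.
  system.stateVersion = "19.09";

  # Explicit choice by the administrator. A plain assignment has a higher
  # priority than a value set via mkDefault, so it wins over the default
  # that the module derives from stateVersion.
  services.nextcloud.package = pkgs.nextcloud19;
}
```

Note that stateVersion itself is left untouched: only the package option changes between the two deploys.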

So the module guides the administrator through the upgrade by checking system.stateVersion and services.nextcloud.package and issuing relevant warnings as appropriate.
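Such a warning can be expressed with the generic warnings option of NixOS. The following is only a sketch of the idea and not the actual code of the nixpkgs module; the hard-coded latest value and the wording of the message are assumptions made for illustration.

```nix
{ config, lib, ... }:
let
  cfg = config.services.nextcloud;
  # Assumption for this sketch: v19 is the newest packaged major release.
  latest = "19";
in {
  # `warnings` is a list of messages that are printed during evaluation.
  # lib.optional yields a one-element list if the condition holds and an
  # empty list otherwise.
  warnings = lib.optional
    (cfg.enable && lib.versionOlder cfg.package.version latest)
    ''
      Nextcloud ${cfg.package.version} is selected, but v${latest} is the
      latest release. Upgrades across multiple major releases are not
      possible: finish the current upgrade first, then set
      services.nextcloud.package to the next major release.
    '';
}
```

Because the check looks at the version of the effective package rather than at stateVersion alone, the warning disappears as soon as the administrator has selected the latest release.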

If you’re interested in the origins of the concept, it’s recommended to read the conversation in NixOS/nixpkgs#82353.

A new backport policy

Due to the approach described above, it’s no longer necessary to pin an arbitrary major release of Nextcloud to a stable NixOS branch. Instead, every new major release will be backported to each active stable NixOS release. In contrast to nixos-unstable, the default for services.nextcloud.package won’t be touched there.

Because of that, administrators can completely decide on their own when to upgrade to a new major version. If a major release on a stable NixOS reaches the end of its upstream support period, it will be marked as insecure and will therefore not evaluate anymore by default. However, it’s possible to force that with an expression like this:

{
  nixpkgs.config.permittedInsecurePackages = [
    "nextcloud-X.Y.Z"
  ];
}

Thoughts for the future

The solution to the Nextcloud upgrade problems also solves the other issue with stateVersion mentioned earlier: if the decision which package/structure/etc. to use is solely based on stateVersion, older systems can be tied to old software. With the approach via mkDefault described here, this is not the case anymore.

While it’s most convenient to test Nextcloud upgrades on existing instances (with a solid backup strategy), it is sometimes helpful to automate upgrade testing if the changes to be covered don’t exceed a reasonable level of complexity. An example of automated upgrade tests would be the Hydra test on 20.09 where a similar approach was necessary after Graham did an incredible job at optimizing Hydra’s DB schema.
