Windows-on-NixOS, part 2: Make it go fast!

By Linus Heckemann | Wed, 17 Jun 2020

This is part 2 of a series of blog posts explaining how we took an existing Windows installation on hardware and moved it into a VM running on top of NixOS. Previously, we discussed how we performed the actual storage migration. In this post, we’ll cover the various performance optimisations we tried, what worked, and what didn’t work.

GPU passthrough

Since the machine is, amongst other things, used for gaming, graphics performance is critical. It has both Intel graphics, integrated into the CPU, and a discrete NVIDIA graphics card. We passed the discrete card through to the Windows guest, which allowed graphics output simultaneously from Linux (via the integrated graphics) and from Windows (via the discrete graphics card). This was done primarily with the help of the Arch Linux wiki.

The relevant NixOS config for our Intel+nvidia system is as follows:

boot.kernelModules = ["vfio-pci"];, to enable the kernel module responsible for making a PCI(e) device available to virtualisation guests;
boot.blacklistedKernelModules = ["nouveau"];, to disable the nouveau driver for NVIDIA graphics cards. If it were allowed, it would automatically be loaded and bound to the graphics card, and prevent the vfio driver from taking control of the card. The driver can also be unloaded imperatively at runtime, but since this is fiddly¹ and we don’t anticipate needing the graphics card in Linux, we opted to prevent it from being loaded in the first place;
boot.kernelParams = ["intel_iommu=on"];, to enable the IOMMU, which is responsible for managing DMA and is needed to map the graphics memory into the VM.

Then the VM required some configuration. First and foremost, we added the device itself: we found the PCI addresses of the graphics card and its HDMI audio interface using lspci:

[...]
04:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
[...]

then added both to the VM using Add Hardware > PCI Host Device in virt-manager. The graphics card appeared in the VM, and we were able to see its output on the TV connected to its HDMI port, but it only produced a very low-resolution image, and graphics acceleration did not work. This turned out to be because Windows was using the generic VGA driver for it: the NVIDIA driver refused to run, reporting only a mysterious “Error 43”. Apparently the NVIDIA driver doesn’t like virtualisation, and tests specifically for the Microsoft hypervisor vendor string. The error can thus be avoided by changing the Hyper-V vendor string to anything other than “Microsoft Hv” within the hyperv element in the features element of the VM config, and by “hiding” the KVM virtualisation:

<features>
  [...]
  <hyperv>
    [...]
    <vendor_id state="on" value="whatever"/>
  </hyperv>
  <kvm>
    <hidden state="on"/>
  </kvm>
  [...]
</features>

Thanks very much to Jack Ford for his post on the subject of GPU passthrough, which brought this trick to our attention!

With this fix applied, the VM gave us glorious 4K output and smooth graphics performance.
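For readers who prefer editing the domain XML directly, the PCI Host Device step corresponds to hostdev elements in the VM definition. The following Python sketch turns an lspci line into such an element. The function name hostdev_xml is made up for illustration, and managed="yes" (letting libvirt handle binding the device to vfio) is an assumption about what virt-manager writes, not something we checked against our own config.

```python
import re
import xml.etree.ElementTree as ET

# Matches the address at the start of an lspci line, e.g. "04:00.0" or
# "0000:04:00.0" (the PCI domain is optional and defaults to 0000).
PCI_ADDR = re.compile(
    r"^(?:(?P<domain>[0-9a-f]{4}):)?(?P<bus>[0-9a-f]{2}):"
    r"(?P<slot>[0-9a-f]{2})\.(?P<function>[0-7])\b",
    re.IGNORECASE,
)


def hostdev_xml(lspci_line: str) -> str:
    """Build a libvirt <hostdev> element for the device on one lspci line."""
    match = PCI_ADDR.match(lspci_line.strip())
    if match is None:
        raise ValueError(f"no PCI address found in: {lspci_line!r}")
    # Assumption: managed="yes", so libvirt rebinds the device itself.
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain="0x" + (match["domain"] or "0000"),
        bus="0x" + match["bus"],
        slot="0x" + match["slot"],
        function="0x" + match["function"],
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    for line in (
        "04:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)",
        "04:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)",
    ):
        print(hostdev_xml(line))
```

Both functions of the card (graphics and HDMI audio) get their own element, which mirrors adding them one by one in virt-manager.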

CPU topology

A multicore system can have multiple processing elements at several levels:

Sockets: in hardware, these are physical sockets on the motherboard, and this level can be varied by end users: if a motherboard has multiple sockets, a user can buy one or several CPUs to place in the sockets;
Cores: within each CPU unit, there may be multiple instances of the silicon that does the processing, which allows for parallelism within a single chip;
Threads: one physical core may be able to perform multiple computations at once. This has become commonplace on Intel processors, where the feature is called “Hyper-Threading Technology”.

The physical machine has one CPU with 4 physical cores, each providing 2 logical CPUs (threads). However, virt-manager’s default configuration is to emulate n sockets for a system with n logical CPUs. The Windows guest was not capable of making use of the extra cores in this topology automatically, presumably because desktop editions of Windows only use a limited number of CPU sockets, which made it noticeably sluggish. Changing the topology in the CPUs section in virt-manager to match the physical CPU (1 socket, 4 cores, 2 threads), and raising the “Current allocation” to match (it doesn’t increase automatically!), made it significantly faster.
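As a rough illustration of what these virt-manager settings amount to in the domain XML, here is a small Python sketch that derives the vCPU count from a topology and emits the corresponding vcpu and cpu elements. The helper name cpu_topology_xml is hypothetical, and real configs usually carry more attributes (such as a CPU mode) that are left out here.

```python
import xml.etree.ElementTree as ET


def cpu_topology_xml(sockets: int, cores: int, threads: int) -> tuple[str, str]:
    """Return the <vcpu> and <cpu> elements for a given guest CPU topology."""
    # The guest sees one logical CPU per socket x core x thread combination.
    vcpus = sockets * cores * threads

    vcpu = ET.Element("vcpu", placement="static")
    vcpu.text = str(vcpus)

    cpu = ET.Element("cpu")
    ET.SubElement(
        cpu,
        "topology",
        sockets=str(sockets),
        cores=str(cores),
        threads=str(threads),
    )
    return ET.tostring(vcpu, encoding="unicode"), ET.tostring(cpu, encoding="unicode")


if __name__ == "__main__":
    # Our machine: 1 socket, 4 cores, 2 threads per core.
    for element in cpu_topology_xml(sockets=1, cores=4, threads=2):
        print(element)
```

As far as we understand it, virt-manager’s “Current allocation” maps to the optional current attribute on the vcpu element, which is why it has to be raised separately from the topology.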

CPU pinning

Pinning virtual threads to physical threads enables the guest OS’s scheduler to perform its work more accurately, which can help make better use of the CPU caches and reduce costly context switches. This was implemented by adding the following to the VM’s XML definition:

<domain>
  [...]
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="3"/>
    <vcpupin vcpu="4" cpuset="4"/>
    <vcpupin vcpu="5" cpuset="5"/>
    <vcpupin vcpu="6" cpuset="6"/>
    <vcpupin vcpu="7" cpuset="7"/>
  </cputune>
  [...]
</domain>
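The one-to-one mapping above lines guest threads up with host threads only if the host numbers the two threads of a core consecutively; many hosts instead number them apart (for example 0 and 4). The Python sketch below derives a pinning from the host’s thread_siblings_list files in sysfs, so that each pair of guest threads lands on one physical core. It assumes the guest enumerates the threads of a core consecutively (vCPUs 0 and 1 on the first core, and so on), which we believe is what QEMU does but have not verified here. All function names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list such as "0,4" or "0-1" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_host_siblings(root: Path = Path("/sys/devices/system/cpu")) -> dict[int, str]:
    """Read each host CPU's thread_siblings_list from sysfs."""
    result = {}
    for path in root.glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu_number = int(path.parent.parent.name[3:])  # "cpu12" -> 12
        result[cpu_number] = path.read_text()
    return result


def pinning_by_core(siblings: dict[int, str]) -> list[tuple[int, int]]:
    """Pair consecutive vCPUs with host CPUs, keeping sibling threads together."""
    # Each distinct sibling list describes one physical core.
    cores = sorted({tuple(parse_cpu_list(s)) for s in siblings.values()})
    host_order = [cpu for core in cores for cpu in core]
    return list(enumerate(host_order))


def vcpupin_xml(pins: list[tuple[int, int]]) -> str:
    """Render (vcpu, host cpu) pairs as a <cputune> element."""
    lines = [f'<vcpupin vcpu="{vcpu}" cpuset="{host}"/>' for vcpu, host in pins]
    return "<cputune>\n  " + "\n  ".join(lines) + "\n</cputune>"
```

Calling vcpupin_xml(pinning_by_core(read_host_siblings())) on the host prints a cputune element for the whole machine; on a host where siblings are already numbered consecutively, it reproduces the identity mapping shown above.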

Huge pages

Huge pages allow allocating large contiguous chunks of memory to the VM, which can improve performance by providing a view of memory to the guest OS which is closer to physical reality. We statically reserved 24 GiB of huge pages using kernel parameters, since we don’t currently anticipate running many memory-intensive things directly in the Linux host OS:

boot.kernelParams = ["hugepagesz=1G" "hugepages=24"];

And to make libvirt use hugepages for the VM’s memory, we added a memoryBacking element to the domain config:

<domain>
  [...]
  <memoryBacking><hugepages/></memoryBacking>
  [...]
</domain>
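Before starting the VM, it can be worth checking that the kernel actually reserved the pages requested at boot. The Python sketch below computes how many pages a given amount of guest memory needs and reads the pool counters for one page size from sysfs. The function names are hypothetical, and the sketch only reports numbers; it does not change any settings.

```python
from pathlib import Path

GIB_IN_KIB = 1024 * 1024  # one GiB expressed in KiB


def pages_needed(memory_gib: int, page_size_kib: int = GIB_IN_KIB) -> int:
    """Number of huge pages needed to back `memory_gib` GiB of guest memory."""
    total_kib = memory_gib * GIB_IN_KIB
    return -(-total_kib // page_size_kib)  # ceiling division


def hugepage_pool(
    page_size_kib: int = GIB_IN_KIB,
    root: Path = Path("/sys/kernel/mm/hugepages"),
) -> tuple[int, int]:
    """Return (total, free) pages in the kernel's pool for one page size."""
    pool = root / f"hugepages-{page_size_kib}kB"
    total = int((pool / "nr_hugepages").read_text())
    free = int((pool / "free_hugepages").read_text())
    return total, free
```

With the kernel parameters above, pages_needed(24) gives 24, and hugepage_pool() should report 24 total pages on the host if the reservation succeeded.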

Network optimisations

VirtIO

QEMU allows using the virtio interface to provide virtual NICs with less overhead than emulating physical cards. However, this requires drivers that are not built into Windows, so we needed to download them via the emulated physical network hardware before changing the libvirt configuration to use virtio. Windows virtio drivers are available from the Fedora project; we downloaded the stable virtio-win ISO file, then powered off the machine and changed its network card to use the virtio interface. After the next boot, we were able to install the drivers by:

Mounting the ISO image as a virtual CD drive (I was pleased to find that Windows has this feature built-in nowadays!)
Entering Windows’s Device Manager (accessible via a right-click menu on the Windows logo)
Telling it to install a driver for the unknown device, and to search in the virtual CD drive.

Bridging

Libvirt’s default setup is to create a Linux bridge device on which DHCP and DNS are provided through dnsmasq, and where the guest’s traffic is NATed to appear as if it comes from the host. This has some performance overhead and makes port forwarding necessary if we want to make services (e.g. game servers) available via the network from the VM. This can be avoided in multiple ways. We chose to create a bridge containing the physical uplink interface, so the VM appears like any other machine connected to the LAN. We did this by setting the NixOS option networking.bridges.br-lan.interfaces = [ "eno2" ]; (eno2 is the name of the physical network interface), then selecting the resulting bridge br-lan as the network source for the NIC in virt-manager.
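Taken together, the two network changes (virtio model, bridge source) boil down to one interface element in the domain XML. The Python sketch below generates it; the helper name bridged_nic_xml is made up, and a real definition written by virt-manager would also contain details such as the MAC address, which are omitted here.

```python
import xml.etree.ElementTree as ET


def bridged_nic_xml(bridge: str, model: str = "virtio") -> str:
    """Build a libvirt <interface> element attached to an existing host bridge."""
    nic = ET.Element("interface", type="bridge")
    ET.SubElement(nic, "source", bridge=bridge)  # host bridge device to join
    ET.SubElement(nic, "model", type=model)      # NIC model presented to the guest
    return ET.tostring(nic, encoding="unicode")


if __name__ == "__main__":
    print(bridged_nic_xml("br-lan"))
```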

Storage

The original setup was a single-HDD ZFS pool, with an L2ARC (similar to a read cache) and SLOG (similar to a write cache) on an SSD. We had major storage performance issues with the VM, with boot times of around 5 minutes and extremely long response times from Windows’s UI. We experimented with various IO and cache modes for QEMU, none of which seemed to help.

Not scrubbing while trying to use the VM

One of the first major improvements we managed to make was quite a silly one: we had been playing with the VM while ZFS was scrubbing the pool, that is, reading all the data off the disk to verify its integrity. It took embarrassingly long for us to realise that this was why the disk was so busy. Unfortunately, this didn’t cure all our I/O performance ills!

Caching and backing hardware

One cause we’ve speculated on is that – since the block device exposed by QEMU supports discard/TRIM operations, or because the filesystem was originally created on an SSD – Windows’s filesystem assumes it’s on an SSD and applies a random-access-heavy queueing strategy that is optimal for SSDs (and awful for HDDs!). This would result in awful performance and keep the hard drive busy seeking almost all the time. We expected things to improve after rebooting the VM, since the caches (both the RAM ARC and the SSD L2ARC) would be warmed up and no access to the HDD would be required, but this was not the case either.

What did improve performance a great deal was moving the zvol to an SSD-only pool. This lends some credibility to the theory of SSD-optimised access patterns, but still doesn’t explain why the caches didn’t improve boot speed. We have yet to investigate further, but will be sure to report back if we make any major discoveries! Our next step will probably be to install a fresh copy of Windows on the HDD pool and compare the performance, along with performing some proper measurements rather than just seeing how the performance “feels”.

Virtio (failure)

We tried switching the storage from emulated SATA to virtio. To do this, we couldn’t just change the type of the storage volume to virtio, then boot the VM and install the driver, since the driver would be necessary for booting! For this reason, we first created a “dummy” virtio storage device which would simply serve as a device that we could install a driver for, as we did with the network device. After installing the device driver with this method and changing the boot device to virtio as well, the VM didn’t boot. We’ve postponed this effort for now and may take another shot at getting it working in the future, again by installing a fresh copy of Windows directly onto the virtio device.

Conclusion

That concludes the current state of the performance optimisations on our gaming VM. It now performs well enough to run a variety of games, including computationally intensive ones like No Man’s Sky. Unfortunately, this means that we will now be occupied with using it more than with tweaking it for a while :-) But stay tuned: eventually we’ll get bored of our games and go back to tuning the VM’s performance, most likely with a stronger focus on benchmarks and less on perceived performance.

¹ Unloading the nouveau driver requires stopping everything that’s using devices provided by it. This generally includes:
- X server (systemctl stop display-manager) or Wayland compositor (how to stop it depends on the compositor)
- Audio clients using the card’s HDMI output, e.g. pulseaudio
- Linux’s framebuffer console (echo 0 > /sys/class/vtconsole/vtcon$n/bind; $n may vary)
Only after these steps can nouveau be unloaded using rmmod nouveau.
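Since $n varies between machines, a small Python sketch like the following can locate the right console by reading each vtcon’s name file and keeping the ones that describe a framebuffer device. The function name is hypothetical, and matching on the text “frame buffer” is an assumption based on the names the kernel typically reports. The sketch only finds the bind files and unbinds nothing itself.

```python
from pathlib import Path


def framebuffer_consoles(root: Path = Path("/sys/class/vtconsole")) -> list[Path]:
    """Return the bind files of all virtual consoles backed by a framebuffer."""
    binds = []
    for vtcon in sorted(root.glob("vtcon*")):
        name = (vtcon / "name").read_text()
        # Assumption: framebuffer consoles are named "... frame buffer device".
        if "frame buffer" in name:
            binds.append(vtcon / "bind")
    return binds
```

Each returned path is a file to which 0 would be written (as root) to unbind that console, as in the echo command above.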