Managing Day0/Day1 kmods with KMM

Some kmods might be installed without KMM. In order to enhance KMM's UX we could, in some cases, help customers to transition the lifecycle management of they kmods to KMM.

Definitions

Day 0

The most basic kmods that are required for a node to become “Ready” in the cluster

Examples * A storage driver that is required in order to mount the rootFS as part of the boot process. Vendors will usually work closely with the RHEL team to make those drivers in-tree so we won’t worry about them too much here. * A network driver that is required for the machine to access machine-config-server on the bootstrap node to pull the ignition and join the cluster

Day 1

Kmods that are not required for a node to become “Ready” in the cluster but would not be able to be unloaded once the node is "Ready".

Examples * An OOT network driver that replaces an outdated in-tree driver to exploit the full potential of the NIC while NetworkManager depends on it. Once the node is "Ready" a customer won't be able to unload the driver because of the NetworkManager dependency.

Day2

Kmods that can be dynamically loaded to the kernel or removed from it without interfering with the cluster infrastructure (such as connectivity).

Examples * GPU operator * Secondary network adapters * FPGA

Layering background

When a day0 kmod was installed in the cluster, it means that “layering” was applied through MCO and OCP upgrades won’t trigger node upgrades.

Unless a user wants to add new features to its driver, we will never need to recompile it for them since the node’s OS will remain.

With that being said, MCO has plans to rebuild the node images upon a cluster upgrade when Layering is used by MCO.

Using KMM for managing day0 and day1 kmods

We can leverage KMM to manage the lifecycle of day0/1 kmods without a reboot when the driver allows it. NOTE: It will not work if the upgrade require a node reboot (when rebuilding initramfs is needed for example)

1st option

By treating the kmod as an in-tree driver.

Nothing to do until user wishes to update the kmods.

When the user wishes to upgrade the kmod, they treat it as an in-tree driver and create a Module in the cluster with the inTreeRemoval field to unload the old version of the driver.

Characteristics

  • Down time - KMM will try to unload and load the kmod on all the selected nodes simultaneously.
  • Works in case removing the driver makes the node lose connectivity (because KMM uses a single pod to unload+load the driver)

2nd option

By using ordered upgrade.

In this case, user creates a versioned Module in the cluster representing the kmods - nothing will happen since the kmods are already loaded.

When the user wishes to upgrade the kmod, they use the ordered-upgrade feature.

Characteristics

  • No cluster downtime - the user controls the pace of the upgrade, and how many nodes are upgraded at the same time, therefore, an upgrade with no downtime is possible.
  • Doesn't work if unloading the driver results in losing connection to the node (because KMM will create 2 different worker pods, one for unloading and another for loading which won’t be scheduled)