User Guide

Welcome to the User Guide for YAOOK Operator! This document aims to explain the inner workings of YAOOK Operator; unlike the User Reference, this document is intended to be read linearly; the different chapters are organised in such a way that they build on top of each other. At the end, you should have the necessary understanding of how YAOOK works to be able to venture further in learning and running the system.

Note

Throughout this document, we will often refer to YAOOK Operator. In order to save us some typing and you some reading, we will often abbreviate that as just Operator. To distinguish this from the usual meaning of operator (e.g. the mathematical one, or, indeed, the operator in the context of Kubernetes or a control loop), the YAOOK Operator will always have a leading capital O.

We will use the same convention when we refer to Kubernetes objects such as Deployments, Secrets, Services, etc., to distinguish them from the normal English use of these words.

Finally, before we get started, a word of warning: While YAOOK Operator attempts to make running OpenStack easier, it is never easy. OpenStack is a system of many microservices and it takes on a huge task, namely providing stable and reliable infrastructure as a service (IaaS). You will still need some level of expert knowledge in OpenStack both to deploy and to run YAOOK Operator. However, YAOOK aims to take away the boring stuff and leave you with the exciting breakages which cannot be solved automatically.

Introduction to YAOOK Operator

In this chapter we will look at Operator from a bird’s eye perspective: why does it exist? what are the core ideas? how does it differ from other systems? what are the main components?

What is it?

YAOOK Operator is a set of tools to deploy and operate OpenStack components atop an existing Kubernetes cluster (though YAOOK Kubernetes and YAOOK Bare Metal can be used to deploy such a cluster, too!). The YAOOK is, in fact, an acronym expanding to “Yet Another OpenStack On Kubernetes”, the reason for which we’ll go into in the next section.

Motivation

YAOOK Operator was started in summer 2020, because we were all bored and couldn’t go outside because of that world-wide socio-evolutionary experiment which was going on. Well, not quite true. I mean, yes, there was that thing, but that wasn’t the reason why we did it.

The project was initiated by two companies who were unhappy with how they were orchestrating their OpenStack deployments at the time. Both were using what we call “classical” configuration management systems, such as Chef, Ansible, Puppet, or SaltStack. What these tools have in common and why we call them “classical” is that they:

  • Run only when triggered by the operator, or at best periodically with a “long” (hours or more) interval.

  • Have multiple sources of configuration input with several precedence levels which are merged in non-obvious ways.

  • Typically feature a rather stiff assignment of services to nodes, with potentially high friction when compensating for a node outage or onboarding new nodes.

  • Are only pseudo-declarative: Systems like Chef in particular look declarative (in the sense that you write down the intended system state), but they are not fully declarative: when you remove a declaration, the corresponding thing is not automatically removed from the system. Instead, you need to declare that you no longer want it.

Looking at how our customers were using the cloud, deploying Kubernetes on it and using the resources much more dynamically than we could in the IaaS layer made us envious and we thought there must be a better way.

The Operator Concept

At its core, Kubernetes is just a set of interconnected closed control loops. A closed control loop consists of four components: a controller (or what we call operator), a sensor, a “desired state” input and an actuator. The controller compares the value of the sensor to the desired state input and takes actions using the actuator to bring the reality (measured with the sensor) closer to the desired state.

This is also how Kubernetes works. Take for instance Kubelet. Kubelet has:

  • Sensors: the Container Runtime Interface (CRI) tells Kubelet which pods and containers are deployed on the node and what their state is.

  • Actuators: via the CRI, it can (attempt to) change the set and status of the containers and pods on the node.

  • Desired State Input: The Kubernetes API tells Kubelet which Pods should exist on the node.

That (and the code which attempts to make the Kubernetes API match the reality) makes Kubelet in fact a closed-loop controller. All of Kubernetes (with maybe some bits of RBAC and such which do not really fit this pattern) works this way.

This is a really powerful concept: Whenever the system state diverges from the intended state, processes are kicked off which try to rectify this. A Pod crashes? Restart it. Not enough Pods for the requested number of replicas? Add another Pod!

And we decided to harness this power. From our experience of running OpenStack, there’s always something not running completely smoothly; having a bunch of control loops to bring the things which aren’t behaving back in line certainly seems like a good idea.

YAOOK Operator brings such controllers for each OpenStack service which is supported, as well as some resources which exist inside OpenStack (such as Keystone Users) and some services which are needed as infrastructure to run OpenStack (such as MySQL-compatible databases). The Operators are implemented using a statemachine which we will discuss briefly later (to the extent that knowledge is needed to run YAOOK Operator, which is not much, I promise!).

Each OpenStack service is thus represented as a Kubernetes custom resource (CR), specified by a custom resource definition (CRD). The Operator watches changes to this CR, as well as to any resources it creates to deploy the OpenStack service, and reconciles the desired state (what is in the CR) with the real state (what the resources it has created report) in order to bring the system toward convergence.

For instance, when you go ahead and delete the Secret containing the generated Keystone configuration, the Operator will go ahead and recreate it. If you edit the Keystone API Deployment to change the number of replicas, the Operator will reset that to the number of replicas requested in the CR (which, in turn, will make the Deployment controller of Kubernetes do its thing).

The operator concept is also why we decided to roll our own project instead of building on top of, for instance, OpenStack Helm. This is how we ended up making “yet another” OpenStack on top of Kubernetes. More details on the magic the Operator does can be found in The YAOOK Statemachine.

Configuration Management

(Almost) all configuration which you send into the OpenStack cluster (via the CRs) is passed through a CUE validation layer. This provides the project with two key advantages:

  1. We can do validation against the schema the respective OpenStack service exposes for its configuration (if any), as much as it is representable in CUE.

  2. CUE comes with a mechanism for merging multiple sources of configuration, which we can use directly. This is especially useful for situations where we want to merge node-level configuration with global configuration, for instance, or where YAOOK Operator automatically generates configuration (e.g. credentials and URLs to infrastructure services) and injects those.

    (Spoiler: that merging mechanism is “if it’s not equal, we reject it”. This is great, because this is a commutative operation, which means that we don’t need precedence levels. It is obvious which value will end up in a configuration file.)

That should suffice for now. The extensive details will be discussed in Configuration Concepts.

yaookctl

Before we go into the next chapter, we need to talk briefly about yaookctl. yaookctl is a command-line tool which is supposed to help you to interact with your YAOOK Operator cluster by summarising information and providing shorthands for otherwise complex shell commands (such as “find me the cell1 database of my Nova deployment and get me a privileged SQL shell”, or “get me a stream of the merged logs of all Keystone API pods, JSON-decoded and nicely formatted”).

See the README of yaookctl for a command reference as well as installation instructions.

Scheduling: Labels & Taints

This section assumes that you are familiar with how Kubernetes uses Labels and Taints for controlling the scheduling of workload in a cluster. If you are not, now is a great time to read Assigning Pods to Nodes and Taints and Tolerations from the Kubernetes documentation.

Back already? Great, let’s go on.

Each workload deployed by YAOOK Operator has a set of what we call Scheduling Keys. Each Scheduling Key corresponds to a Kubernetes Label and a Kubernetes Taint, which the workload selects for in its node affinity and tolerates in its tolerations.

That means if a workload has the fictitious example.yaook.cloud/foo and example.yaook.cloud/bar Scheduling Keys, it will run on any node which has any of the example.yaook.cloud/foo or example.yaook.cloud/bar labels with any value. It will also tolerate taints which have the example.yaook.cloud/foo and example.yaook.cloud/bar keys (with any effect).

This simple yet powerful concept allows you to trivially designate nodes for particular purposes. For instance, if you add the (non-fictitious) compute.yaook.cloud/hypervisor taint and label to a node, only Nova Compute services (plus their direct dependencies, such as the layer 2 component of Neutron, which also has the compute.yaook.cloud/hypervisor scheduling key) will run on that node.

No other taints except those from the scheduling keys are tolerated, and it is not possible to change the scheduling keys associated with a workload. That implies that you cannot run YAOOK Operator workload on (tainted) Kubernetes control-plane nodes. If you want such a hyperconverged setup, you would have to remove the control-plane taint.

There is a full reference of scheduling keys available at yaook.op.common.SchedulingKey.

Note

If you do not set any labels, no workload will spawn. Pods will be stuck in the Pending state (or will not even be created). If you find that a rollout gets stuck somewhere, double-check that you have the necessary Scheduling Key labels set.
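For example, to make a node eligible for such workload, you set the corresponding Scheduling Key as a label. This is only a minimal sketch; the node name node-1 is a placeholder and, as noted below, the label value is irrelevant for scheduling:

$ kubectl label node node-1 any.yaook.cloud/api=true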

Note

Reminder: Label values do not matter for scheduling. In Configuration Concepts we will come across a mechanism where label values do matter, however.

Common Node Roles

In this section we will describe a common pattern for organising your nodes:

Kubernetes Control Plane

This is mostly self-explanatory. Leave the Kubernetes Control plane alone on its own nodes to avoid interference between that and the OpenStack system. Keep the node-role.kubernetes.io/control-plane taint where it is and don’t add tolerations for it.

OpenStack Control Plane

These nodes run all the API services, their databases, message queues and memory caches. For that, you need the following scheduling keys, which you apply to the nodes as labels (an example follows after the list):

  • any.yaook.cloud/api

  • infra.yaook.cloud/any

  • operator.yaook.cloud/any

  • key-manager.yaook.cloud/barbican-any-service

  • block-storage.yaook.cloud/cinder-any-service

  • compute.yaook.cloud/nova-any-service

  • ceilometer.yaook.cloud/ceilometer-any-service

  • key-manager.yaook.cloud/barbican-keystone-listener

  • gnocchi.yaook.cloud/metricd

  • infra.yaook.cloud/caching

  • network.yaook.cloud/neutron-northd
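A sketch of how these could be applied to a control-plane node (the node name control-1 is a placeholder and the label value true is arbitrary; repeat the pattern for the remaining keys in the list above):

$ kubectl label node control-1 \
    any.yaook.cloud/api=true \
    infra.yaook.cloud/any=true \
    operator.yaook.cloud/any=true \
    infra.yaook.cloud/caching=true
$ # ... continue with the remaining scheduling keys from the list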

Gateway Nodes

These run Neutron layer 3 services and have extra physical interfaces which connect them to the outside world (provider networks). Depending on which Neutron network setup you run, you use different labels:

For OVS:

  • network.yaook.cloud/neutron-l3-agent

  • network.yaook.cloud/neutron-dhcp-agent

  • network.yaook.cloud/neutron-bgp-agent (if using BGP)

For OVN:

  • network.yaook.cloud/neutron-ovn-agent

  • network.yaook.cloud/neutron-ovn-bgp-agent (if using BGP)

Note

In the case of OVN, the actual “gateway-ness” of a node is not determined by the label (unlike OVS) but by the presence of bridgeConfig in the configTemplates for that node (we will go into details on what that means later). This is a YAOOK limitation which may be fixed eventually (GitLab issue #466) as it prevents connecting VMs to provider networks without also running routers on compute nodes.

Compute Nodes

These run Nova Compute and services which Nova Compute requires. It is recommended to taint them, too, in order to avoid other workload stealing resources from your customer VMs. The scheduling key is compute.yaook.cloud/hypervisor.
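For example (the node name compute-1 is a placeholder; the NoSchedule effect is one common choice, and any effect is tolerated by the workload as described in Scheduling: Labels & Taints):

$ kubectl label node compute-1 compute.yaook.cloud/hypervisor=true
$ kubectl taint node compute-1 compute.yaook.cloud/hypervisor=true:NoSchedule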

Note that this is by no means the only way you could do that; the flexibility of the Scheduling Key system allows you to spread out your control plane to as many nodes as you need. For instance, you could have dedicated database and message queue nodes by removing the infra.yaook.cloud/any scheduling key from your control-plane nodes and using the dedicated labels (infra.yaook.cloud/db, infra.yaook.cloud/mq) instead.

The flexibility of the scheduling system is high, but often it suffices if you start out with just Compute, OpenStack Control Plane (merging the Gateway Nodes role into these) and Kubernetes Control Plane nodes and then scale your scheduling setup to the needs of your cluster.

The YAOOK Statemachine

The YAOOK statemachine is the infernal machine heart of the YAOOK Operator. While we will not go into the implementation details of the statemachine in this user guide, we will look at some of its core concepts and even touch a little on code, but generally focus on how the statemachine affects users.

The Component Graph

Each custom resource managed by an Operator corresponds to a directed acyclic graph, the component graph. The nodes of that graph are resources which are managed by the Operator (those may be Kubernetes resources or external resources, such as OpenStack users) and the edges are dependencies between these resources. A resource node which is part of a custom resource is also called a component.

For example, a simplified Keystone Operator could look like this:

A directed graph with the following nodes: "MySQLService", "Config Secret", "DB sync Job", "Bootstrap Job", "API Deployment", "Admin Credential Secret", "Endpoint ConfigMap", "API Service". Arrows between the nodes show dependencies: The DB sync Job depends on the Config Secret which in turn depends on the MySQLService. The Bootstrap Job depends on the DB sync Job. The API Deployment, Admin Credential Secret and Endpoint ConfigMap all depend on the Bootstrap Job. Finally, the API Service depends on the API Deployment.

Graphical visualisation of a simplified implementation of the Keystone Operator.

This is mostly following the conceptual steps needed for deploying Keystone: You need a database and a Keystone config before you can run db_sync (and the Keystone config needs the database to know the connection URL and database credentials). Once you’ve done that, you run the bootstrap task which initialises the database with an admin user and project. Afterward, you can deploy the API and generate secrets and configs for connecting to the Keystone service. Finally, once the API is up, you can make it available using a Service.

As writing this down in linear code would be error-prone and cluttered, the statemachine library was written. It allows us to express this in code nearly as clearly as the graph itself:

class KeystoneDeployment(sm.ReleaseAwareCustomResource):
    API_GROUP = "yaook.cloud"
    API_GROUP_VERSION = "v1"
    PLURAL = "keystonedeployments"
    KIND = "KeystoneDeployment"

    database = sm.TemplatedMySQLService(..)

    config = sm.CueSecret(
        sm.DatabaseConnectionLayer(
            ..,
            db=database,
        ),
    )

    db_sync = sm.TemplatedJob(
        ..,
        add_dependencies=[config],
    )

    bootstrap = sm.TemplatedJob(
        ..,
        add_dependencies=[
            db_sync,
            config,
        ],
    )

    api = sm.TemplatedDeployment(
        ..,
        add_dependencies=[
            db_sync,
            bootstrap,
            config,
        ],
    )

    service = sm.TemplatedService(
        ..,
        add_dependencies=[api],
    )

Note

This code example does not match the real code; it leaves out a lot of details which should not worry you for now; this document doesn’t aim to make you a YAOOK Operator developer after all, but a user.

If you have written code for things like SQLAlchemy, the concepts will look familiar. Otherwise, let’s take a brief tour. The first line declares the KeystoneDeployment class. By inheriting from ReleaseAwareCustomResource, we make it clear that we want this to be a custom resource definition (and we complete that declaration with the API_GROUP, API_GROUP_VERSION, PLURAL, and KIND values, which match those of the Kubernetes CRD).

Following that are the definitions of the components we saw earlier in the graph (omitting the endpoint configmap and admin credential secret for brevity). The components each have a name (the left-hand side of the = on the line) and a type with parameters (right-hand side of the =, until the closing )). Included in those parameters are the dependencies. Looking at a snippet in detail:

bootstrap = sm.TemplatedJob(
    ..,
    add_dependencies=[
        db_sync,
        config,
    ],
)

This declares the component bootstrap, which is a TemplatedJob (i.e. a Kubernetes Job generated from a template), with explicit dependencies on db_sync and config (which have been declared above).

Why is this important? You will in some places see references to the name of these components. When troubleshooting, it will be important to understand how to look up what such a name refers to: Is it a database service? Is it the API deployment? That will tell you the next step to investigate why things get stuck or seem broken.

If you find yourself in the situation that you need to “resolve” a component name to a resource type, the best way is to run grep on the source tree of YAOOK, starting in the yaook/op code directory.
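For example, to figure out what the fernet_keys component of a KeystoneDeployment is, something along these lines works (the exact path is an assumption about the source layout; adjust as needed):

$ grep -rn "fernet_keys" yaook/op/keystone/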

Note

Future versions may include more details in the status so that you don’t have to do this manually anymore.

Reconciliation Process

When the Operator now observes a change (or creation) of a CR under its watch, it queues it for reconciliation (which would then happen as soon as it has a worker thread available).

During reconciliation, the Operator steps through the components of the dependency graph, always processing the nodes which do not have un-ready dependencies. A dependency is considered ready if it has been reconciled in this run and the resource it represents is “ready”. What exactly “ready” means depends on the specific resource: a Secret is considered ready when it exists, a Deployment is considered ready when all its replicas are up, healthy, and of the newest version. Generally, a resource is ready when it can reasonably be assumed to be consumable by its dependents.

The implementation of stepping through the nodes of the graph (the components) is what we call the YAOOK statemachine.

The reconciliation process can be observed in the operator logs when the operators run with a log level of at least info:

2023-09-01 05:39:49,370    INFO  yaook.op.daemon.yaook.cloud/v1.keystonedeployments.yaook.keystone  reconciling [ 52% ( 23+  0/ 44)] <yaook.op.keystone.cr.EmptySecret component='fernet_keys'>

These log messages are useful for understanding what is going on at the time of a failure, so we’ll dissect the example into its components:

  • Timestamp (2023-09-01 05:39:49,370): The exact time when the log message was written.

  • Log level (INFO): The so-called log level. The levels which exist are DEBUG, INFO, WARN, ERROR, FATAL, in ascending order of severity.

  • Module (yaook.op.daemon.yaook.cloud/v1.keystonedeployments.yaook.keystone): The context in which the log message was emitted.

  • Message (reconciling [ 52% ( 23+  0/ 44)] <yaook.op.keystone.cr.EmptySecret component='fernet_keys'>): Free-form log message.

In the specific example, the Module is in fact a reference to a custom resource. We know this because it contains a /; the Module field only contains a / if it includes an API group and version. When reading operator logs, it is useful to distinguish the general housekeeping or watch handling the operator does from a specific reconciliation loop. You can do this by checking whether a log message relates to a specific CR (then it is from the reconciliation itself) or not (then it is from other tasks).

When the Module contains a /, the message is in relation to a reconciliation run and the format is then the following: yaook.op.daemon, followed by the API group and version, the plural, the namespace and then the resource name. In this case, the API group and version is yaook.cloud/v1, the plural is keystonedeployments, the namespace is yaook and the resource name is keystone. For the keystone operator this information is rather boring, but for e.g. the infra-operator it is the only way to know in the context of which resource a log message is emitted.

The example message is also useful because it comes straight from the statemachine execution. It contains the progress of the current reconciliation run. We can tell because it starts with reconciling, followed by progress information in square brackets.

The progress information (reconciling [ 52% ( 23+  0/ 44)] <yaook.op.keystone.cr.EmptySecret component='fernet_keys'>) has the following format:

  • Fixed text (reconciling): Fixed prefix text of reconciliation status messages.

  • Overall progress ([ 52%): Percentage of components which have already been reconciled in this run.

  • Ready components (( 23): Absolute number of components which have been deemed ready.

  • Blocked components (+  0): Absolute number of components which are not ready.

  • Total components (/ 44)]): Total number of components in this custom resource.

  • Component type (<yaook.op.keystone.cr.EmptySecret): The Python type of the component which is to be reconciled next, in this case an EmptySecret.

  • Component name (component='fernet_keys'>): Name of the component which is to be reconciled, in this case fernet_keys.

A reconciliation run may have different results. The result is stored in the status field of the CR and is displayed by yaookctl status .., as well as kubectl get.

  • Completed, Updated: All components have been reconciled and are ready.

  • WaitingForDependency: There have been no errors, but at least one component could not be processed because a dependency was not ready. The list of non-ready components is included in the status message. This is where the knowledge above about what component names are and where to find them comes in handy.

  • InvalidConfiguration: The configuration provided in the custom resource (e.g. OpenStack config snippets) could not be merged to a consistent and valid configuration. The status message contains more details. See Configuration Concepts.

  • BackingOff: The reconciliation run ran into an unexpected error and the Operator waits for a moment before retrying. The delay increases exponentially on each subsequent failure up to two minutes (Tip: the Operator only keeps the back-off interval in memory. If you need it to re-try faster, you can just restart it). The status message contains more details about the error.

Oftentimes, more than one reconciliation run is needed to enter the Updated state: many runs will end in WaitingForDependency, for instance while a Job is running, a database is being deployed or a Deployment is updating its ReplicaSets and Pods. We call the process from the initial trigger (a change to the custom resource) until the Updated state is reached a “rollout”.
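To follow a rollout, you can simply watch the status of the resource; for example (resource names are placeholders, and the yaookctl status subcommands are described in more detail later on):

$ yaookctl status openstack
$ kubectl -n yaook get keystonedeployments keystone --watch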

Child Resource Labels

To keep track of Kubernetes resources created in the context of a custom resource managed by an Operator, specific labels are used. In this context, we call the custom resource the Operator manages the parent and the resources which are created in that context the children. Note that a child may in turn be managed by another Operator: A common example is a MySQLService which gets created for a KeystoneDeployment. Both are managed by YAOOK Operators, so the MySQLService is both a child (of the KeystoneDeployment) and a parent (for its database StatefulSet, for instance).

These labels are set on the child resources:

state.yaook.cloud/parent-group

The API group of the parent resource, commonly yaook.cloud (but other groups like network.yaook.cloud may be encountered)

state.yaook.cloud/parent-version

The API version of the parent resource, commonly v1

state.yaook.cloud/parent-plural

The API plural of the parent resource, for instance keystonedeployments or mysqlservices. Note that in Kubernetes, there is no clear mapping from “kind” to “plural” and vice versa, which is unfortunate; they cannot be used interchangeably. The plural is, however, the “harder” of the two: it is used in various API paths of Kubernetes and must thus be known programmatically.

state.yaook.cloud/parent-name

The metadata name of the parent resource.

state.yaook.cloud/component

The component of the parent resource in the context of which this resource was created.

state.yaook.cloud/instance

In some cases, a component may consist of multiple instances. For example, each NovaComputeNode object created is an instance within the compute_nodes component of the NovaDeployment resource.

These labels are useful to find resources related to a custom resource. For instance, to list all secrets related to a KeystoneDeployment called keystone, you would run:

$ kubectl -n yaook get secret -l state.yaook.cloud/parent-plural=keystonedeployments,state.yaook.cloud/parent-name=keystone
NAME                             TYPE                DATA   AGE
credential-keys-6dlc4            Opaque              3      5d23h
fernet-keys-pn8hc                Opaque              3      5d23h
keystone-admin                   Opaque              6      5d23h
keystone-api-certificate-67p8d   kubernetes.io/tls   3      5d23h
keystone-api-db-user-hpkhc       Opaque              1      5d23h
keystone-config-vdr26            Opaque              1      2m45s

In addition, there is a label state.yaook.cloud/orphaned. This is set when the resource is no longer actively managed by the Operator, but still in use by a dependent. This can happen for instance when a new version of a service configuration is created: As we treat configuration Secrets as immutable (see later for why), a new Secret gets created and the old one is orphaned. The users (e.g. a Keystone API Deployment) get updated to reference the new Secret. The old Secret is however kept around until the rollover has completed, to defend against faulty configurations.

At the end of each reconciliation run, the Operator looks for orphaned resources which are not used by any other resource under its control anymore. They are then deleted.

Note

These labels are the only way the Operator keeps track of resources; names are irrelevant. Use this knowledge wisely.

In addition to these labels, Kubernetes Owner References are set on all objects created by the Operator. This ensures that all resources are cleaned up by the Kubernetes Garbage Collector when the custom resource is deleted.

Pausing and Unpausing

Sometimes, you need to do manual work on a resource, bypassing what the Operators do. This can happen during disaster recovery or when other unexpected or unsupported states occur. There are two ways to achieve this:

  1. You can stop the Operator. This prevents it from reconciling any resource under its control. Sometimes, this is acceptable or even desirable. Oftentimes, however, finer control is needed.

  2. You can pause a specific custom resource.

To “pause” a resource, you need to set an annotation state.yaook.cloud/pause on it. If the Operator finds this annotation at the start of a reconciliation run, it will skip the run and remove the resource from its queue. The value of the annotation does not matter. When you remove the annotation, the Operator notices the change in the resource and re-runs the reconciliation.

Warning

You can not use the pause annotation to interrupt an ongoing reconciliation run.

It is common practice to set the value of that annotation to a reason why the resource was paused, for your future self to read in case you forget to remove the annotation.
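If you want to do this without yaookctl (which is shown below), setting and removing the annotation with kubectl works just as well; the resource name and the reason text are placeholders:

$ kubectl -n yaook annotate keystonedeployments keystone \
    state.yaook.cloud/pause="manual maintenance, remove me when done"
$ kubectl -n yaook annotate keystonedeployments keystone \
    state.yaook.cloud/pause-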

Note

Some resources require regular activity by the Operators, such as MySQLService instances. These need regular restarts in order to load fresh TLS certificates for their replication traffic. If you keep them paused for longer than the certificate lifetime, you’ll be in a terrible situation.

To easily pause a resource, you can use yaookctl pause <PLURAL> <NAME>:

$ yaookctl pause mysqlservices keystone-v92z7
$ kubectl get -n yaook mysqlservices keystone-v92z7 -o json | jq .metadata.annotations
{
    "state.yaook.cloud/last-update-timestamp": "2023-08-25T11:57:47.173641Z+muV8wdEIZx8",
    "state.yaook.cloud/pause": "paused using yaookctl on 2023-08-31 12:14:55.414552"
}

Unpausing works conversely:

$ yaookctl unpause mysqlservices keystone-v92z7
$ kubectl get -n yaook mysqlservices keystone-v92z7 -o json | jq .metadata.annotations
{
    "state.yaook.cloud/last-update-timestamp": "2023-08-25T11:57:47.173641Z+muV8wdEIZx8"
}

OpenStack as Custom Resources

YAOOK Operator maps the various pieces which together form an OpenStack cluster to Kubernetes resources in order to make them manageable through Kubernetes controllers (which are the YAOOK Operators). These custom resources can be categorized in four classes:

  1. OpenStack services: These directly represent a specific OpenStack service project. They are generally named WhateverDeployment, for instance KeystoneDeployment. For a full reference, please see the reference documentation of the yaook.cloud/v1 API group. We also call these top-level resources because they are generally the resources you create manually.

  2. Infrastructure services: These are additional services, the implementation of which generally lives outside the OpenStack project ecosystem, but which are required to run an OpenStack cluster. Currently, there exist four such services: AMQPServer (message queue), MySQLService (database), MemcachedService (cache service), and OVSDBService (OpenvSwitch database).

  3. OpenStack and infrastructure resources: Some OpenStack resources may be managed by YAOOK Operators and thus also have a representation in Kubernetes. Most notable of those are KeystoneUser and KeystoneEndpoint, which are used extensively by the Operators to create and manage credentials of their services. The same holds for users in infrastructure services (e.g. the MySQLUser resource).

  4. Subservices: Neutron and Nova are complex beasts and they handle highly interruption-sensitive and stateful end-user workload. At some point, it made sense to separate some of the subservices which belong to Nova or Neutron (for instance, the per-node Nova Compute instance) into separate Kubernetes resources in order to manage their life-cycle separately and explicitly.

This chapter can only give a brief overview of the resources and concepts in the various categories. Category 4 will also receive some attention when we talk about Day-two Operations and eviction later.

As mentioned, the first category maps directly to an OpenStack service project, such as Keystone, Nova, Neutron, Cinder and so on. Such a resource represents the entirety of this service, including all required infrastructure services and users, OpenStack resources (users, endpoints), and subservices (nova-compute services, neutron agents).

Deleting such a resource causes it to be, well, deleted. Everything gone, poof [1]. You may want to deploy some kind of validation webhook to prevent that, by the way.

The fields of these resources more-or-less directly map to options of the respective OpenStack service, related Kubernetes resources (such as a Deployment’s replica count), subservices or infrastructure services, and as such are highly dependent on the specific service. The reference documentation of the yaook.cloud/v1 API group documents those options in great detail.

The second category consists of infrastructure services such as the MySQLService. These are life-cycled by the infra-operator, which takes care that they always have up-to-date certificates for their frontend and internal traffic and handles service upgrades during OpenStack release upgrades. Otherwise, they are conceptually rather similar to the first category. In contrast to the first category, however, they generally do not and cannot expose a public API of any kind (and thus don’t need externally valid certificates). Their reference documentation is found in the reference documentation of the infra.yaook.cloud/v1 API group. Their status can be seen in yaookctl using yaookctl status infra.

The fourth category will, as mentioned, be discussed in more detail when we talk about upgrades and eviction.

That leaves the third category. The third category represents objects existing in systems outside Kubernetes, mostly users. These users are created and their passwords managed by the corresponding operator (KeystoneUser and KeystoneEndpoint are handled by the keystone-resources-operator, the infrastructure service users (AMQPUser and MySQLUser) are handled by the infra-operator). In addition, the resources are decorated with a finalizer, so that the responsible Operators get a chance to delete the resource from the external system before it vanishes from the Kubernetes API.

It is in general not recommended that you create such resources on your own, except for testing purposes. In particular, the KeystoneUser resource is not designed to be used for end-user accounts; all users are automatically granted rather wide privileges. In addition, we may in the future change that resource so that it grants rather narrow permissions, scoped to the levels needed by OpenStack services rather than to levels useful to end users.

Still, it is useful to know that these resources exist, as you might encounter them in debugging situations. Their status is visible in yaookctl using yaookctl status credentials.

A reference of finalizers is available in Finalizer Reference.

Cross-service dependencies

Some OpenStack services depend on others. There are three kinds of dependencies:

  • API level dependencies

  • Configuration level dependencies

  • Node level dependencies

API level dependencies are generally resolved by all services talking to the same Keystone and discovering each other’s endpoints there. No extra action is required by YAOOK Operator users here, except having a single Keystone instance and referencing it from all of your services using the spec.keystoneRef field in their CRs.

There is one configuration level dependency which YAOOK resolves for you, which is linking the metadata services of Neutron and Nova together. For this, you need to reference the NovaDeployment from the NeutronDeployment via its spec.novaRef field.
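As a rough sketch of how both references could look inside a NeutronDeployment (the exact sub-fields of keystoneRef and novaRef are documented in the API reference; here we assume they reference the target resources by name):

apiVersion: yaook.cloud/v1
kind: NeutronDeployment
metadata: ..
spec:
    keystoneRef:
        name: keystone
    novaRef:
        name: nova
    ..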

Other configuration level dependencies need to be resolved by the user by configuring the services correctly. One prominent example is the RBD secret UUID, which needs to match in Nova and Cinder configurations in order for volume attachments to work correctly. In general, YAOOK cannot help you there because there is no component which has a view beyond a single OpenStack service.

Finally, there are node level dependencies. The only such dependency implemented in YAOOK is between Neutron layer 2 agents and their dependents: Nova Compute and Neutron layer 3/dhcp/bgp agents. For instance, Nova Compute requires a Neutron layer 2 agent on the same node in order to get VM network interfaces plugged into the correct software-defined network.

The dependency between Neutron layer 2 and the dependents is modelled using labels and annotations. Once the layer 2 agent is first deployed completely, it sets a label indicating whether it needs to go down for maintenance (maintenance.yaook.cloud/maintenance-required-l2-agent). This is typically false right after the initial deployment.

The dependents take this label, when set to false, as an indicator that the node is ready and will start to schedule their services on that node. When they start scheduling workload, they set an annotation (l2-lock.maintenance.yaook.cloud/..) with a suffix corresponding to the dependent service.

If the layer 2 agent needs to perform a disruptive action and there is at least one lock annotation on the node, it changes the value of the maintenance.yaook.cloud/maintenance-required-l2-agent label to true. The dependent services will then take the appropriate action to clear out the node, for instance by triggering an eviction.

Once they have cleared out the node, the dependents remove their lock annotation. Once all lock annotations are gone, the layer 2 agent can proceed with its disruptive action, as it can now be confident that no dependents are left which could be disrupted.
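To see where a node currently stands in this handshake, you can inspect its labels and annotations; for example (the node name is a placeholder):

$ kubectl get node compute-1 -o json \
    | jq '.metadata.labels, .metadata.annotations' \
    | grep -E 'maintenance.yaook.cloud|l2-lock'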

Configuration Concepts

YAOOK Operator manages several different components, most of which need their own configuration. The key challenge is to manage the flow of configuration from the user (you) to the service (e.g. Keystone).

As mentioned earlier, one of the key motivations for YAOOK and one of its core ideas is to keep configuration simple, but in the right ways. Obviously there need to be ways to specify the configuration of all components exactly (unless a value would break things in 100% of cases, e.g. an incorrect database password). However, it is equally important that, when things don’t work as they should, you can debug why that is the case. That often involves understanding where configuration comes from (or why it differs from what you expect it to be). It must be obvious how to change a configuration setting.

CUE primer

In order to achieve this, we use CUE. In a single sentence, CUE is a data language which can be used to describe schemas with defaults, merge data into those schemas, and export the result, for instance as JSON. Almost all configuration we pass to the services we manage is processed by CUE.

You won’t have to understand the full details of how CUE works, so we will just scratch the surface here. Let’s start with an example. Assume we have two files, each containing a snippet of Nova configuration provided by the user. Because CUE works with JSON-like formats, we represent the OpenStack INI-like format as JSON, which mostly works.

foo.cue
{
    "DEFAULT": {
        "debug": true
    }
}
bar.cue
{
    "DEFAULT": {
        "use_syslog": true
    },
    "libvirt": {
        "virt_type": "qemu"
    }
}

Now we can ask CUE to process these two snippets in the same way we process configuration [2] in YAOOK and emit it back as JSON:

$ cue export --out json foo.cue bar.cue
{
    "DEFAULT": {
        "debug": true,
        "use_syslog": true
    },
    "libvirt": {
        "virt_type": "qemu"
    }
}

As we can see, JSON objects were nicely merged. Now let us try with a third block:

baz.cue
{
    "libvirt": {
        "virt_type": "kvm"
    }
}

What happens if we also try to include that?

$ cue export --out json foo.cue bar.cue baz.cue
libvirt.virt_type: conflicting values "kvm" and "qemu":
    bar.cue:1:1
    bar.cue:4:26
    baz.cue:1:1
    baz.cue:3:26

CUE rejects the input! This is because CUE does not allow conflicting values, which is part of why we chose it. The idea here is that if we don’t allow conflicting values, there can never be ambiguity as to which configuration value will end up in a configuration file. If we tried to specify this configuration in a custom resource, the reconciliation run would fail (early) with an InvalidConfiguration state.

Note

There is much more to CUE, such as type validation. In fact, CUE does allow for one level of overriding (or rather: one level of defaults), but this cannot be specified via JSON input and is thus not available to the user. YAOOK Operator uses this in order to specify defaults for some options which OpenStack requires and for which we want to provide an opinionated default (however, other configuration options are hardwired by YAOOK and cannot be overwritten, such as database connection strings, which are forced to match the actual hostname/port of the database managed by YAOOK).

Providing OpenStack configuration

Most YAOOK custom resources which represent OpenStack services have a field like keystoneConfig or novaConfig. The usage inside e.g. a KeystoneDeployment resource looks like this:

apiVersion: yaook.cloud/v1
kind: KeystoneDeployment
metadata: ..
spec:
    keystoneConfig:
        DEFAULT:
            debug: true
            use_syslog: false
        database:
            db_max_retries: 10
    ..

This would generate an OpenStack configuration file like this:

[DEFAULT]
debug=true
use_syslog=false

[database]
db_max_retries=10

… plus a lot of things injected by YAOOK Operator, such as the database connection settings.

Sometimes this is not enough. If you need to provide configuration options which are confidential, you may not want to put them into a KeystoneDeployment object but would prefer to keep them in a Kubernetes Secret, for instance because these have encryption at rest enabled while KeystoneDeployments do not. For this use case, there is another field called (for Keystone) keystoneSecrets (other services have that, too).

This allows you to inject keys from Kubernetes Secrets into the configuration. For example, to configure an LDAP password, you could use:

---
apiVersion: v1
kind: Secret
metadata:
    name: ldap-config
    ..
data:
    password: bm9tb3Jlc2VjcmV0
---
apiVersion: yaook.cloud/v1
kind: KeystoneDeployment
metadata: ..
spec:
    keystoneConfig:
        DEFAULT:
            debug: true
    keystoneSecrets:
    - secretName: ldap-config
      items:
      - key: password
        path: /ldap/password
    ..

This has the same effect as writing nomoresecret into spec.keystoneConfig.ldap.password, with the exception that you don’t expose the password in the KeystoneDeployment resource.
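For comparison, the equivalent (but less confidential) configuration expressed purely via keystoneConfig would be:

apiVersion: yaook.cloud/v1
kind: KeystoneDeployment
metadata: ..
spec:
    keystoneConfig:
        DEFAULT:
            debug: true
        ldap:
            password: nomoresecret
    ..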

Note

At the time of writing, only string values can be injected into the configuration this way. Keys expecting non-string values will fail the CUE typecheck, because the data extracted from secrets is always treated as a string.

See also

KeystoneDeployment.spec

Reference documentation of the KeystoneDeployment spec, mentioning the structure of keystoneConfig and keystoneSecrets.

Note

If you want to inspect configuration generated by YAOOK, you can do so by looking for Secrets related to the resource, for instance with:

$ kubectl -n yaook get secret -l state.yaook.cloud/parent-plural=keystonedeployments
NAME                             TYPE                DATA   AGE
credential-keys-6dlc4            Opaque              3      5d23h
fernet-keys-pn8hc                Opaque              3      5d23h
keystone-admin                   Opaque              6      5d23h
keystone-api-certificate-67p8d   kubernetes.io/tls   3      5d23h
keystone-api-db-user-hpkhc       Opaque              1      5d23h
keystone-config-vdr26            Opaque              1      2m45s
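If you want to look at the rendered configuration itself, you can decode the Secret data, for instance like this (the Secret name is taken from the listing above and will differ in your cluster; jq 1.6 or newer is assumed for @base64d):

$ kubectl -n yaook get secret keystone-config-vdr26 -o json \
    | jq -r '.data | map_values(@base64d)'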

Per-node Configuration

Some services, Nova and Neutron in particular, may need highly node-specific configuration options. In Nova, for instance, you may want to specify certain PCIe devices to be available for passthrough on some nodes. For Neutron, there commonly are “gateway nodes” which need access to upstream or so-called “provider” networks, which compute nodes don’t need or have.

In order to allow such fine-grained configuration, YAOOK supports injecting configuration into node-bound services (like nova-compute or Neutron agents) based on node labels. The syntax and semantics of these configTemplates are non-obvious, which is why we’ll take a bit of time here to go through them in more detail than the configTemplates API reference can.

We will use Nova as an example, because it is a bit simpler than the various Neutron services:

apiVersion: yaook.cloud/v1
kind: NovaDeployment
metadata: ..
spec:
    compute:
        configTemplates:
        - ..
    ..

Inside the configTemplates key, there is a list of templates. Each template consists of a nodeSelector which specifies to which nodes the configuration applies and further service-specific keys which correspond to the configuration to apply. These differ between Nova and the various Neutron services, but the nodeSelector concept is common.

The nodeSelector is similar to the nodeSelectorTerms you can find in the Kubernetes Node Affinity specification, but it only supports matchLabels (not matchExpressions):

configTemplates:
- nodeSelectors:
  - matchLabels:
      a: foo
      b: bar
  - matchLabels:
      c: baz

The terms under nodeSelector are OR-ed together, while the labels under matchLabels are AND-ed together. This means that the above example would select all nodes which appear in kubectl get node -l a=foo,b=bar plus all nodes which appear in kubectl get node -l c=baz.

A config template with only selectors is not useful; it does not assign any configuration. Going back to the specific example of Nova, one could use this to set specific options in the Nova Compute configuration like this:

apiVersion: yaook.cloud/v1
kind: NovaDeployment
metadata: ..
spec:
    compute:
        configTemplates:
        # First, we want to enable debug logs on all nodes
        - nodeSelectors:
          - matchLabels: {}  # this matches all nodes!
          novaComputeConfig:
            DEFAULT:
              debug: true

        # Then we want to set the virt_type based on a label value, either qemu or kvm
        - nodeSelectors:
          - matchLabels:
              compute.yaook.cloud/hypervisor: qemu
          novaComputeConfig:
            libvirt:
              virt_type: qemu
        - nodeSelectors:
          - matchLabels:
              compute.yaook.cloud/hypervisor: kvm
          novaComputeConfig:
            libvirt:
              virt_type: kvm
    ..

If we now have a node labelled compute.yaook.cloud/hypervisor=qemu, the following configuration would be generated for it:

[DEFAULT]
debug=true

[libvirt]
virt_type=qemu

In this example, it is not possible to generate conflicting configuration: the only key we set in multiple blocks is libvirt.virt_type, and those templates select strictly disjoint node sets (same label, but different value). If we had instead written:

- nodeSelectors:
  - matchLabels:
      compute.domain.example/virt-qemu: "true"
  novaComputeConfig:
    libvirt:
      virt_type: qemu
- nodeSelectors:
  - matchLabels:
      compute.domain.example/virt-kvm: "true"
  novaComputeConfig:
    libvirt:
      virt_type: kvm

and then labelled a node with compute.domain.example/virt-kvm=true and compute.domain.example/virt-qemu=true, the reconciliation would eventually fail (thanks to CUE) because we specified conflicting values for the same configuration key. This is a user error which must be resolved by the user.

Note

Some config template sections have entries which do not directly map to OpenStack configuration and are not processed by CUE. Their merging strategy is then explained in the corresponding API reference section.

Note

The configTemplates sections have no influence on scheduling; they only influence which configuration is loaded on nodes which are already selected by the corresponding scheduling keys. Still, it is oftentimes useful to re-use the label keys of the scheduling keys and give their values a meaning of your own.

Warning

The yaook.cloud domain and its subdomains are under control of the YAOOK project. Do not create your own labels or taints using that suffix, as you may run into severe problems should we at some point start to use the same label as you do.

Use your own domain. If you do not have a domain, you may either use one of the well-known reserved TLDs (test, example, invalid), or, if you need to go into production without a domain (even though it’s not clear to us how that could be), you are free to generate a version 4 UUID (with uuidgen --random) and use humanreadablename-<UUID>.yaook.cloud as the suffix (substitute humanreadablename with something sensible identifying your setup). The UUID ensures that no conflicts will occur with anything official YAOOK does.

See also

NeutronDeployment.spec.setup.ovn.controller.configTemplates[index]

Reference for the OVN controller configTemplate

NeutronDeployment.spec.setup.ovs.l2.configTemplates[index]

Reference for the OpenvSwitch L2 agent configTemplate

NeutronDeployment.spec.setup.ovs.l3.configTemplates[index]

Reference for the OpenvSwitch L3 agent configTemplate

NeutronDeployment.spec.setup.ovs.dhcp.configTemplates[index]

Reference for the OpenvSwitch DHCP agent configTemplate

NovaDeployment.spec.compute.configTemplates[index]

Reference for the Nova Compute configTemplate

Immutability

All configuration assembled by the Operators is stored in Kubernetes Secrets for use in Pods, and those Secrets are marked as immutable.

This has two key advantages:

  1. It improves kubelet performance because kubelet does not need to watch for changes on those secrets (to update them in the Pods).

  2. Configuration cannot be changed without a re-rollout of a workload controller (Deployment, StatefulSet, or similar).

The latter is important to control the rollout of new configuration. Imagine someone changed their Nova configuration to add a new scheduler filter and accidentally mistyped the name of the scheduler plugin. This would cause the scheduler to crash with nova.exception.SchedulerHostFilterNotFound: Scheduler Host Filter foobar could not be found.. If the Secret was not immutable but updated in-place, a random restart of the scheduler would cause it to load this faulty configuration and enter a broken state.

With the Operator approach of having immutable configurations with random names (via generateName), new configuration is only ever loaded when a new Pod is created (via the controller for the Deployment/StatefulSet). Those will generally avoid tearing down old replicas if the new replicas don’t become ready, thus avoiding fatal breakage.

The old configuration Secret will automatically be deleted once it is not in use anymore by the Pods.

A downside is that you cannot manually in-place edit the configuration Secret and expect it to be updated in the Pods for ad-hoc debugging actions. This turns out to be rarely needed; if you absolutely need it, you can pause the resource, copy the Secret into a new object and edit the workload controller manually.
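To check which generated Secret a workload is currently consuming, the standard Kubernetes fields suffice; a sketch (the Deployment name is a placeholder, and we assume the configuration is mounted via a Secret volume):

$ kubectl -n yaook get deployment keystone-api \
    -o jsonpath='{.spec.template.spec.volumes[*].secret.secretName}'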

Day-two Operations

At this point, you know a lot about how YAOOK works. As you can see, there is a lot of complexity in there. Even though some things are simpler than in other configuration management systems (the thing where we don’t have multiple config precedence levels, mostly), many things are more complex and there are more moving parts. So what is the gain?

YAOOK Operator runs continuously and gets live information about the cluster status via the Kubernetes API. This allows it to do two things:

  1. It can react to changes in system state which occur during rollouts.

  2. It can take autonomous action without involvement of a human operator if the situation is “clear”.

As a side-effect of the first, it can also more easily execute complex orchestrated operations, such as release upgrades and node draining.

Rolling Out Changes

Whenever you modify a top-level resource (or, indeed, any CR managed by YAOOK), a rollout is triggered. The rollout proceeds until it eventually reaches its goal of applying the changes to all child resources.

As briefly mentioned in Configuration Concepts, YAOOK Operator keeps all configuration it creates immutable. That means that whenever configuration changes, a rolling restart of all consumers of that configuration is executed.

This is the most resilient way of deploying new configuration, as the workload resources are only considered ready by the statemachine (see the explanations in The YAOOK Statemachine) if all of their Pods are ready and up-to-date. That means that faulty configuration will bring the rollout to a halt on the first issue and, thanks to how Kubernetes Services ignore unready endpoints, the impact will be limited.

The same logic also holds for the subservice resources (see OpenStack as Custom Resources): Things like Nova Compute Nodes are not reconfigured in place but will be recreated (according to their disruption budget, see below).

Rollouts may get stuck for various reasons, which will be visible in the resource status as described in Reconciliation Process. When a rollout gets stuck, manual intervention is required to make it proceed.

  • If a resource is stuck in the same WaitingForDependency state for a long time (with only brief Updating states in between), you need to investigate the components referenced in that state to understand why they are not becoming ready. For inspiration on how to do that, see The Component Graph.

  • If a resource ends up in InvalidConfiguration, you need to look at the error message and fix the issue it is reporting. Oftentimes this will be a type mismatch in the configuration or an attempt to set two conflicting values.

  • If a resource ends up in BackingOff, you may have a more severe issue. Check the error message as it is generally unexpected.

  • If a resource is updated successfully, but its subservices (Nova Compute nodes etc.) are not getting rolled out, check the status of these subresources. If they get stuck, most of the time they get stuck in eviction, which we’ll discuss next.

Eviction

Eviction is what we call the process when all user workload needs to be removed from a node. This is similar to a Kubernetes node drain, however, it works on the IaaS layer and not on the Kubernetes layer.

This distinction is important: Unfortunately, there is no absolutely reliable way to detect a node drain (as opposed to a node cordon, which sets a taint). That means that the Operator cannot know when you issue a kubectl drain and cannot act accordingly. For you, as a user, that means you cannot use kubectl drain to clear a node running Nova Compute or Neutron Agents. Attempting to do so will in the best case block, in the worst case disrupt customer workload.

Instead, you need to first remove the workload by unlabelling the node (see Scheduling: Labels & Taints). This will make YAOOK Operator gently remove the services from the node using the process we call eviction.
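For example, to remove the hypervisor role from a node and thereby trigger eviction of the Nova Compute service on it (the node name is a placeholder; the trailing - removes the label):

$ kubectl label node compute-1 compute.yaook.cloud/hypervisor-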

Not all YAOOK resources are eligible for eviction. It only affects stateful, non-replicated services like Neutron agents and Nova Compute services. The following is a complete list of resources which perform eviction:

  • NovaComputeNode

  • NeutronL2Agent

  • NeutronL3Agent

  • NeutronDHCPAgent

  • NeutronBGPDRAgent

Note

As you may notice, OVN resources do not require eviction, as the OVN layer is resilient and fast enough to fail over routers in the timeframe the OVN controller gets from Kubernetes for a clean shutdown.

In the future, mechanisms to distribute the failed over workload more evenly may be added, but currently it does not seem like those will use the eviction mechanism.

Eviction is triggered whenever the Operator deems the service unfit to continue serving customer workload. The following reasons exist for an eviction:

  • NodeDown: The node was deleted from the Kubernetes API or received taints indicating that Kubernetes detected an issue with the node (NotReady).

  • DesiredState: Nova Compute Nodes have a spec.state field which allows the user to set it to DisabledAndCleared. Evictions caused by that use this reason (see the sketch after this list).

  • Deleting: Either the custom resource or the Kubernetes Node itself is being deleted (deletionTimestamp set). This happens during rolling upgrades.
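As a sketch of the DesiredState case from the list above (the plural novacomputenodes and the resource name are assumptions; check the actual names in your cluster, e.g. via kubectl api-resources):

$ kubectl -n yaook patch novacomputenodes compute-1-abcde --type merge \
    -p '{"spec": {"state": "DisabledAndCleared"}}'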

Note

Once started, an eviction cannot be safely stopped by the user until it has completed.

The reason can be obtained from the status.eviction field. If it is null, no eviction is currently going on. During an ongoing eviction, yaookctl status openstack will show the progress.
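To check this field on a specific resource (again, the plural and the resource name are placeholders):

$ kubectl -n yaook get novacomputenodes compute-1-abcde \
    -o jsonpath='{.status.eviction}'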

The eviction process blocks the resource from being deleted (via a finalizer) until all resources have been successfully removed. This in turn may block a rollout of a new version or configuration.

The eviction itself is run as a Kubernetes Job. You can find that Job by looking for evict or by running kubectl -n yaook get job -l state.yaook.cloud/parent-name=.. (inserting the name of the CR which is being evicted). You can then inspect the Job’s logs for details on the progress.

When an eviction gets stuck, there are several possible reasons. Here are a few of them (there may be other special broken cases you then need to investigate separately):

  • There is no space for the workload elsewhere: The workload may in this case be marked as stuck in the yaookctl status openstack output [3]. You then need to make space available.

  • Live (or offline) migration for VMs is broken. The symptom of this is that VMs enter the MIGRATING state, but appear as ACTIVE again in the next iteration of the job log. In this case, the compute logs of the source and destination hosts are the most interesting place to look.

  • The node is severely broken and cannot handle migration requests. In this case, manual intervention is required to force-down the node in order to allow an evacuation (which loses data and is thus not done automatically by YAOOK).
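As a rough, hedged sketch only: forcing a broken node's compute service down is done through the OpenStack API, for instance via the OpenStack CLI (the host name below is a placeholder, and flag availability depends on your client and API microversion). Only do this when you accept the consequences of an evacuation.

# Mark the nova-compute service on the broken host as forced down so that
# servers on it can be evacuated.
$ openstack compute service set --down my-hypervisor-01 nova-compute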

If an eviction gets stuck and cannot be resolved, there are two options: Either you can terminate the workload more or less gracefully to remove it from the node, or you can force YAOOK to proceed without eviction (only for evictions which are not triggered by setting the DisabledAndCleared state).

For this, there is the yaookctl force-upgrade command. This command is dangerous. What it does, technically, is remove all finalizers from the object you run it against, as well as issue a deletion if necessary. This will cause, for instance, Nova Compute to be deleted from a node.

Often this works out fine: Nova Compute will eventually return (unless you just removed the compute label from the node) and it will even find the (presumably) running VM. However, this is never safe; all kinds of things may go wrong during this: You could be rolling out a broken configuration which prevents nova-compute (or Neutron layer 2) from starting; the magic which allows Nova Compute to find the VM again could fail; you might have mistaken the reason an eviction is happening in the first place and the node is being deleted now; demons could come flying out of your nose.

You should only use force-upgrade as a last resort: when the only other option would be deleting the workload.

Disruption Budgets

For the subservices, it is possible to set up YAOOK Disruption Budgets to define how many of them may go down at the same time. By default, YAOOK will only ever take one such service down at the same time. If you define disruption budgets for groups of nodes, you can allow more (or less!) than one service to be taken down at the same time.

YAOOK Disruption Budgets work on node labels (not workload labels, in contrast to Pod Disruption Budgets). The syntax is rather simple:

apiVersion: yaook.cloud/v1
kind: YaookDisruptionBudget
metadata:
    name: large-interchangeable-hypervisor-group
    namespace: yaook
spec:
    maxUnavailable: 10%
    nodeSelectors:
    - matchLabels:
        yaook.domain.example/hypervisor-type: common

Such a disruption budget would allow up to 10% of the subservices on nodes matching yaook.domain.example/hypervisor-type=common to go down at the same time. This can tremendously speed up upgrades if you have hundreds of nodes.

A node can only be part of a single YaookDisruptionBudget; if you attempt to assign a node to more than one such budget, the corresponding Operator will fail the reconciliation with an InvalidConfiguration status and a message detailing which budgets overlap.

Warning

If you specify a disruption budget which allows for too much disruption at the same time, you may run into an eviction deadlock. For instance, with the above budget, if your cloud were 95% full, live migration off the to-be-evicted nodes would be bound to fail with “no valid hosts found”. As the nodes are not getting cleared out, the rollout would get stuck.

Upgrades

YAOOK Operator natively supports upgrading between OpenStack releases. It is as simple as just setting a new targetRelease on the corresponding CR.

Not all releases are supported by YAOOK Operator, nor is an online upgrade supported between all pairs of releases. If you attempt an unsupported upgrade operation, the Operator will abort the rollout with InvalidConfiguration. You can then safely return the targetRelease to its previous value.

To observe the upgrade process, yaookctl status openstack is the recommended tool. You can also manually inspect the status.installedRelease, status.nextRelease and status.conditions fields.
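As a hedged sketch (the resource kind, object name, and release value below are placeholders, not a confirmed API), an upgrade could be triggered and observed roughly like this:

# Set the new target release on the custom resource (names are illustrative).
$ kubectl -n yaook patch novadeployments my-nova --type merge -p '{"spec": {"targetRelease": "zed"}}'
# Watch the rollout, or inspect the release fields directly.
$ yaookctl status openstack
$ kubectl -n yaook get novadeployments my-nova -o jsonpath='{.status.installedRelease} -> {.status.nextRelease}'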

During upgrades, special operations like database schema upgrades are executed according to the OpenStack release notes. In addition to that, an upgrade behaves like a configuration change affecting all services. This means in particular that all your Nova Compute services and Neutron agents will see a rolling restart.

In addition, infrastructure services (such as MySQLServices) belonging to an OpenStack service will be upgraded according to the supported releases of the respective OpenStack release. These will also be upgraded in a rolling fashion and minor version by minor version, in order to ensure that neither availability nor data is lost.

Tracing

Tracing can be very useful for performance debugging/profiling of the operator.

With tracing enabled (via YAOOK_OP_TRACING_ENABLED), the Operator will generate one trace per Custom Resource reconciliation. This trace consists of one span for each major time-consuming function.
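For example, when running the Operator locally, tracing might be enabled roughly like this (the value "true" is an assumption; see the Environment Variable Reference for the exact semantics):

# Enable trace generation before starting the Operator process.
$ export YAOOK_OP_TRACING_ENABLED=true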

Example output:

  • Trace: neutron_ovn_operator _reconcile_cr - cmp001 - 30s

    • Span: …

    • Span: neutron_ovn_operator _reconcile_ovs_vswitchd_certificate - 3.63s

    • Span: neutron_ovn_operator _reconcile_ca_certs - 226.06ms

    • Span: neutron_ovn_operator _reconcile_ovs_vswitchd - 4.17s

    • Span: …

The target can be any OpenTelemetry Jaeger-compatible destination. The easiest way to trace the Operator locally is to spin up the jaegertracing all-in-one container:

$ docker run -d -p6831:6831/udp -p6832:6832/udp -p16686:16686 -p14268:14268 jaegertracing/all-in-one:latest
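The Jaeger UI is then reachable on port 16686 (i.e. http://localhost:16686), where the collected traces and spans can be browsed.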

Final Words

Hopefully, this user guide helped you understand YAOOK Operator better. Let us end this journey with a “further reading” block:

API Reference

A full reference of all custom resources provided by YAOOK, including their fields.

Finalizer Reference

A reference on finalizers which exist in YAOOK and what to do if you (accidentally?!) removed them.

Environment Variable Reference

A reference on environment variables read by YAOOK Operator processes.

Helm Chart Values Reference

A reference on the values accepted by the YAOOK Helm charts.

Coming soon

A full installation guide.

Glossary

component

A single piece of an Operator state machine. Oftentimes corresponds to a Kubernetes resource. A component has a name by which it is identified for instance in log messages. See The Component Graph for details.

component graph

A directed acyclic graph where the nodes are components and the edges are dependencies. This is the basis on which reconciliation works in YAOOK. See The YAOOK Statemachine for details.

eviction

Graceful removal of stateful workload from nodes which are no longer suited to run it (because they are being decommissioned or are broken). See Eviction for details.

reconciliation
reconciliation run

A single execution of an Operator state machine until no further action is possible (either an error occurred or all components are updated or not ready). See Reconciliation Process for details.

rollout

The sequence of one or more reconciliation runs from a change to a custom resource until that resource is back in Updated state.

statemachine

Two meanings:

  • The state machine implementing a custom resource, derived from the component graph of an Operator. See The YAOOK Statemachine for details.

  • The Python package yaook.statemachine which implements the above.