Safe eviction
Motivation
Avoid data loss at all cost
Avoid data plane disruption wherever possible
Recover from data plane disruption
Affected Services
Nova Compute
Neutron L3 Agent
Neutron DHCP Agent
Neutron BGP Agent
Neutron L2 Agent
OVN Agent
OVN BGP Agent
Surroundings / Kubernetes Tools
Finalizers
Finalizers DO prevent deletion of an object from the API
Finalizers DO NOT prevent termination of containers in a Pod
Container Lifecycle Hooks
The preStop hook allows executing code inside a container or sending an HTTP request to it
Execution time is bounded by the termination grace period (per-Pod setting or deletion request setting)
Docs say:
Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container.
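As a rough illustration of the two settings mentioned above, a Pod manifest with a preStop hook and a termination grace period could look like the following, written here as a Python dictionary. The field names are the standard Kubernetes Pod fields; the pod name, image, script path and the value of 600 seconds are placeholders chosen for this sketch and do not come from this document.

```python
# Hypothetical Pod manifest: name, image and command are placeholders.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-agent"},
    "spec": {
        # Upper bound (in seconds) for the preStop hook plus container shutdown.
        "terminationGracePeriodSeconds": 600,
        "containers": [
            {
                "name": "agent",
                "image": "example.invalid/agent:latest",
                "lifecycle": {
                    "preStop": {
                        # Runs inside the container before it is stopped.
                        "exec": {
                            "command": ["/bin/sh", "-c", "/usr/local/bin/save-state.sh"]
                        }
                    }
                },
            }
        ],
    },
}


def prestop_command(manifest, container_name):
    """Return the preStop exec command of a container, or None if unset."""
    for container in manifest["spec"]["containers"]:
        if container["name"] == container_name:
            hook = container.get("lifecycle", {}).get("preStop", {})
            return hook.get("exec", {}).get("command")
    return None
```

The hook only gets as much time as the grace period allows, so any state-saving script has to finish within that window.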
Node Draining
Implemented by adding Evictions for matching Pods
Protection of DaemonSets is HARDCODED in kubectl against the DaemonSet resource!! -> Our CDSes are unprotected and will be evicted by Drain!!
Not useful at all for our use cases
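The following is a simplified model of why Drain does not protect our Pods, not the actual kubectl code. It assumes that the protection is decided by looking at the kind of the controlling owner reference of a Pod, and that Pods managed by a CDS carry an owner reference of kind ConfiguredDaemonSet; the function names and pod names are made up for this sketch.

```python
def controller_kind(pod):
    """Return the kind of the owner reference marked as controller, if any."""
    for ref in pod.get("metadata", {}).get("ownerReferences", []):
        if ref.get("controller"):
            return ref.get("kind")
    return None


def skipped_by_drain(pod):
    """Simplified model: only Pods controlled by a built-in DaemonSet are skipped."""
    return controller_kind(pod) == "DaemonSet"


# Pod owned by a built-in DaemonSet (hypothetical names).
daemonset_pod = {
    "metadata": {
        "name": "node-exporter-abc12",
        "ownerReferences": [{"kind": "DaemonSet", "controller": True}],
    }
}

# Pod owned by a custom resource (assumed kind name for a CDS).
cds_pod = {
    "metadata": {
        "name": "example-compute-xyz89",
        "ownerReferences": [{"kind": "ConfiguredDaemonSet", "controller": True}],
    }
}
```

Under this model `skipped_by_drain(daemonset_pod)` is true while `skipped_by_drain(cds_pod)` is false, which is the gap described above.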
Implementation
Nova Compute
Finalizers on Pods are not sufficient to protect anything worth saving
We need to prevent the CDS from descheduling Pods until the node has been evicted
Approach:
Subclass ConfiguredDaemonSetState to prevent descheduling of nodes which still have state which needs to be saved
Track nodes with state using finalizers? or annotations?
On reconcile, trigger task (in the queue? how?) which performs the eviction (via OpenStack API)
Once eviction is complete, trigger reconcile which will then allow the node to be cleared
If the node is unreachable (detect how? OpenStack API + k8s pod/node status?), perform a hard eviction and delete the pod in parallel. If ironicNodeShutdown is specified in the eviction, the node gets shut down.
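The approach above can be read as a small decision procedure per node. The sketch below is only one possible reading: the action names and the three boolean inputs are hypothetical, and how "has state" and "reachable" are actually determined is left open, as in the list above.

```python
from enum import Enum


class Action(Enum):
    ALLOW_DESCHEDULE = "allow_deschedule"
    START_EVICTION = "start_eviction"
    WAIT_FOR_EVICTION = "wait_for_eviction"
    HARD_EVICT_AND_DELETE_POD = "hard_evict_and_delete_pod"


def next_action(has_state, eviction_running, reachable):
    """Decide what a reconcile run should do for one node.

    has_state: the node still holds state which needs to be saved.
    eviction_running: an eviction task has been triggered and is not finished.
    reachable: the node is considered reachable (detection method is open).
    """
    if not has_state:
        # Nothing left to save: the node may be cleared.
        return Action.ALLOW_DESCHEDULE
    if not reachable:
        # Unreachable node: hard eviction and pod deletion in parallel.
        return Action.HARD_EVICT_AND_DELETE_POD
    if eviction_running:
        return Action.WAIT_FOR_EVICTION
    return Action.START_EVICTION
```

Once the eviction task finishes, the triggered reconcile would see `has_state` as false and allow the node to be cleared.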
Challenges:
Trying to mess with lifecycle management of CDS is probably not a wise idea
Need to keep track of the state the nodes are in separately (where?)
To overcome those challenges, we decided to split the Nova Operator into two
Operators: one for the big picture (nova) and one which manages the
individual compute nodes (nova_compute). This also means that there is a
new resource (NovaComputeNode).
Neutron L2 Agent
Like Nova Compute, Finalizers on Pods don’t help us here
Components that depend on NeutronL2Agent are NeutronL3Agent, NeutronDHCPAgent, NeutronBGPDRAgent, NovaComputeNode
Approach to safely schedule or evict them:
Use annotations and labels on the nodes running the agents/services
L2 operator sets label maintenance.yaook.cloud/maintenance-required-l2-agent: False after the L2 agent has been successfully created.
If the L2 agent needs to be removed (e.g. for updating configuration), the operator sets maintenance.yaook.cloud/maintenance-required-l2-agent: True
Operators (nova, neutron) creating resources that need the L2 agent won’t schedule them on nodes which don’t have the label maintenance.yaook.cloud/maintenance-required-l2-agent: False (so either a missing label or the value True prevents scheduling resources on the node)
Operators responsible for agents/services needing the L2 agent set an annotation l2-lock.maintenance.yaook.cloud/*: ‘’ on the node at the very beginning of reconcile
The annotation l2-lock.maintenance.yaook.cloud/*: ‘’ is removed after the agent/service has been deleted. This way, each agent/service can be safely evicted before the L2 agent is touched.
The L2 operator waits until all l2-lock.maintenance.yaook.cloud/*: ‘’ annotations have been removed from the node. Before that, the L2 agent won’t be touched by the operator
Once all l2-lock.maintenance.yaook.cloud/*: ‘’ annotations are gone, L2 operator will delete the L2 agent
After L2 agent is updated/recreated, the label maintenance.yaook.cloud/maintenance-required-l2-agent: False is set again
We decided to retain the maintenance-required annotation on the node, even after the L2 agent has been deleted. That way, if the L2AgentResource is not deleted right away by k8s and the pods are still there, other operators still see that maintenance is required.
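The label and lock handshake described in the list above can be sketched with a node modelled as a plain dictionary. This is an illustration of the protocol only, not the operators' code: the function names are hypothetical, and the lock suffixes used in the usage note (the part written as * above) are made-up holder names.

```python
MAINTENANCE_LABEL = "maintenance.yaook.cloud/maintenance-required-l2-agent"
LOCK_PREFIX = "l2-lock.maintenance.yaook.cloud/"


def new_node():
    """Model a node as its labels and annotations only."""
    return {"labels": {}, "annotations": {}}


def can_schedule_dependent(node):
    """Dependents are only scheduled if the label is explicitly "False"."""
    return node["labels"].get(MAINTENANCE_LABEL) == "False"


def acquire_lock(node, holder):
    """Set the lock annotation at the very beginning of reconcile."""
    node["annotations"][LOCK_PREFIX + holder] = ""


def release_lock(node, holder):
    """Remove the lock annotation after the agent/service has been deleted."""
    node["annotations"].pop(LOCK_PREFIX + holder, None)


def held_locks(node):
    """Return all lock annotations currently present on the node."""
    return sorted(k for k in node["annotations"] if k.startswith(LOCK_PREFIX))


def l2_agent_created(node):
    """L2 operator: the agent is up, dependents may be scheduled."""
    node["labels"][MAINTENANCE_LABEL] = "False"


def request_l2_removal(node):
    """L2 operator: flag maintenance; return True once the agent may be deleted."""
    node["labels"][MAINTENANCE_LABEL] = "True"
    return not held_locks(node)
```

A typical sequence under these assumptions: after `l2_agent_created(node)` a dependent calls `acquire_lock(node, "example-l3-agent")`; `request_l2_removal(node)` then returns false and blocks new scheduling until the dependent has been evicted and calls `release_lock(node, "example-l3-agent")`, after which it returns true.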
Implementation details:
We added an L2Lock that will be used by each operator needing L2 to set the l2-lock.maintenance.yaook.cloud/*: ‘’ annotation.
Introduced subclass L2AwareStatefulAgentResource of StatefulAgentResource, which every agent resource that needs L2 inherits from. It is used to check whether the label maintenance.yaook.cloud/maintenance-required-l2-agent: False is set, so that the agent/service can be scheduled on the node.
The L2 operator has its own L2StateResource instead of inheriting from APIStateResource, so the specific behavior can be implemented there. This class adds the label maintenance.yaook.cloud/maintenance-required-l2-agent: False to the node after the L2 agent is created and changes it to True on deletion. It also waits until all the maintenance locks are gone from the node.
Neutron OVN Agents
Components that depend on OVNAgent are NovaComputeNode and OVNBGPAgent.
For scheduling the dependent agents on the node we use the same mechanism
and the exact same annotation, i.e.
maintenance.yaook.cloud/maintenance-required-l2-agent: False,
in order to make sure the node meets the prerequisite of OVNAgent before they
are deployed.
Similar to the Neutron L2 agent, the components that depend on OVNAgent,
NovaComputeNode and OVNBGPAgent, make use of
l2-lock.maintenance.yaook.cloud/*: ‘’ in order to inform the Neutron operator
not to evict the OVNAgent from the node unless the locks are released.