Kubernetes is the leading orchestrator for distributing container instances across multiple physical nodes. The nodes are managed by the Kubernetes control plane, a collection of components which maintain the cluster’s state, respond to changing conditions, and handle scheduling decisions.

It’s essential to understand the control plane’s role when you’re operating clusters that need consistent availability. In this article, you’ll learn what happens when the control plane fails so you can plan ahead and implement protections.

Understanding the Control Plane

The Kubernetes control plane is responsible for your cluster’s global operations. It coordinates actions that affect your worker nodes. The control plane also hosts the cluster’s etcd data store, as well as the API server that you interact with using tools like Kubectl.

Here are the control plane’s main components and their responsibilities:

- kube-apiserver hosts the Kubernetes API server.
- kube-controller-manager starts and runs the controllers within your cluster, allowing state changes requested through the API server to be detected and applied.
- kube-scheduler assigns Pods to worker nodes by determining which node is best equipped to support each new Pod.
- etcd is a key-value data store that holds all Kubernetes cluster data and state information.
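
On a kubeadm-provisioned cluster you can see these components directly, because they run as static Pods in the kube-system namespace (managed cloud services usually hide them from you). A quick way to check while the API server is healthy; the exact Pod names include each host’s name:

    # List control plane components on a kubeadm cluster
    kubectl get pods -n kube-system
    # Expect entries like kube-apiserver-<host>, kube-controller-manager-<host>,
    # kube-scheduler-<host>, and etcd-<host>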

The Kubernetes architecture relies on these components being continually available. They work together to keep the cluster’s actual state converging on the state you’ve declared, which is what makes everything appear to run smoothly.

What Happens During Control Plane Failure?

The control plane is not impervious to failure. The high number of components involved means individual pieces can stop working and cause knock-on effects in your cluster. A component might crash or the physical host running the control plane could suffer a hardware failure.

The actual effects on your cluster will vary depending on which component has the problem. However, a control plane failure will usually prevent you from administering your cluster and could stop existing workloads from reacting to new events:

- If the API server fails, Kubectl, the Kubernetes dashboard, and other management tools will stop working.
- If the scheduler fails, new Pods won’t get allocated to nodes, so they’ll be inaccessible and show as stuck in the Pending state. This will also affect Pods that need to be rescheduled because a Node’s run out of resources or a scaling request has been sent.
- When the controller manager fails, changes you apply to your cluster won’t be picked up, so your workloads will appear to retain their previous state.
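
When the API server is still reachable but Pods appear stuck, their events usually reveal whether scheduling is the problem. A minimal check, with placeholder names for the Pod and namespace:

    # Find Pods that have never been scheduled to a node
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending

    # Inspect one of them; a healthy scheduler records a "Scheduled" event,
    # while a failed scheduler typically leaves the Events section empty
    kubectl describe pod <pod-name> -n <namespace>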

Control plane failures prevent you from effectively modifying cluster state. Changes will either fail altogether or have no effect inside the cluster.

What About Worker Nodes and Running Pods?

The control plane is a management layer that sits above and spans across your worker nodes and their Pods. Day to day, the control plane and the workers run largely independently of each other. Once a Pod’s been scheduled to a node, that node becomes responsible for pulling the correct image and running a container instance.

This means failures in the control plane won’t necessarily knock out workloads that are already in a healthy state. You can often continue accessing existing Pods, even when you can’t connect to your cluster with Kubectl. Users won’t necessarily notice a short-term control plane outage.
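
You can see this separation in practice if a workload is exposed through a NodePort Service. During an API server outage, management commands fail while traffic sent straight to a worker node is still answered; the address and port below are placeholders:

    # Fails during the outage, with an error along the lines of
    # "The connection to the server <endpoint>:6443 was refused"
    kubectl get pods

    # Still works: the request goes directly to the worker node,
    # which keeps serving its existing containers
    curl http://<worker-node-ip>:30080/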

Longer periods of downtime increase the probability that worker nodes will start to face issues too. Nodes won’t be able to reconcile their state, so inconsistencies could occur. Networking can also start to break down if DNS stops working and previously cached responses expire.

A failure can become more serious if a worker node starts to experience problems while the control plane is down. In this situation Pods on the node may stop running but the rest of the cluster will be oblivious to what’s happening. It’ll be impossible to reschedule the Pods to another node as nodes operate independently in the control plane’s absence. This will cause your workload to drop offline.

Avoiding Control Plane Failure

You can defend against control plane failure by setting up a highly available cluster that replicates control plane functions across several machines. In the same way you use Kubernetes to distribute and scale your own containers, you can apply high availability (HA) to Kubernetes itself to increase resiliency.

Kubernetes offers two mechanisms for setting up an HA control plane implementation:

- Using “stacked” control plane nodes. This approach requires less infrastructure and works with a minimum of three machines. Each machine runs its own control plane that replicates data from the others. One host assumes responsibility for the cluster by being designated as the leader; if the leader goes offline, the other nodes will notice its absence and elect a new leader. You ideally need an odd number of hosts, such as 3, 5, or 7, so leader elections can always reach a majority. There’s a sketch of bootstrapping this setup with kubeadm after this list.
- Using an external etcd datastore. This approach is similar to the stacked model but with one key difference: it relies on an external etcd cluster that’s shared by your control plane nodes, instead of each node replicating its own copy of the data. You should consider setting up replication for the external etcd cluster itself so it doesn’t become a separate point of failure.
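
As an illustration of the stacked approach, this is roughly how the first node is bootstrapped with kubeadm. It’s a sketch, not a full HA walkthrough, and LOAD_BALANCER_DNS:6443 is a placeholder for a stable endpoint (typically a load balancer) sitting in front of your control plane nodes:

    # Initialize the first control plane node behind a shared endpoint;
    # --upload-certs stores the certificates so further nodes can fetch them
    sudo kubeadm init \
      --control-plane-endpoint "LOAD_BALANCER_DNS:6443" \
      --upload-certs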

Kubernetes now has good support for clusters with several control plane nodes. If you administer your own cluster, you can add another control plane node by including the --control-plane flag when you run the kubeadm join command:
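
A sketch of what that looks like; every value here is a placeholder taken from the output that kubeadm init prints on the first node:

    # Join a new machine as an additional control plane node
    sudo kubeadm join LOAD_BALANCER_DNS:6443 \
      --token <token> \
      --discovery-token-ca-cert-hash sha256:<hash> \
      --control-plane \
      --certificate-key <certificate-key>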

Summary

The Kubernetes control plane is responsible for maintaining cluster-level operations. It oversees your worker nodes, handles API requests, and applies actions inside the cluster to achieve your desired state.

When the control plane goes down, these functions will be unavailable but you should be able to continue using existing Pods for a limited period. The control plane is what stitches nodes together to form a cluster; without it, nodes are forced to operate independently, without any awareness of each other.

As the control plane is a single point of failure in a standard cluster, mission-critical deployments need to replicate it across multiple master nodes to maximize reliability. Multi-master clusters distribute cluster management functions in a similar way to how multiple worker nodes make your containers highly available. Although they can be trickier to set up, the extra redundancy is worth the effort. A highly available control plane is also offered as a feature of many cloud providers’ managed Kubernetes offerings.