Running Global Health Checks on Kubernetes Clusters

Learn how to run global health checks on Kubernetes clusters to optimize performance, prevent disruptions, and improve scalability.

Patrick Londa
Feb 28, 2022

Running through a Kubernetes global health checklist can go a long way toward preventing errors before they cause disruptions, and it can keep container performance aligned with your current scalability needs.

In this guide, we outline why monitoring your Kubernetes cluster health should be part of your DevOps strategy, and the steps you can take to check your cluster health.

Why It Matters

Consisting of a control plane node, at least one worker node, and all of the containers and Pods inside, a cluster comprises the entire workload for a given app development team, or even for an entire project. The ability to configure multiple clusters according to the needs of each department lets developers optimize the resources they invest in building their apps.

For example, a machine learning application may require a graphics processing unit (GPU) to function, which would not be necessary for other workloads like serving web traffic. Configuring each Kubernetes cluster to the needs of its department enables developers to use only the resources each project needs, and none that it doesn't. Failing to tailor each cluster in this way can leave it suboptimally configured, which can hinder app development.

The Key Criteria for Kubernetes Cluster Health

Following a Kubernetes global health checklist can help DevOps teams monitor their clusters' health, ensuring that each one runs at optimum capacity. Here are a few cluster events to watch for:

1. Resource Allocation

Both cluster nodes and Pods have minimum and maximum amounts of CPU and memory they are expected to consume. The minimums, called requests, are what the scheduler uses to decide where to place Pods, and they also factor into which Pods are evicted first when a node comes under resource pressure. The maximums, called limits, are enforced at the container runtime level: a container that exceeds its limits is throttled or terminated, which often shows up as a CrashLoopBackOff.

CPU is considered a compressible resource, so a container that exceeds its CPU limit is simply throttled. Memory is incompressible, so a container that exceeds its memory limit is terminated (OOMKilled). It is therefore important to assign an appropriate request and limit for both CPU and memory to each Pod within a cluster; otherwise, containers may be throttled or killed unexpectedly.
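
As an illustration, here is a minimal sketch that applies a Pod manifest with explicit requests and limits. The Pod name, image, and values below are placeholders rather than recommendations; tune them to your own workload:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo        # hypothetical Pod used only for illustration
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "250m"          # the scheduler only places the Pod on a node with this much free CPU
        memory: "128Mi"
      limits:
        cpu: "500m"          # CPU usage above this is throttled
        memory: "256Mi"      # memory usage above this gets the container OOMKilled
EOF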

2. Percent Usage

Once you have established the request and limit ranges for both the CPU and memory use, it is important to identify how much is consumed by each node and pod. This can be done by evaluating three parameters for both CPU and memory use: percent usage, percent requested, and percent limits.
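
kubectl can surface this consumption directly; the top subcommands below assume the metrics-server add-on is installed in the cluster, and <node-name> is a placeholder:

kubectl top nodes                   # current CPU and memory usage per node, with percentages
kubectl top pods --all-namespaces   # current CPU and memory usage per Pod
kubectl describe node <node-name>   # the "Allocated resources" section shows percent requested and percent limits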

A low usage rate means that you have allotted more CPU or memory than needed and could save by scaling back your requests and limits. A high usage percentage means that you may be operating close to full capacity and could struggle to scale or keep up with heavier loads. If the percent limit is lower than the percent requested, then you may not have assigned a limit to all of your Pods.

The kube-state-metrics metric kube_node_status_allocatable reports how much CPU and memory each node makes available to Pods, which helps developers estimate how many additional Pods can be added based on current usage trends. That way, developers know how much room they have to scale.
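
If you are not yet scraping kube-state-metrics, one way to pull the same allocatable figures straight from the API is:

kubectl get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods'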

3. Distribution

In addition to making sure that each node and pod is operating within the assigned computing limits, DevOps teams should also keep Pods relatively evenly distributed across all nodes.

An uneven distribution can result in some nodes being overloaded, and their containers possibly terminated, while the computing power available on other nodes goes unused. This can happen because of node affinity, where a property such as having a GPU or particular security features causes a disproportionate number of Pods to be scheduled onto one node. Conversely, taints applied to a node repel Pods that do not tolerate them, which can leave that node with fewer Pods than its capacity allows.

To get the most out of the computing power available, check your affinity rules and taints to make sure no Pods are disproportionately scheduled to certain nodes.
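
A quick way to eyeball the current distribution and the taints that influence it is with a couple of example one-liners like these:

# Count scheduled Pods per node
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | sort -rn

# List each node's taints (empty means no taints)
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'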

Running Health Checks with the Kubernetes Server API Endpoints

The Kubernetes API server exposes three health endpoints that can be used during a global health check. They are:

  • healthz, which indicates whether the API server is running, but has been deprecated since v1.16 in favor of the more specific endpoints below
  • livez, which indicates whether the API server is alive, and can be paired with the --livez-grace-period flag to account for startup duration
  • readyz, which indicates whether the API server is ready to serve traffic, and begins failing during shutdown so that traffic can be drained

When a machine checks the healthz, livez, or readyz endpoint of the API server, it should examine the HTTP status code: a 200 response indicates the API server is healthy, live, or ready, depending on which endpoint was called.
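
Because kubectl get --raw exits with a non-zero status when a request does not succeed, a small shell loop is enough for a scripted check; this is a minimal sketch:

for endpoint in healthz livez readyz; do
  if kubectl get --raw="/${endpoint}" > /dev/null 2>&1; then
    echo "${endpoint}: ok"
  else
    echo "${endpoint}: failing"
  fi
done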

When developers want to manually debug the status of the API server, they can run this command with the verbose parameter:

kubectl get --raw='/readyz?verbose'

The output then shows the full status details for the endpoint:


[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
readyz check passed

For more information on this type of debugging, see the Kubernetes documentation on API server health endpoints.

Keeping your Kubernetes cluster at optimum performance prevents you from wasting allotted computing power, and valuable business resources too. It also improves scalability and enhances efficiency across the board. Integrate this Kubernetes global health checklist into your DevOps strategy and improve your applications today.

Automating Kubernetes Health Checks with Blink

Running a health check with kubectl commands isn't hard, but it requires context-switching. If you want to run a health check regularly or trigger one after an alert, a little automation can save you significant time.
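
One way to approximate a quick manual spot-check is to pair the commands covered above with an events query like this one, which surfaces anything the cluster has flagged as abnormal:

# List recent events that are not of type Normal (warnings, failed scheduling, OOM kills, and so on)
kubectl get events --all-namespaces --field-selector type!=Normal --sort-by='.lastTimestamp'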

For example, you can use this automation out-of-the-box from the Blink library.

Blink Automation: Run Kubernetes Cluster Health Checklist

This Kubernetes health check automation does the following steps:

  1. Gets a list of abnormal events.
  2. Checks node resources and conditions.
  3. Checks pod resources and conditions.
  4. Checks the livez, readyz, and healthz API endpoints.

It's a simple automation, and that makes it easy to customize. For example, you can schedule it to run regularly or send the report information to a Slack channel or email.

You can get started with over 5K automations in the Blink library, or build your own custom automations to fit your unique workflow.

Get started with Blink today to see how easy automation can be.
