Stabilizing Kubernetes via Policies

In my resource-constrained cluster consisting of 12 workers (ranging from 4 to 12 CPU cores), cascading failures due to NotReady nodes were a recurring risk: when one node went down, its pods were rescheduled onto other nodes, sometimes pushing them over capacity and causing additional nodes to fail. Manually adding CPU and memory limits to every pod—particularly in a Helm-driven environment where charts often omit them—was both time-consuming and error-prone, making a scalable, automated solution essential.

Goals

The objective of this effort was to ensure cluster stability by enforcing consistent resource usage policies with minimal operational overhead. Specifically, the solution needed to:

Guarantee that every pod running in the cluster has CPU and memory limits set, with few exceptions.
Automate the creation and enforcement of these limits, removing the need for manual intervention.
Provide alerts when default limits are likely insufficient, prompting review before they cause issues.
Operate transparently, requiring no additional work from developers or operators once configured.

Challenges

This solution must address two distinct aspects of pod lifecycle and stability:

Eviction Behavior

Pods must be classified by their tolerance to node pressure. The system allows operators to declare which workloads are safe to evict when a node comes under resource pressure. Note that the kubelet only evicts for incompressible resources such as memory, disk, and PIDs; CPU contention leads to throttling rather than eviction. Eviction policies are applied automatically, ensuring critical pods remain running while non-essential workloads are terminated predictably.

OOMKill Behavior

A container that exceeds its memory limit is OOM-killed, and when the node itself becomes oversubscribed the Linux OOM killer selects victims by their oom_score_adj. Kubernetes derives that score from the pod's QoS class, so Burstable pods in particular may be terminated even if their usage is within reasonable bounds. The solution enforces consistent limit policies and adjusts pod priorities to minimize unnecessary OOM kills while maintaining predictable node behavior.
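
To make the link between requests, limits, and OOM behavior concrete, here is a minimal sketch of three pods, one per QoS class. The pod names, images, and resource values are hypothetical; the oom_score_adj values for Guaranteed and BestEffort are the ones the kubelet assigns.

```yaml
# Guaranteed: every container has limits equal to requests.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed                         # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0     # placeholder image
      resources:
        requests: { cpu: 500m, memory: 256Mi }
        limits:   { cpu: 500m, memory: 256Mi } # oom_score_adj: -997
---
# Burstable: requests are set, but limits are higher (or partly missing).
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests: { cpu: 100m, memory: 128Mi }
        limits:   { cpu: 500m, memory: 512Mi } # score sits between the other two classes;
                                               # a smaller memory request gives a higher score
---
# BestEffort: no requests or limits at all.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort                         # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0     # oom_score_adj: 1000, killed first
```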

Solution Approach

The solution enforces stability by classifying pods into priority tiers, dictating eviction and OOMKill behavior. Pods are grouped as:

High Priority: critical system components and telemetry/metrics generators (e.g., Longhorn, kube-system pods, Grafana Beyla, OpenTelemetry Collector). These should be preserved under nearly all circumstances. They are deployed as DaemonSets, so each pod is tied to its node: if it is evicted, it cannot be rescheduled onto a different node.
Medium Priority: stateless, replicated workloads such as metrics ingesters or database replicas. Individual pods may be evicted without affecting overall system behavior.
Low Priority: auxiliary components with low activity or resource demands (e.g., certain webhooks, cert-manager). These can be safely terminated or rescheduled, as they are stateless and designed to tolerate temporary downtime.

Considering the goals and the tiered priorities, the solution requires both mutation and validation of pod resources. Kyverno performs the mutations, building on its existing role of injecting CA certificates: it automatically applies CPU and memory requests and limits, which in turn determine each pod's QoS class. OPA/Gatekeeper handles validation, ensuring that all pods comply with the defined policies and that the critical goals, such as preserving QoS classes and respecting priority tiers, are enforced consistently across the cluster.

Implementation

The implementation focuses on keeping the system “low-touch,” minimizing operational overhead while enforcing consistent policies. Pods are treated as low priority by default; an operator only adds a simple annotation when a pod should be classified as medium or high priority. Everything else (limit mutation, QoS assignment, and policy validation) is handled automatically by Kyverno and Gatekeeper.
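
As an illustration of how little a workload owner has to do, the sketch below opts a Deployment into the medium tier. The annotation key `example.com/priority-tier`, the workload name, and the image are all hypothetical; the actual key is whatever the Kyverno policy matches on.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-ingester                       # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: metrics-ingester
  template:
    metadata:
      labels:
        app: metrics-ingester
      annotations:
        # Hypothetical annotation key; pods without it stay in the low tier.
        example.com/priority-tier: "medium"
    spec:
      containers:
        - name: ingester
          image: registry.example.com/ingester:1.0   # placeholder image
          # No resources block here: the mutation policy is expected to fill it in.
```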

High Priority

High-priority pods are preserved under nearly all conditions. The implementation enforces this as follows:

A PriorityClass is defined for high-priority workloads:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-preempt-non-system
value: 999999
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Preempt all non-system workloads but not system pods"
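
A workload picks up this class through `priorityClassName` in its pod template. The following is a minimal sketch, assuming the class is referenced directly in the manifest; the DaemonSet name, labels, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telemetry-agent                        # hypothetical workload
spec:
  selector:
    matchLabels:
      app: telemetry-agent
  template:
    metadata:
      labels:
        app: telemetry-agent
    spec:
      # Must match metadata.name of the PriorityClass defined above.
      priorityClassName: high-preempt-non-system
      containers:
        - name: agent
          image: registry.example.com/agent:1.0   # placeholder image
```

Pods that set no `priorityClassName` get priority 0 as long as no PriorityClass is marked `globalDefault: true`, which is why both classes here set it to false.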

A Kyverno mutation policy triggers only for pods annotated as high-priority. If CPU or memory limits are not explicitly set, the policy sets them equal to the pod’s requests:

    - name: set-default-limits
      match:
        ...
      mutate:
        foreach:
          - list: "request.object.spec.containers"
            patchStrategicMerge:
              spec:
                containers:
                  - (name): "{{ element.name }}"
                    resources:
                      requests:
                        +(cpu): "{{ element.resources.requests.cpu || '100m' }}"
                        +(memory): "{{ element.resources.requests.memory || '150Mi' }}"
                      limits:
                        +(cpu): "{{ element.resources.limits.cpu || element.resources.requests.cpu || '100m' }}"
                        +(memory): "{{ element.resources.limits.memory || element.resources.requests.memory || '150Mi' }}"

Because the rule uses Kyverno's add-if-absent anchor, +(key), it only fills in values that are missing: CPU and memory requests and limits are populated by default, while developers can still override them by setting specific values explicitly.
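
To show what the rule does in practice, here is a worked before/after sketch for two hypothetical containers, one with requests only and one with no resources at all. Only the relevant part of the pod spec is shown, and the container names are illustrative.

```yaml
# Containers as submitted
containers:
  - name: collector
    resources:
      requests: { cpu: 250m, memory: 256Mi }   # requests only, no limits
  - name: sidecar                              # no resources at all
---
# Containers after the mutation
containers:
  - name: collector
    resources:
      requests: { cpu: 250m, memory: 256Mi }   # already present, left unchanged
      limits:   { cpu: 250m, memory: 256Mi }   # copied from the requests
  - name: sidecar
    resources:
      requests: { cpu: 100m, memory: 150Mi }   # fallback defaults from the rule
      limits:   { cpu: 100m, memory: 150Mi }   # same fallback, so limits == requests
```

In both cases limits end up equal to requests, which is what places the pod in the Guaranteed QoS class.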

Finally, a Gatekeeper/OPA ConstraintTemplate and its corresponding Constraint ensure that all high-priority pods end up in the intended QoS class, which is Guaranteed when limits equal requests. Pods violating this policy are rejected, guaranteeing that critical workloads always retain their intended priority.
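
The validation side could look roughly like the sketch below. It is an assumption-laden illustration, not the policy actually deployed: the template and constraint names are hypothetical, high-priority pods are identified by their `priorityClassName`, and quantities are compared as plain strings (so `1` and `1000m` would count as different).

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sguaranteedqos                       # hypothetical; must be the lowercased kind
spec:
  crd:
    spec:
      names:
        kind: K8sGuaranteedQos
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sguaranteedqos

        # Reject high-priority pods with a container whose CPU or memory
        # limit is missing or differs from its request.
        violation[{"msg": msg}] {
          input.review.object.spec.priorityClassName == "high-preempt-non-system"
          container := input.review.object.spec.containers[_]
          resource := ["cpu", "memory"][_]
          not limit_equals_request(container, resource)
          msg := sprintf("container %v: %v limit must be set and equal to its request", [container.name, resource])
        }

        # Undefined when either value is missing, which counts as a violation above.
        limit_equals_request(container, resource) {
          container.resources.limits[resource] == container.resources.requests[resource]
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sGuaranteedQos
metadata:
  name: high-priority-guaranteed-qos           # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```

Because validating admission runs after mutation, pods whose limits were just filled in by Kyverno pass this check, while pods that explicitly set mismatched values are rejected.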

Medium Priority

Medium-priority pods are stateless, replicated workloads that can tolerate individual evictions. Implementation is similar to High Priority:

A PriorityClass is defined with a lower value than high-priority pods:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 9999
preemptionPolicy: PreemptLowerPriority
globalDefault: false

A Kyverno mutation policy sets CPU and memory requests and limits only if they are not already defined. Finally, a Gatekeeper constraint template and constraint enforce that these pods remain in the Burstable QoS class, ensuring consistent behavior while allowing oversubscription without risking critical workloads.

Low Priority

Low-priority pods require no mutation or validation, as they are the default classification. Pods that happen to land in the Burstable or Guaranteed QoS classes are not blocked by Gatekeeper, avoiding unnecessary friction. Because their implicit priority is lower than that of any pod explicitly classified as medium or high priority, these workloads are naturally the first candidates for eviction or OOMKill under resource pressure.

Alerting

To ensure that default resource limits remain appropriate, alerts were added to the cluster. Alerting was already in place via Mimir, with rules registered automatically through Alloy, providing a centralized and GitOps-friendly workflow. Container-level metrics from the kubelet’s /metrics/cadvisor endpoint are scraped along with kube-state-metrics, giving visibility into CPU and memory usage for every pod.

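A PrometheusRule was created to trigger alerts when actual usage runs up against configured limits. For CPU, the expression below fires when a container was throttled in at least 80% of its CFS scheduling periods over the last five minutes: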

      sum by (namespace, pod, container) (
        rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
      )
      /
      sum by (namespace, pod, container) (
        rate(container_cpu_cfs_periods_total{container!=""}[5m])
      )
      >= 0.8

Additionally, the Kyverno mutation policy applies a label to each pod whose resource limits it sets. Alert rules can use this label to tell pods running on injected defaults apart from pods with explicitly configured limits, which avoids false positives.
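
For context, the throttling expression from this section could be wrapped in a complete PrometheusRule along the lines of the sketch below. The rule names, `for` durations, severity labels, and the 90% memory threshold are assumptions for illustration; the memory rule is a hypothetical counterpart built from the cAdvisor and kube-state-metrics series mentioned above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: container-resource-limits              # hypothetical name
spec:
  groups:
    - name: resource-limits
      rules:
        - alert: ContainerCpuThrottlingHigh    # hypothetical alert name
          expr: |
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
            )
            /
            sum by (namespace, pod, container) (
              rate(container_cpu_cfs_periods_total{container!=""}[5m])
            )
            >= 0.8
          for: 15m                             # assumed; avoids firing on short bursts
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is heavily CPU-throttled"
        - alert: ContainerMemoryNearLimit      # hypothetical memory counterpart
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{container!=""}
            )
            /
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{resource="memory"}
            )
            >= 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is close to its memory limit"
```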

Results / Impact

Implementing automated resource limits significantly increased overall cluster stability. Since deployment, no node has entered the NotReady state due to resource exhaustion.

Certain side effects were observed: a number of existing pods, when restarted, did not initially have sufficient resources to operate optimally. This outcome was anticipated, and the alerts triggered for these pods showed exactly which workloads required adjustment. With historical usage data collected via cAdvisor, tuning these limits manually was straightforward, and there was no long-term impact on cluster stability.

Moving forward, the only minor trade-off is that newly installed workloads may experience temporary issues, such as CPU throttling or OOM kills, even when otherwise correctly configured. This is an acceptable compromise, as maintaining cluster-wide stability remains the priority.
