LOONA Cluster
Following my decision to migrate away from Contabo, I began planning my next infrastructure iteration. Before committing to Ampere-based hardware for cost savings, I needed to validate that my existing workloads would run on arm64 without issues. So, I provisioned a 12-node Kubernetes cluster using exclusively arm64 instances from Netcup—three control plane nodes and nine workers.
Intentional Naming
My first cluster grew organically—named after various K-pop girl groups, expanding with new node pools as I discovered more workloads to run. It worked, but for this rebuild I'm starting with proper capacity planning: a 12-node cluster sized for what I actually need.
Keeping the spirit of the prior cluster, I chose to name nodes after members of LOONA. Unlike decorative naming schemes I've encountered before (planets, mythology, Star Wars), this encoding is systematic: member names map directly to infrastructure topology, encoding operational information:
- Subunits = Availability zones (LOONA 1/3, ODD EYE CIRCLE, yyxy)
- Subunit leaders = Control plane nodes (heejin, kimlip, yves)
- Members = Worker nodes within their respective zones
When kimlip has issues in monitoring, I immediately know: it's a control plane node in the ODD EYE CIRCLE zone.
I can quickly assess whether it's a single node problem, zone-wide networking issue, or control plane instability.
With generic names like k8s-node-7, you lose that context.
This also makes HA configurations more intuitive. Setting up Mimir with zone-aware replication? I can immediately verify ingesters are distributed across LOONA 1/3, ODD EYE CIRCLE, and yyxy zones without cross-referencing documentation.
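For concreteness, zone awareness on the Mimir side comes down to two ring settings. A minimal config sketch, written from memory rather than copied from my actual values (the zone names follow my subunit scheme and are illustrative):

```yaml
# Mimir config fragment (illustrative, not my full config).
# Each ingester advertises its zone; the ring then spreads
# replicas across distinct zones instead of distinct nodes.
ingester:
  ring:
    replication_factor: 3
    zone_awareness_enabled: true
    # set per ingester group, e.g. via a per-zone values override
    instance_availability_zone: odd-eye-circle
```

With that in place, checking the ring page (or `kubectl get pods -o wide`) against the three subunit names is all the verification needed.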
This matters at 2am.
When you're half-awake troubleshooting an outage, yves failing immediately surfaces its role and location without decoding arbitrary numbers or looking up mappings.
The human brain retrieves associated information—subunit, position in the group, relationships to other members—faster than it parses control-plane-2-us-west-2b.
I'm encoding meaning into fewer characters and leveraging how memory actually works.
No context switching, no lookup tables, just pattern recognition.
Of course, this information is also encoded in Kubernetes labels (role=control-plane, zone=odd-eye-circle, etc.) for proper workload scheduling and automation. But labels are for machines—names are for humans staring at dashboards during incidents.
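Since Talos machine configs support node labels directly, each node can register with its topology already attached. A sketch of what that patch might look like (the label keys are my own convention, not anything standard):

```yaml
# Talos machine config patch for kimlip (illustrative values)
machine:
  nodeLabels:
    role: control-plane
    zone: odd-eye-circle
    # the well-known topology label, so zone-aware scheduling
    # and topologySpreadConstraints work out of the box
    topology.kubernetes.io/zone: odd-eye-circle
```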
Cluster Setup
I've written about Kubernetes setup with Talos Linux in detail before, so I won't rehash that here.
The main difference this time is that I'm using OpenTofu instead of applying configs directly via talosctl.
Using OpenTofu for Talos configuration gives you the usual IaC benefits, but a few things are particularly nice:
- Autogenerated configs as outputs. Run `terraform apply` and get your talosconfig and kubeconfig ready to use immediately—no separate generation steps or manual copying.
- Loop over node pools with `for_each`. Instead of running `talosctl apply-config` 12 times, define your node pools once and let Terraform handle the iteration. Control plane nodes, worker nodes across zones—all managed declaratively.
- Patches as code. Custom PKI, registry mirrors, sysctls, whatever—it's all versioned alongside your infrastructure definitions. Changes are reviewable, repeatable, and don't rely on remembering which talosctl patch commands you ran six months ago.
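The node-pool loop can be sketched roughly as follows; the locals and resource names are my own for illustration, while the resources themselves come from the siderolabs/talos provider:

```hcl
locals {
  # one entry per node; zones mirror the subunit naming scheme
  node_pools = {
    heejin = { role = "controlplane", zone = "loona-13",       internal_ip = "10.0.1.10" }
    kimlip = { role = "controlplane", zone = "odd-eye-circle", internal_ip = "10.0.2.10" }
    # ... remaining members
  }
}

resource "talos_machine_configuration_apply" "nodes" {
  for_each = local.node_pools

  node                        = each.value.internal_ip
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.this[each.value.role].machine_configuration
}
```

One `terraform apply` then configures all 12 nodes, and adding a node is a one-line change to the map.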
Here are the key patches I'm applying to the base Talos configuration:
- The custom PKI and DNS configuration for resolving my `.testlab.kube` domain:

  ```hcl
  yamlencode({
    apiVersion   = "v1alpha1"
    kind         = "TrustedRootsConfig"
    name         = "testlab-key"
    certificates = <<-EOT
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    EOT
    machine = {
      network = {
        nameservers = ["xx.xx.xx.xx"]
      }
    }
  })
  ```

- Registry mirrors are configured to use my Harbor instance on the old cluster:

  ```hcl
  # nested under machine.registries
  mirrors = {
    dockerhub = {
      endpoints    = ["https://harbor.prod.service.testlab.kube/v2/proxy-cache-dockerhub"]
      overridePath = true
    }
    "docker.io" = { ... }
    # repeat for gcr/ghcr, quay, etc.
  }
  ```

- The network interface configuration specific to my hosting provider:

  ```hcl
  # nested under machine.network
  interfaces = [
    {
      interface = "enp9s0"
      addresses = ["${each.value.internal_ip}/24"]
      routes = [
        {
          network = "10.0.0.0/8"
          gateway = "10.132.0.1"
        }
      ]
    },
    {
      interface = "enp7s0"
      routes = [
        {
          network = "0.0.0.0/0"
          gateway = "xx.xx.xx.xx"
        }
      ]
    }
  ]
  ```
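Each of these patches ends up as an entry in the provider's `config_patches` list, templated per node. A trimmed sketch, with required arguments elided and names of my own choosing:

```hcl
resource "talos_machine_configuration_apply" "worker" {
  for_each = local.workers
  # ... node, client_configuration, machine_configuration_input ...

  config_patches = [
    # per-node network patch, filled in from the pool definition
    yamlencode({
      machine = {
        network = {
          interfaces = [{
            interface = "enp9s0"
            addresses = ["${each.value.internal_ip}/24"]
          }]
        }
      }
    }),
  ]
}
```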
Bootstrapping Initial Workloads
The bootstrapping process is largely the same as what I've covered in previous posts, so I'll keep this brief.
Helmfile is still my tool of choice for initial cluster setup—declarative, version-controlled, and handles dependencies cleanly. The main change: I'm using Cilium this time instead of Istio. I want the eBPF-based networking and observability capabilities without the service mesh overhead. For my workloads, Cilium's kernel-level visibility and performance characteristics are a better fit.
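A minimal helmfile excerpt for the Cilium piece might look like this; the version and values are illustrative, and Talos needs a few extra Cilium settings I'm omitting here:

```yaml
# helmfile.yaml (excerpt)
repositories:
  - name: cilium
    url: https://helm.cilium.io/

releases:
  - name: cilium
    namespace: kube-system
    chart: cilium/cilium
    version: 1.16.5   # illustrative; pin whatever you've tested
    values:
      - kubeProxyReplacement: true
        hubble:
          enabled: true   # the eBPF-based observability layer
```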
ArgoCD bootstrapping remains unchanged. Helmfile gets ArgoCD onto the cluster, then ArgoCD takes over for everything else. Once it's running, the cluster manages itself.
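The handoff follows the usual app-of-apps pattern: Helmfile installs ArgoCD plus a single root Application, and that Application pulls in everything else. A sketch of the root manifest, with the repo URL and path as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/loona/cluster.git  # placeholder
    targetRevision: main
    path: apps   # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # delete resources removed from git
      selfHeal: true  # revert manual drift
```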
Performance Improvements
Once the workloads were migrated over, the improvement was immediately noticeable.
The most measurable improvement shows up in Mimir metrics. P99 query latency dropped from ~1 second with regular 5-second spikes down to 100-200ms with minimal spiking. More importantly, P50 and average latency converged—they're now nearly identical and stable, because the P99 isn't dragging the average up anymore.
Network throughput made the biggest difference. Contabo caps at 100Mbps; the new provider gives me 2.5Gbps. The old bottleneck hit Mimir's ingester/distributor pipeline hard—I had to run extra replicas just to spread traffic across more network interfaces and avoid choking on writes. With 2.5Gbps available, I can run fewer replicas and there's significantly less contention. Query performance improved, but so did internal cluster communication. Mimir's hash ring membership is now stable because gossip traffic isn't fighting for bandwidth with actual metrics ingestion.
Higher disk write speeds also stabilized the in-cluster object storage. No more backpressure during compaction or block uploads causing cascading delays.
Closing Thoughts
Building a cluster with a clear vision is always cleaner. Capacity planning from the start, naming conventions that encode operational context, declarative infrastructure with OpenTofu, and actual hardware that doesn't bottleneck at 100Mbps—it all adds up.
The metrics speak for themselves, but beyond the numbers, there's something satisfying about infrastructure that just works.