LOONA Cluster
Following my decision to migrate away from Contabo, I began planning my next infrastructure iteration. Before committing to Ampere-based hardware for cost savings, I needed to validate that my existing workloads would run on arm64 without issues. So, I provisioned a 12-node Kubernetes cluster using exclusively arm64 instances from Netcup—three control plane nodes and nine workers.
Intentional Naming
My first cluster grew organically—named after various K-pop girl groups, expanding with new node pools as I discovered more workloads to run. It worked, but for this rebuild I'm starting with proper capacity planning: a 12-node cluster sized for what I actually need.
Keeping the spirit of the prior cluster, I chose to name nodes after members of LOONA. Unlike decorative naming schemes I've encountered before (planets, mythology, Star Wars), this encoding is systematic: member names map directly to infrastructure topology, encoding operational information:
- Subunits = Availability zones (LOONA 1/3, ODD EYE CIRCLE, yyxy)
- Subunit leaders = Control plane nodes (heejin, kimlip, yves)
- Members = Worker nodes within their respective zones
When kimlip has issues in monitoring, I immediately know: it's a control plane node in the ODD EYE CIRCLE zone.
I can quickly assess whether it's a single node problem, zone-wide networking issue, or control plane instability.
With generic names like k8s-node-7, you lose that context.
This also makes HA configurations more intuitive. Setting up Mimir with zone-aware replication? I can immediately verify ingesters are distributed across LOONA 1/3, ODD EYE CIRCLE, and yyxy zones without cross-referencing documentation.
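For concreteness, zone awareness on the Mimir side comes down to two ring settings. A minimal config sketch, written from memory rather than copied from my actual values (the zone names follow my subunit scheme and are illustrative):

```yaml
# Mimir config fragment (illustrative, not my full config).
# Each ingester advertises its zone; the ring then spreads
# replicas across distinct zones instead of distinct nodes.
ingester:
  ring:
    replication_factor: 3
    zone_awareness_enabled: true
    # set per ingester group, e.g. via a per-zone values override
    instance_availability_zone: odd-eye-circle
```

With that in place, checking the ring page (or `kubectl get pods -o wide`) against the three subunit names is all the verification needed.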
This matters at 2am.
When you're half-awake troubleshooting an outage, yves failing immediately surfaces its role and location without decoding arbitrary numbers or looking up mappings.
The human brain retrieves associated information—subunit, position in the group, relationships to other members—faster than it parses control-plane-2-us-west-2b.
I'm encoding meaning into fewer characters and leveraging how memory actually works.
No context switching, no lookup tables, just pattern recognition.
Of course, this information is also encoded in Kubernetes labels (role=control-plane, zone=odd-eye-circle, etc.) for proper workload scheduling and automation. But labels are for machines—names are for humans staring at dashboards during incidents.
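Since Talos machine configs support node labels directly, each node can register with its topology already attached. A sketch of what that patch might look like (the label keys are my own convention, not anything standard):

```yaml
# Talos machine config patch for kimlip (illustrative values)
machine:
  nodeLabels:
    role: control-plane
    zone: odd-eye-circle
    # the well-known topology label, so zone-aware scheduling
    # and topologySpreadConstraints work out of the box
    topology.kubernetes.io/zone: odd-eye-circle
```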
Cluster Setup
I've written about Kubernetes setup with Talos Linux in detail before, so I won't rehash that here.
The main difference this time is that I'm using OpenTofu instead of applying configs directly via talosctl.
Using OpenTofu for Talos configuration gives you the usual IaC benefits, but a few things are particularly nice:
- Autogenerated configs as outputs. Run `terraform apply` and get your talosconfig and kubeconfig ready to use immediately—no separate generation steps or manual copying.
- Loop over node pools with `for_each`. Instead of running `talosctl apply-config` 12 times, define your node pools once and let Terraform handle the iteration. Control plane nodes, worker nodes across zones—all managed declaratively.
- Patches as code. Custom PKI, registry mirrors, sysctls, whatever—it's all versioned alongside your infrastructure definitions. Changes are reviewable, repeatable, and don't rely on remembering which talosctl patch commands you ran six months ago.
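The node-pool loop can be sketched roughly as follows; the locals and resource names are my own for illustration, while the resources themselves come from the siderolabs/talos provider:

```hcl
locals {
  # one entry per node; zones mirror the subunit naming scheme
  node_pools = {
    heejin = { role = "controlplane", zone = "loona-13",       internal_ip = "10.0.1.10" }
    kimlip = { role = "controlplane", zone = "odd-eye-circle", internal_ip = "10.0.2.10" }
    # ... remaining members
  }
}

resource "talos_machine_configuration_apply" "nodes" {
  for_each = local.node_pools

  node                        = each.value.internal_ip
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.this[each.value.role].machine_configuration
}
```

One `terraform apply` then configures all 12 nodes, and adding a node is a one-line change to the map.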
Here are the key patches I'm applying to the base Talos configuration:
- The custom PKI and DNS configuration for resolving my `.testlab.kube` domain:

  ```hcl
  yamlencode({
    apiVersion   = "v1alpha1"
    kind         = "TrustedRootsConfig"
    name         = "testlab-key"
    certificates = <<-EOT
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    EOT
    machine = {
      network = {
        nameservers = ["xx.xx.xx.xx"]
      }
    }
  })
  ```

- Registry mirrors are configured to use my Harbor instance on the old cluster:

  ```hcl
  # nested under machine.registries
  mirrors = {
    dockerhub = {
      endpoints    = ["https://harbor.prod.service.testlab.kube/v2/proxy-cache-dockerhub"]
      overridePath = true
    }
    "docker.io" = { ... }
    # repeat for gcr/ghcr, quay, etc.
  }
  ```

- The network interface configuration specific to my hosting provider:

  ```hcl
  # nested under machine.network
  interfaces = [
    {
      interface = "enp9s0"
      addresses = ["${each.value.internal_ip}/24"]
      routes = [
        {
          network = "10.0.0.0/8"
          gateway = "10.132.0.1"
        }
      ]
    },
    {
      interface = "enp7s0"
      routes = [
        {
          network = "0.0.0.0/0"
          gateway = "xx.xx.xx.xx"
        }
      ]
    }
  ]
  ```
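Each of these patches ends up as an entry in the provider's `config_patches` list, templated per node. A trimmed sketch, with required arguments elided and names of my own choosing:

```hcl
resource "talos_machine_configuration_apply" "worker" {
  for_each = local.workers
  # ... node, client_configuration, machine_configuration_input ...

  config_patches = [
    # per-node network patch, filled in from the pool definition
    yamlencode({
      machine = {
        network = {
          interfaces = [{
            interface = "enp9s0"
            addresses = ["${each.value.internal_ip}/24"]
          }]
        }
      }
    }),
  ]
}
```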
Bootstrapping Initial Workloads
The bootstrapping process is largely the same as what I've covered in previous posts, so I'll keep this brief.
Helmfile is still my tool of choice for initial cluster setup—declarative, version-controlled, and handles dependencies cleanly. The main change: I'm using Cilium this time instead of Istio. I want the eBPF-based networking and observability capabilities without the service mesh overhead. For my workloads, Cilium's kernel-level visibility and performance characteristics are a better fit.
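A minimal helmfile excerpt for the Cilium piece might look like this; the version and values are illustrative, and Talos needs a few extra Cilium settings I'm omitting here:

```yaml
# helmfile.yaml (excerpt)
repositories:
  - name: cilium
    url: https://helm.cilium.io/

releases:
  - name: cilium
    namespace: kube-system
    chart: cilium/cilium
    version: 1.16.5   # illustrative; pin whatever you've tested
    values:
      - kubeProxyReplacement: true
        hubble:
          enabled: true   # the eBPF-based observability layer
```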
ArgoCD bootstrapping remains unchanged. Helmfile gets ArgoCD onto the cluster, then ArgoCD takes over for everything else. Once it's running, the cluster manages itself.
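The handoff follows the usual app-of-apps pattern: Helmfile installs ArgoCD plus a single root Application, and that Application pulls in everything else. A sketch of the root manifest, with the repo URL and path as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/loona/cluster.git  # placeholder
    targetRevision: main
    path: apps   # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # delete resources removed from git
      selfHeal: true  # revert manual drift
```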
Performance Improvements
Once the workloads were migrated over, the improvement was immediately noticeable.
The most measurable improvement shows up in Mimir metrics. P99 query latency dropped from ~1 second with regular 5-second spikes down to 100-200ms with minimal spiking. More importantly, P50 and average latency converged—they're now nearly identical and stable, because the P99 isn't dragging the average up anymore.
Network throughput made the biggest difference. Contabo caps at 100Mbps; the new provider gives me 2.5Gbps. The old bottleneck hit Mimir's ingester/distributor pipeline hard—I had to run extra replicas just to spread traffic across more network interfaces and avoid choking on writes. With 2.5Gbps available, I can run fewer replicas and there's significantly less contention. Query performance improved, but so did internal cluster communication. Mimir's hash ring membership is now stable because gossip traffic isn't fighting for bandwidth with actual metrics ingestion.
Higher disk write speeds also stabilized the in-cluster object storage. No more backpressure during compaction or block uploads causing cascading delays.
Closing Thoughts
Building a cluster with a clear vision is always cleaner. Capacity planning from the start, naming conventions that encode operational context, declarative infrastructure with OpenTofu, and actual hardware that doesn't bottleneck at 100Mbps—it all adds up.
The metrics speak for themselves, but beyond the numbers, there's something satisfying about infrastructure that just works.