Retrospective

With core infrastructure services like Postgres and Valkey in place, secure service-to-service communication via Istio, cert-manager, and step-ca, full observability through Alloy, Loki, Grafana, Mimir, and Tempo, and uptime monitoring handled by OneUptime, this Kubernetes setup is in a solid state. This retrospective is a chance to step back and reflect on what went well, what I’d do differently, and what I learned from building and maintaining this system end to end.

Why I Started

This project started as a way to deepen my understanding of Kubernetes. I wanted hands-on experience with the kinds of tools and patterns used in real-world clusters—things like GitOps, service meshes, observability stacks, and production-grade secrets management. Rather than just reading docs or running isolated demos, I decided to build and maintain a full setup myself. The goal was to explore how these pieces fit together, what the operational pain points are, and how to solve them in practice.

Challenges I Hit

The biggest challenge by far was scope creep. What I expected to be a small learning project taking a month or two ended up spanning nearly a year. Not a full year of 40-hour weeks, but enough time to realize how quickly things can grow once you start trying to do it “the right way.”

One of the early constraints I placed on the project was avoiding reliance on external services—no external DNS, no public certificate authority, no cloud-managed Postgres. That decision forced me to go much deeper than I initially expected, adding services like step-ca, Harbor, and OpenBao just to fill in the missing infrastructure.

As the project grew, I also ran into real-world operational issues:

  • I had to scale up the cluster, both by adding worker nodes and resizing control plane nodes, just to handle the memory and CPU demands of newer workloads.

  • Deploying large applications like OneUptime was especially difficult. It runs fine once it’s live, but the initial ArgoCD sync spun up so many containers at once that it caused cascading failures across the cluster, leading to frequent instability during deploys.
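One way to tame that thundering-herd sync is ArgoCD sync waves, which gate each wave on the previous one becoming healthy. This is a sketch with illustrative names and specs omitted, not the actual OneUptime manifests:

```yaml
# Hypothetical example of staged rollout via ArgoCD sync waves.
# Wave 0 (the database) must sync and report healthy before wave 2
# (the app workers) starts, so containers come up in stages.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: oneuptime-postgres          # illustrative name; spec omitted
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oneuptime-worker            # illustrative name; spec omitted
  annotations:
    argocd.argoproj.io/sync-wave: "2"
```

Waves don’t reduce the total resource footprint, but they spread the startup spike out so the cluster isn’t scheduling everything simultaneously.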

In short, this turned into a much more production-like system than I’d planned, and with that came a lot of the complexity you’d expect in production too.

Surprises Along the Way

A few things ended up being much more complex than I initially expected.

  • Managing self-signed certificates was a bigger challenge than I’d planned for. I thought cert-manager and step-ca would cover most of the work, but integrating them into a working internal PKI—especially with Istio and ArgoCD—required a lot of careful configuration and troubleshooting around trust propagation.

  • Injecting OpenBao secrets into pods turned out to be another deep rabbit hole. Compared to cloud-native solutions like AWS Secrets Manager, there’s a lot more you have to handle yourself—authentication, syncing, and pod access patterns all required custom setup.

  • Learning Kyverno on the fly was another unexpected twist. To mount the internal CA certs into the OneUptime probe containers, I needed a policy-based solution that didn’t involve modifying upstream manifests. That led me to Kyverno, which worked well—but it wasn’t something I expected to need at the start.

  • ArgoCD doesn’t always “just work.” While GitOps made deployments cleaner overall, I ran into cases where ArgoCD couldn’t apply certain large CRDs because their manifests exceeded the size limit of the last-applied-configuration annotation used by client-side apply. Handling them meant patching resource overrides or manually configuring settings like ignoreDifferences, which added a layer of ongoing maintenance I hadn’t expected.
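The Kyverno piece, for example, boiled down to a mutation policy along these lines. This is a sketch: the policy and ConfigMap names, the namespace, and the container glob are assumptions, not my exact setup.

```yaml
# Hypothetical Kyverno policy that mounts an internal CA bundle into
# matching containers without touching the upstream manifests.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-internal-ca          # assumed name
spec:
  rules:
    - name: mount-ca-into-probes
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - oneuptime        # assumed namespace
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # (name) is a Kyverno conditional anchor: only containers
              # whose name matches the glob receive the volumeMount.
              - (name): "*probe*"
                volumeMounts:
                  - name: internal-ca
                    mountPath: /etc/ssl/certs/internal-ca.crt
                    subPath: ca.crt
            volumes:
              - name: internal-ca
                configMap:
                  name: internal-ca-bundle   # assumed ConfigMap with ca.crt
```

Because the mutation happens at admission time, the upstream Helm chart or manifests stay pristine, which is exactly what a GitOps workflow wants.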

These issues all seemed minor when I scoped the project, but each one ballooned into its own subproject—teaching me a lot in the process.

What I Learned

  • High-availability metrics ingestion is genuinely hard. Running a system like Mimir at any meaningful scale alongside Prometheus auto-discovering scrape targets puts serious pressure on the control plane, the ingesters, and every component in between. To support out-of-order sampling and consistent ingestion under load, I had to scale up not just the ingesters, but nearly every other component as well. Getting that system stable and performant required more tuning—and more hardware—than I expected. Cascading failures were frequent.

  • I became intimately familiar with the entire Kubernetes resource lifecycle. That includes how resources are created and updated, how mutating and validating webhooks interact with them, and how init containers, sidecars, and other patterns fit into the broader deployment model. I also had to dig into resource scheduling, especially as workloads grew unevenly across nodes. Introducing the Kubernetes descheduler helped alleviate under- and over-provisioning, and gave me hands-on experience with how the scheduler actually makes decisions—and what happens when those decisions age poorly over time.

  • I gained a real appreciation for why managed Kubernetes is expensive. I’ve always understood, in theory, that paying for GKE or EKS means offloading complexity—but doing everything manually, from storage (Longhorn) to secrets (OpenBao) to service meshes (Istio) and cert management (step-ca, cert-manager), gave me firsthand insight into just how much work those services are abstracting away. You’re not just paying for infrastructure—you’re paying for people who’ve already solved the hard parts.
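For reference, the descheduler behavior I leaned on can be sketched with a policy like this; the thresholds are illustrative, not the values I actually converged on:

```yaml
# Sketch of a descheduler policy that rebalances pods between
# under- and over-utilized nodes.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          # Nodes below all of these percentages count as underutilized...
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          # ...and pods are evicted from nodes above these, so the
          # scheduler can re-place them onto the quiet nodes.
          targetThresholds:
            cpu: 70
            memory: 70
            pods: 70
```

The descheduler only evicts; it relies on the regular scheduler to make the better placement decision the second time around, which is what makes it a good lens on how scheduling decisions age.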

What I’d Do Differently

If I were starting over, there are a few architectural and operational decisions I’d approach differently based on what I’ve learned:

  • Automate node hostname assignment. I initially set hostnames manually, but mapping IP addresses to hostnames and returning a node-specific config would have been easy to automate—and would’ve saved time and reduced human error during node provisioning.

  • Introduce a dev/test/prod cluster progression with stability checks. One of the biggest issues I faced was cluster instability when deploying new services. Having staged environments—even lightweight ones—would’ve helped catch resource spikes or bad manifests before they impacted core services. It wasn’t a dealbreaker for my use case, but it’s something I’d absolutely implement in a team setting.

  • Separate in-cluster and out-of-cluster observability. When the LGTM stack went down, I lost major visibility into what was happening inside the cluster. Next time, I’d push metrics and logs to an external “SRE” cluster with different retention rules, so I’d have a fallback view even if the main observability tools failed.

  • Choose larger nodes from the start. I picked 4–6 vCPU nodes for cost efficiency, but the time I lost dealing with scheduling pressure, resource contention, and scaling issues far outweighed any savings. The “cheapest per core” option wasn’t the best choice once the system started scaling.
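The external-observability idea above amounts to adding a second remote-write target. In my stack Alloy would express this in its own configuration syntax, but a Prometheus-style YAML sketch is easier to show; the URLs, tenant name, and secret path are all placeholders:

```yaml
# Hypothetical config mirroring metrics to an external "SRE" cluster
# so visibility survives an in-cluster LGTM outage.
remote_write:
  # Primary: the in-cluster Mimir that feeds Grafana.
  - url: http://mimir-distributor.mimir.svc:8080/api/v1/push
  # Fallback: an external cluster with its own retention rules.
  - url: https://mimir.sre.example.com/api/v1/push
    basic_auth:
      username: homelab
      password_file: /etc/secrets/remote-write-password
```

The fallback target can run with short retention and coarse scrape intervals; it only needs to answer “what is happening right now” when the main stack is down.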

Final Thoughts

From an educational perspective, this project was absolutely worth it. I learned more than I expected—both about Kubernetes itself and about the hidden complexity behind running production-grade infrastructure.

I also met the goals I set at the start. I have a working setup that I understand inside and out, and I plan to use this current cluster as the foundation for a v2 with proper dev/test/prod environments and cleaner operational boundaries.

If I had one piece of advice for others thinking about doing something similar: this was hard. Like, really hard. If your goal is simply to get something into production, you don’t need to go this deep. It’s easy to read polished articles or GitHub READMEs and assume everything “just works,” but the truth is, I spent a lot of time debugging weird edge cases and chasing down failures. That was part of the learning—but it’s important to know what you’re signing up for.


(Update: I've written about a successor to this cluster, as I've migrated away from Contabo)