Observability

With the cluster running and services exposed, I had no visibility into what was actually happening inside. Time to fix that.

Design

Here's the high-level observability architecture running in my Kubernetes cluster:

graph LR
  subgraph deployment
    alloy
  end
  subgraph "node"
    subgraph userspace
      kubelet
      containerd
    end
    subgraph bar[daemonset]
      otelcol[OTEL collector]
      beyla
    end
    pod[application pod]
  end
  subgraph monitoring
    tempo
    mimir
    loki
  end
  pod -.trace info.-> beyla
  beyla --traces, metrics, log association--> otelcol
  otelcol --traces--> tempo
  containerd --granular container metrics--> otelcol
  otelcol --granular container metrics--> mimir
  otelcol --RED metrics--> mimir
  otelcol --log association---> loki
  kubelet --cadvisor metrics--> alloy
  alloy --pod metrics--> mimir
  pod --container logs, metrics--> alloy
  alloy --container logs--> loki
  tempo --service graphs--> mimir

The observability pipeline in my cluster is designed around the idea that different types of data have different collection needs. Grafana Alloy runs in clustered mode as a Deployment, using discovery.kubernetes blocks to discover targets and scrape pod metrics (read here about why I'm using Alloy instead of Prometheus). I also decided to route log collection through Alloy. While logs could technically be gathered locally on each node, Alloy's log processing features, e.g. structured metadata creation, were good enough that it made sense to consolidate that work there.
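As a sketch of what that pipeline looks like in Alloy config (component labels and the Mimir endpoint here are illustrative placeholders, not my exact setup):

```alloy
// Discover every pod in the cluster via the Kubernetes API.
discovery.kubernetes "pods" {
  role = "pod"
}

// Scrape the discovered targets; clustering lets the Alloy replicas
// share the target list between them.
prometheus.scrape "pods" {
  clustering {
    enabled = true
  }
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Remote-write to Mimir; the URL is a placeholder.
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-nginx.monitoring.svc/api/v1/push"
  }
}
```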

Not all data benefits from this centralized approach, though. For high-volume or latency-sensitive sources, I use node-local collection to minimize cross-node traffic. The OpenTelemetry Collector runs as a DaemonSet and scrapes containerd metrics directly on each node. Beyla, also running as a DaemonSet, attaches via eBPF to capture application traces without requiring code instrumentation, forwarding traces and metrics to their respective storage backends through the OpenTelemetry collector.
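The containerd scrape on each node can be sketched roughly like this (assumptions: containerd's metrics endpoint has been enabled in /etc/containerd/config.toml via the [metrics] block, here at 127.0.0.1:1338, and the collector runs with host network access):

```yaml
# Node-local containerd scrape - a sketch, not my exact config.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: containerd
          static_configs:
            - targets: ["127.0.0.1:1338"]  # containerd [metrics] address
```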

One nuance in this design is how trace-derived metrics from Beyla are handled. The series generated by Beyla account for roughly 50% of everything ingested by the cluster, so they're stored under a separate Mimir tenant. This will allow for per-tenant overrides in the future, if necessary.
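One way to express that split on the collector side (a sketch; the tenant name and endpoint are illustrative) is to set Mimir's tenant header on a dedicated remote-write exporter:

```yaml
exporters:
  prometheusremotewrite/beyla:
    endpoint: http://mimir-nginx.monitoring.svc/api/v1/push
    headers:
      X-Scope-OrgID: beyla  # separate Mimir tenant for trace-derived series
```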

This separation, combined with the split between centralized and node-local collection, keeps the core observability pipeline lean while still giving me fine-grained insight into the cluster.

Basic Observability

Before putting the full observability stack in place, I wanted a way to verify that what I was collecting matched what the cluster was actually experiencing in real time. The simplest approach was to start with the Kubernetes Dashboard and the metrics-server. The metrics-server gives you access to live resource usage data straight from the kubelet, which makes it perfect for quick sanity checks. With it running, the kubectl top command and the dashboard’s built-in resource views become an immediate point of comparison against the values being scraped by my observability pipeline.

While this setup was essential for validating my metrics early on, the metrics-server also plays a long-term role in the cluster. It’s the standard source of resource metrics for Kubernetes Horizontal Pod Autoscalers, allowing workloads to scale up or down based on CPU and memory usage without any extra configuration. This means that even after the rest of the observability stack is in place, metrics-server continues to serve as both a validation tool and a core component for workload autoscaling.
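For illustration, a minimal HPA that relies on metrics-server looks like this (the Deployment name and thresholds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # scale out above 80% of requested CPU
```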

Out of the box, it didn't quite work with my networking setup: the API server needed to trust my custom certificates. To resolve this, I used the tls configuration block, specifying the StepClusterIssuer I had already created:

tls:
  type: "cert-manager"
  certManager:
    addInjectorAnnotations: false
    existingIssuer:
      enabled: true
      kind: "StepClusterIssuer"
      name: "step-issuer"

Once that was in place, things just worked:

> kubectl top nodes
NAME       CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)   
giselle    5575m        70%      15603Mi         66%         
irene      1015m        17%      5403Mi          47%         
joy        2617m        43%      9009Mi          58%         
karina     2294m        38%      4881Mi          43%         
ningning   2278m        28%      10870Mi         46%         
seulgi     795m         20%      3033Mi          55%         
wendy      7611m        63%      27474Mi         57%         
winter     4436m        55%      17569Mi         74%         
yeri       3332m        56%      6286Mi          40%         

Ingesting Logs

With the baseline metrics in place, the next step was to get log collection up and running. For this, I chose Loki as the backend. Loki’s index-free design keeps storage costs low while still letting me query across large volumes of logs, which is ideal for a self-hosted setup. Given my relatively low ingestion volume, I deployed it in simple scalable mode. This keeps the deployment footprint and operational complexity small while still allowing the system to scale out later if needed.
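In the Loki Helm chart, simple scalable mode boils down to a few values like these (replica counts here are illustrative, not my exact numbers):

```yaml
# Simple scalable mode: read, write, and backend targets scale independently.
deploymentMode: SimpleScalable
write:
  replicas: 2
read:
  replicas: 2
backend:
  replicas: 2
```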

On the collection side, I configured Grafana Alloy to tail logs directly from Kubernetes pods with discovery.kubernetes and loki.source blocks:

discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  ...
}

Alloy’s log pipeline features — particularly its relabeling, parsing, and field extraction — made it a strong choice for handling logs before they hit Loki. In my setup, all applications run in debug mode by default, but the debug-level logs are dropped at the collection layer. This means I pay a small CPU cost for tailing those extra lines, but the tradeoff is flexibility: if I ever need full debug output from a service, I can update the Alloy configuration and immediately start forwarding debug logs to Loki without touching the application itself. No restarts, no redeploys — just a quick config change and the extra detail is live.
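The drop itself can be sketched with a loki.process pipeline like the one below (assuming JSON-formatted logs with a "level" field; the component labels and write target are placeholders):

```alloy
loki.process "drop_debug" {
  forward_to = [loki.write.default.receiver]

  // Extract the "level" field from JSON log lines.
  stage.json {
    expressions = { "level" = "level" }
  }

  // Drop entries whose extracted level is exactly "debug".
  // Removing this stage is all it takes to start shipping debug logs.
  stage.drop {
    source = "level"
    value  = "debug"
  }
}
```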

This approach also gives me the opportunity to enrich logs before they’re stored. Alloy automatically attaches Kubernetes metadata such as pod name, namespace, and container image, and it can normalize log formats so that queries in Loki are consistent regardless of the application’s original logging style. Combined with autodiscovery, this means new workloads start streaming enriched, filtered logs into Loki as soon as they hit the cluster, with minimal operational overhead.

Ingesting Metrics

With logs flowing into Loki, the next layer to set up was metrics. Detailed information about the metric collection can be found here, so I'll summarize it quickly:

  • cluster-wide metric endpoint autodiscovery is handled by Alloy
  • the OpenTelemetry collector scrapes containerd metrics

All metrics flow into Grafana Mimir, which I have deployed in microservices mode. In practice, I run it with far fewer resources than the default Helm chart suggests, since my workload doesn’t require the full reference capacity. The main tuning challenge hasn’t been CPU or memory — it’s network throughput. My hosting provider, Contabo, caps each node at 100 Mb/s, so I’ve set relatively high replica counts for certain components to spread the traffic load. This keeps ingestion and query performance stable even when there’s still unused CPU headroom but the bandwidth limit is in play.
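In the mimir-distributed Helm chart that tuning is just per-component replica counts; a sketch (the numbers are illustrative, not my actual values):

```yaml
# Spread ingest/query traffic across nodes to stay under the
# per-node 100 Mb/s cap, even with CPU headroom to spare.
distributor:
  replicas: 5
ingester:
  replicas: 6
querier:
  replicas: 4
```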

To ensure that ingester pods are placed properly, I use the following topologySpreadConstraints along with affinity:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester
    matchLabelKeys:
      - pod-template-hash  # This helps with rolling updates

affinity:
  podAntiAffinity:
    # Soft preference: try to spread pods across nodes
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - ingester
          topologyKey: kubernetes.io/hostname
This gives a best-effort even distribution of the ingesters across nodes, without blocking scheduling when not enough nodes are available.

Ingesting Traces

The final piece of the observability stack is tracing, which ties together the metrics and logs into a cohesive view of application behavior. Beyla runs on every node, using eBPF to capture trace data without requiring any changes to application code. In addition to traces, Beyla also emits associated metrics and log context, so a single request can be followed across all three observability pillars. All of this data is sent to a node-local OpenTelemetry Collector.

The OTEL Collector acts as the central router for tracing data. It exports spans to Tempo, sends trace-derived metrics to Mimir, and pushes any relevant logs to Loki. This ensures that the same request can be explored from multiple angles — from its raw logs, to aggregated metrics, to a detailed span breakdown — all linked by the same context.
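That routing is just the collector's service pipelines; a sketch (exporter names and endpoints are placeholders, and the metrics pipeline assumes a prometheus receiver scraping containerd is also defined):

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.monitoring.svc:4317
  prometheusremotewrite/mimir:
    endpoint: http://mimir-nginx.monitoring.svc/api/v1/push
  otlphttp/loki:
    endpoint: http://loki-gateway.monitoring.svc/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```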

Tempo is configured to generate service graphs, which are then written back into Mimir as metrics. This creates a feedback loop between tracing and metrics: I can spot service-to-service latency or error patterns in a dashboard, then drill into the trace data in Tempo to see exactly where in the request path the issue originates. With Beyla providing the raw visibility, the OTEL Collector directing traffic, and Tempo, Mimir, and Loki each handling their respective data types, the cluster has a full observability pipeline where every signal reinforces the others.
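The feedback loop lives in Tempo's metrics-generator config; roughly like this (the remote-write URL and WAL path are placeholders for my actual values):

```yaml
metrics_generator:
  processor:
    service_graphs: {}
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir-nginx.monitoring.svc/api/v1/push

# Enable the processor for all tenants.
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs]
```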

Below is a screenshot of the service graph capability along with the trace data for a /sign request from step-ca:

Closing Thoughts

Overall, this setup has proven to be a solid, reliable foundation for observability in my cluster, covering metrics, logs, and traces in a way that’s both practical to operate and easy to extend. There’s still room to grow — adding exemplars would make it even easier to pivot directly from metrics to traces, and integrating tools like Parca and Pyroscope would bring continuous profiling into the mix. But even as it stands, the system delivers the visibility I need while leaving a clear path for future enhancements.