Multi-Cluster Monitoring

In a previous post, I outlined my decision to migrate away from Contabo. The new cluster is now stable and running smoothly, with the full LGTM stack (Loki, Grafana, Tempo, Mimir) deployed and collecting metrics, logs, and traces. The old Contabo cluster (SM) is still operational with its own complete LGTM deployment, but it needs to be decommissioned.

Here's the problem: I still need observability on the SM cluster while I wind it down. Services are being migrated gradually, and I need to monitor both what's still running and confirm that workloads have successfully moved over. Running two separate Grafana instances and context-switching between them isn't practical.

This migration created the perfect opportunity to implement something I'd been wanting to explore anyway: multi-cluster monitoring through a single Grafana UI.

Constraints

The SM cluster is already struggling with the workloads it's running - high CPU steal time, inconsistent disk performance, and a hard 100 Mb/s network cap. This is exactly why I'm migrating away from it.

The constraints are severe enough that I couldn't even run the full observability stack reliably. Running Alloy for scraping alongside Loki and Tempo required aggressive filtering and downsampling just to keep things stable. The network limitation is particularly problematic: Mimir replicates ingested metrics across multiple ingesters, so scrape traffic gets amplified. Scraping aggressively from the new cluster would saturate the 100 Mb/s cap purely from replication traffic, never mind the actual application workloads I still need to run during the migration.

The solution needed to avoid putting additional load on the SM cluster while still maintaining full observability during the wind-down.

Design

The approach centers on using Mimir's multi-tenancy capabilities. The new cluster's Mimir deployment accepts metrics from both clusters - in-cluster scrapes happen locally, while the SM cluster's Alloy instances remote-write to the new cluster's Mimir ingestion endpoint.
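On the Grafana side, each tenant shows up as its own Prometheus-flavored datasource pointing at the same Mimir query endpoint, differing only in the X-Scope-OrgID header. A provisioning sketch - the in-cluster URL and datasource name are illustrative:

```yaml
# Grafana datasource provisioning: one datasource per Mimir tenant.
# The URL is a hypothetical in-cluster service address.
apiVersion: 1
datasources:
  - name: Mimir (mixedlab)
    type: prometheus
    access: proxy
    url: http://mimir-query-frontend.monitoring.svc:8080/prometheus
    jsonData:
      httpHeaderName1: X-Scope-OrgID   # tenant selection happens here
    secureJsonData:
      httpHeaderValue1: mixedlab
```

Duplicating this entry with a different header value is all it takes to browse another cluster's trace-derived metrics from the same UI.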

graph LR
  subgraph loona [loona cluster]
    mimir
    tempo
    tempo --"OrgId: *-trace-metrics"--> mimir
    loona.alloy[alloy] --OrgId: mixedlab--> mimir
  end
  subgraph sm [sm cluster]
    sm.beyla[beyla] --> tempo
    sm.mimir[mimir]
    sm.alloy[alloy] --OrgId: mixedlab--> mimir
  end
  subgraph external
    seaweedfs
    mimir --> seaweedfs
  end
  misc.prometheus[legacy prometheus hosts] --OrgId: oldlab--> sm.mimir --> seaweedfs

All infrastructure-level metrics go under a single tenant: Mimir self-monitoring metrics, cAdvisor scrapes, node metrics, and other cluster internals from both environments. While these metrics have technically unbounded cardinality, they're practically bounded in reality. The label sets are known and predictable, and time series churn follows the natural lifecycle of pods spinning up and down. Since both clusters are relatively similar in size, co-locating this data under one tenant keeps things simple without risking cardinality explosion.
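Under the hood this is just a tenant header on the remote-write path. A minimal sketch in Prometheus-style remote_write YAML (Alloy's prometheus.remote_write component takes equivalent fields); the endpoint URL is a placeholder:

```yaml
# Hedged sketch: SM-side remote_write into the new cluster's Mimir,
# tagged with the shared infrastructure tenant.
remote_write:
  - url: https://mimir.example.internal/api/v1/push   # hypothetical endpoint
    headers:
      X-Scope-OrgID: mixedlab       # single tenant for infra metrics
    queue_config:
      max_shards: 10                # keep shard fan-out modest to respect the 100 Mb/s cap
```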

One important caveat concerns Mimir's recording rules during version upgrades. The default recording rules, shipped for use with the built-in dashboards, change often enough between releases that this needs to be planned for: if the deployed rules don't match the version of Mimir running in the cluster, rule evaluation fails. In practice that means anywhere from a few minutes to a few hours of degraded dashboards during Mimir upgrades. This is an acceptable tradeoff - the raw series don't disappear, and it keeps maintenance simple.

Trace-derived metrics, however, get separate tenants per cluster. Application behavior during failures or traffic spikes can cause unexpected cardinality explosions - a retry storm or a misconfigured service could generate massive label combinations. Isolating these by cluster prevents bad behavior in one environment from impacting observability of the other.

Implementation

From prior experience, I knew the full metrics set I'm collecting - LGTM stack self-monitoring, Longhorn, Secrets CSI, and everything else since I don't use any cloud integrations - sits at around 400k active series. With the migration approach, there's inherent duplication: the Contabo cluster keeps its existing monitoring in place while also pushing to the new mixed tenant. That puts me at roughly 1 million active series during the transition.

At that scale, I needed Kafka at ingress to handle the write load and provide buffering. I deployed it using the Strimzi operator. Based on the ingest rate, I anticipated around 20GB of disk writes per broker in the Kafka PVCs. This required aggressive retention and truncation policies, but that's perfectly fine - Mimir is time-sensitive and rejects old samples anyway, so there's no value in Kafka holding onto data for extended periods. The goal is buffering and delivery guarantees, not long-term storage.
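The retention policy can be expressed on the ingest topic itself. A sketch as a Strimzi KafkaTopic resource - the topic name, cluster label, partition count, and sizes are all illustrative:

```yaml
# Hedged sketch of aggressive retention for the remote-write buffer topic.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: mimir-ingest
  labels:
    strimzi.io/cluster: monitoring-kafka   # hypothetical Strimzi cluster name
spec:
  partitions: 12
  replicas: 1
  config:
    retention.ms: 3600000         # 1 hour - Mimir rejects old samples anyway
    retention.bytes: 5368709120   # 5 GiB cap per partition
    segment.bytes: 268435456      # smaller segments so truncation kicks in promptly
```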

I also created the following storage class specifically for Kafka:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-kafka-broker
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  dataLocality: "best-effort"
  staleReplicaTimeout: "2880"
  fromBackup: ""
volumeBindingMode: "WaitForFirstConsumer"

The combination of dataLocality: best-effort and volumeBindingMode: WaitForFirstConsumer makes Longhorn try to place each broker's volume on the same node as its pod, without blocking scheduling when capacity constraints prevent it. Meanwhile numberOfReplicas: "1" drops Longhorn's own data redundancy, since Kafka already has fault tolerance built in.
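For completeness, this is roughly how the broker storage references that class in the Strimzi Kafka resource (a fragment of the CR; the size is illustrative):

```yaml
# Fragment of a Strimzi Kafka CR: brokers claim volumes from the
# single-replica, locality-preferring class defined above.
spec:
  kafka:
    storage:
      type: persistent-claim
      size: 30Gi                      # illustrative, sized from the ~20GB write estimate
      class: longhorn-kafka-broker
      deleteClaim: true               # buffer data has no value after the broker is gone
```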

Validation

I used Mimir's built-in dashboards to validate the setup, specifically the "Mimir / Writes" dashboard. The metrics confirmed everything was working as expected:

In-memory series: 1.43M - I had anticipated around 1.2M from simple math (400k active series × 3 for duplication and mixed tenants), so this tracked closely with expectations.

Samples/sec: 124k - sustained ingest rate with no drops or backpressure.

p99 write latency: ~200ms - well within acceptable bounds for remote-write.

[Grafana dashboard snapshot would go here]

After completing the cutover, the old SM cluster's Mimir dropped to just 1.5k in-memory series. This makes sense - the only things still writing to it are legacy hosts running single-app Prometheus exporters and a handful of IoT devices that haven't been migrated yet. This also provides a built-in verification mechanism: I'll know I've migrated the last of my metrics workloads when this number drops to zero.