Legacy NMS — the FCAPS box that polled SNMP, ingested CSV PM files, and rendered PowerPoint dashboards — is finally dying. In 2026, every greenfield 5G core NOC is built on the same stack: Prometheus, Grafana, Tempo, Loki. Sometimes Mimir or VictoriaMetrics for long-term storage. The acronym is LGTM (Loki, Grafana, Tempo, Mimir).

If you are still running OSS Mediation 4.x, this is what your replacement looks like.

The three pillars, telecom edition

| Pillar | Tool | 5G use case |
|---|---|---|
| Metrics | Prometheus / Mimir | Session count, throughput, NRF registration health, RAN KPIs |
| Traces | Tempo | SBA call chains: AMF -> SMF -> UPF, latency per hop |
| Logs | Loki | NF logs, gNB syslogs, audit trails |

The magic is correlation. Click a latency spike in a Grafana panel, jump to the trace, jump from trace span to the logs of the pod that emitted it. One UI, three data stores, two clicks.

Prometheus for 5G NFs

3GPP TS 28.552 defines the PM measurements. Most cloud-native NFs expose them on a /metrics endpoint in Prometheus exposition format. The naming is not perfectly standardized: vendors map 3GPP measurement names to Prometheus metric names with their own prefixes, so expect to normalize names per vendor. The measurements behind those names still follow the 3GPP definitions, so the values are comparable.

A reasonable scrape config:

scrape_configs:
  - job_name: 5gc-nfs
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_label_nf_type]
        target_label: nf_type
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
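
To make the relabeling semantics concrete, here is a rough Python model of what the three rules above do to one discovered pod. It is a sketch, not Prometheus code: the function name `relabel_pod` and the sample pods are hypothetical. It relies on two documented behaviors: relabel regexes are fully anchored, and a missing source label reads as an empty string.

```python
import re
from typing import Optional


def relabel_pod(meta: dict) -> Optional[dict]:
    """Return the target labels for a pod, or None if the keep rule drops it."""
    # Rule 1 (action: keep). A missing label reads as "", and the regex is
    # fully anchored, so only the exact annotation value "true" passes.
    scrape = meta.get("__meta_kubernetes_pod_annotation_prometheus_io_scrape", "")
    if not re.fullmatch("true", scrape):
        return None
    # Rules 2 and 3 (default action: replace) copy the source value into
    # the target label.
    labels = {
        "nf_type": meta.get("__meta_kubernetes_pod_label_nf_type", ""),
        "namespace": meta.get("__meta_kubernetes_namespace", ""),
    }
    # A label set to an empty value is treated as not set at all.
    return {name: value for name, value in labels.items() if value}


# Hypothetical discovery metadata for two pods.
smf_pod = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_label_nf_type": "smf",
    "__meta_kubernetes_namespace": "5gc",
}
unannotated_pod = {"__meta_kubernetes_namespace": "5gc"}

print(relabel_pod(smf_pod))          # kept, with nf_type and namespace labels
print(relabel_pod(unannotated_pod))  # dropped by the keep rule
```

The practical consequence: a pod annotated with "True" or "yes" is silently not scraped, which is worth checking first when an NF is missing from dashboards.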

Key metrics to watch per NF:

  • NRF: nrf_nf_instances_total, nrf_nf_status_notify_total
  • AMF: amf_registered_subscribers, amf_n1_n2_message_transfer_total, amf_pdu_session_establishment_total
  • SMF: smf_session_count, smf_pfcp_message_total{type="..."}
  • UPF: upf_session_count, upf_throughput_bytes, upf_dropped_packets_total

RAN side, gNBs increasingly export Prometheus directly or via an O1 -> Prometheus adapter. O-RAN SC's SMO already speaks Prometheus.

Recording rules for derived KPIs

3GPP KPIs like Registration Success Rate are ratios. Compute them once with recording rules:

groups:
  - name: 5gc-kpis
    interval: 30s
    rules:
      - record: amf:registration_success_rate:5m
        expr: |
          sum(rate(amf_registration_total{result="success"}[5m]))
            /
          sum(rate(amf_registration_total[5m]))
      - record: smf:session_establishment_success_rate:5m
        expr: |
          sum(rate(smf_session_establishment_total{result="success"}[5m]))
            /
          sum(rate(smf_session_establishment_total[5m]))

Dashboards now query the recording rule, which is cheap, instead of recomputing the ratio every render.
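
For readers less used to PromQL, the ratio these rules compute can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the counter values and instance names are hypothetical, and real `rate()` also handles counter resets and extrapolates to the window edges, which this does not.

```python
WINDOW_SECONDS = 300  # the [5m] range in the recording rule


def per_second_rate(first: float, last: float, window: float = WINDOW_SECONDS) -> float:
    """Average per-second increase of a counter across the window."""
    return (last - first) / window


def success_rate(samples: dict) -> float:
    """samples maps (instance, result) -> (counter at window start, counter at window end)."""
    total = sum(per_second_rate(a, b) for a, b in samples.values())
    success = sum(
        per_second_rate(a, b)
        for (_, result), (a, b) in samples.items()
        if result == "success"
    )
    # No traffic in the window means the ratio is undefined, not zero.
    return success / total if total else float("nan")


# Hypothetical amf_registration_total samples for two AMF instances.
samples = {
    ("amf-0", "success"): (10_000, 10_970),
    ("amf-0", "failure"): (200, 230),
    ("amf-1", "success"): (8_000, 8_980),
    ("amf-1", "failure"): (100, 120),
}

ratio = success_rate(samples)
print(f"registration success rate: {ratio:.3f}")
```

Summing before dividing matters: it weights each instance by its traffic, so one idle AMF with a single failed registration does not drag the KPI down.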

Tempo and distributed tracing

SBA is HTTP/2-based. Every NF that supports W3C Trace Context propagates traceparent headers across SBI calls. Without tracing, debugging "why did this PDU session establishment take 8 seconds" means correlating logs across 5 NFs by timestamp. With tracing, you get a flame graph.

Typical PDU session establishment trace:

amf.handle_pdu_session_establishment       8200ms
  amf.smf_select (NRF)                       45ms
  amf.create_sm_context (SMF)              7800ms  <-- here
    smf.udm_get_session_management_data     120ms
    smf.pcf_select_for_session              80ms
    smf.pcf_create_session                 1500ms
    smf.upf_select                          30ms
    smf.n4_session_establishment           6000ms  <-- and here
  amf.n2_pdu_session_request_to_gnb         200ms

Now you know the bottleneck is N4 to UPF, not AMF or SMF logic. That investigation used to take an afternoon.
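
Reading that trace by eye amounts to following the slowest child at each level. A minimal Python sketch of that walk, using the durations from the trace above (the `Span` class and `slowest_path` function are illustrative names, not part of Tempo or OpenTelemetry):

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def slowest_path(span: Span) -> list:
    """Follow the longest-running child at each level, root to leaf."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda child: child.duration_ms)
        path.append(span.name)
    return path


# The PDU session establishment trace from above.
trace = Span("amf.handle_pdu_session_establishment", 8200, [
    Span("amf.smf_select", 45),
    Span("amf.create_sm_context", 7800, [
        Span("smf.udm_get_session_management_data", 120),
        Span("smf.pcf_select_for_session", 80),
        Span("smf.pcf_create_session", 1500),
        Span("smf.upf_select", 30),
        Span("smf.n4_session_establishment", 6000),
    ]),
    Span("amf.n2_pdu_session_request_to_gnb", 200),
])

print(" -> ".join(slowest_path(trace)))
```

This assumes child spans run sequentially, as they do in this trace. With parallel calls, a proper critical-path analysis would need span start times as well.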

Tempo stores traces cheaply (object storage). Sample at 1-5% in steady state, 100% during incidents (use head-based sampling with overrides, or tail-based sampling with the OpenTelemetry Collector).
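
The idea behind tail-based sampling is to decide after a trace has completed, so slow or failed traces are never thrown away. The sketch below models only that decision in Python; it is not how the OpenTelemetry Collector is configured or implemented, and the 2-second threshold and 5% baseline are assumed values.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, has_error: bool,
               slow_ms: float = 2000.0, baseline: float = 0.05) -> bool:
    """Decide, once the whole trace is known, whether to store it."""
    if has_error or duration_ms >= slow_ms:
        return True  # always keep the traces someone will want to debug
    # Map the trace ID to a stable bucket in [0, 1] so that every collector
    # replica makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline


print(keep_trace("4bf92f3577b34da6", duration_ms=8200, has_error=False))  # slow: kept
print(keep_trace("00f067aa0ba902b7", duration_ms=40, has_error=True))     # error: kept
```

During an incident, raising `baseline` to 1.0 approximates the "100% during incidents" mode. Head-based sampling cannot apply the first rule at all, because it decides before the duration is known.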

Loki for NF logs

Loki indexes labels, not log content. This is the right tradeoff for telecom: you query by nf_type="smf", region="eu-west-1", severity="error", then grep within that stream.

A useful LogQL query, finding PFCP failures correlated with a specific UPF:

{nf_type="smf", region="eu-west-1"}
  |~ "PFCP" |~ "failure|timeout"
  | json
  | upf_node_id="upf-edge-03"

NF log discipline matters more than the tooling. Structured JSON logs, with consistent fields (nf_instance_id, supi when present and policy allows, pdu_session_id, trace_id), make Loki useful. Free-text logs make it a slow grep.

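SUPI in logs is a privacy issue. Log a hashed or otherwise pseudonymized identifier instead. The SUCI is also privacy-safe, but it changes between registrations, so it is a poor key for correlating events about one subscriber. Most NFs have a config flag for this; turn it on.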
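
As an illustration of both points, structured fields and no raw SUPI, here is a minimal Python sketch of a log helper that emits JSON with a keyed hash in place of the subscriber identity. The key, the `subscriber` field name, and the sample values are all hypothetical; an actual NF would use its vendor's mechanism and load the key from a secret store.

```python
import hashlib
import hmac
import json

# Hypothetical key for illustration only; never hard-code this in practice.
PSEUDONYM_KEY = b"example-key-rotate-me"


def pseudonymize_supi(supi: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Keyed hash: stable per subscriber, not reversible without the key."""
    return hmac.new(key, supi.encode(), hashlib.sha256).hexdigest()[:16]


def log_line(event: str, supi: str, **fields) -> str:
    """Build one structured JSON log record with consistent field names."""
    record = {"event": event, "subscriber": pseudonymize_supi(supi), **fields}
    return json.dumps(record, sort_keys=True)


line = log_line(
    "pfcp_timeout",
    "imsi-001010000000001",
    nf_instance_id="smf-0",
    pdu_session_id=5,
    trace_id="abc123",
    upf_node_id="upf-edge-03",
)
print(line)
```

A keyed HMAC rather than a plain hash matters here: SUPIs are low-entropy, so an unkeyed SHA-256 of an IMSI can be reversed by brute force.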

Correlating across pillars

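The Grafana features that tie it together all live in data source config: derived fields on the Loki data source, trace-to-logs on the Tempo data source, and exemplars on the Prometheus data source: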

  • A log line in Loki containing trace_id=abc123 becomes a clickable link to that trace in Tempo.
  • A trace span in Tempo links to logs filtered by trace_id=abc123 in Loki.
  • A metric panel in Grafana with exemplars shows trace IDs at outlier points; click to jump to Tempo.

Set up exemplars on Prometheus (requires --enable-feature=exemplar-storage, plus instrumentation in the NFs that attaches trace IDs to metric samples, for example via OpenTelemetry). That one feature, once enabled, pays for the whole observability migration.
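
The log-to-trace link in the first bullet is, at heart, a regex capture plus a URL template. A small Python sketch of that mechanism follows; the regex, the base URL, and the function name are assumptions for illustration, and Grafana itself is configured through data source settings rather than code like this.

```python
import re
from typing import Optional

# Same idea as a matcher regex: one capture group around the trace ID.
TRACE_ID_PATTERN = re.compile(r"trace_id=(\w+)")


def trace_link(log_line: str,
               tempo_base: str = "https://grafana.example/trace/") -> Optional[str]:
    """Return a link to the trace if the log line carries a trace ID."""
    match = TRACE_ID_PATTERN.search(log_line)
    return tempo_base + match.group(1) if match else None


print(trace_link('level=error msg="PFCP timeout" trace_id=abc123'))
print(trace_link('level=info msg="heartbeat ok"'))
```

This is why the trace_id field has to be logged in one consistent format across NFs: a single regex per data source has to match all of them.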

Storage sizing, roughly

For a mid-size operator (5M subscribers, 9 NFs, 3 regions):

| Pillar | Daily volume | 90-day retention |
|---|---|---|
| Metrics | 50-100 GB | 5-10 TB (Mimir, compressed) |
| Traces | 200-500 GB at 5% sampling | object storage, lifecycle to cold after 14d |
| Logs | 1-3 TB | 90-270 TB raw, ~30% with Loki compression |

Logs dominate. Be strict about log levels (INFO, not DEBUG, in production) and compress aggressively. Cold storage tiering after 14 days is standard.
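
The retention column is just daily volume times 90 days, which is easy to redo for your own numbers. A short Python sketch, with one loud assumption: it reads "~30% with Loki compression" as stored size being about 30% of raw, which the table does not state explicitly.

```python
RETENTION_DAYS = 90
GB_PER_TB = 1000  # decimal units


def retained_tb(daily_gb: float, days: int = RETENTION_DAYS) -> float:
    """Raw volume accumulated over the retention period, in TB."""
    return daily_gb * days / GB_PER_TB


# Daily ranges (low, high) in GB, taken from the table above.
daily_gb = {"metrics": (50, 100), "logs": (1000, 3000)}

retained = {
    pillar: (retained_tb(low), retained_tb(high))
    for pillar, (low, high) in daily_gb.items()
}

# Assumption: stored size is ~30% of raw. If the figure means a 30%
# reduction instead, use 0.7 here.
LOKI_STORED_FRACTION = 0.3
logs_compressed = tuple(tb * LOKI_STORED_FRACTION for tb in retained["logs"])

for pillar, (low, high) in retained.items():
    print(f"{pillar}: {low:.1f}-{high:.1f} TB raw over {RETENTION_DAYS} days")
print(f"logs after compression: {logs_compressed[0]:.0f}-{logs_compressed[1]:.0f} TB")
```

The metrics result (4.5 to 9 TB) lines up with the table's rounded 5-10 TB, and even after compression the log volume is several times larger than metrics.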

Replacing legacy NMS

The migration that works:

  1. Stand up Prometheus/Grafana alongside the legacy NMS. Both run.
  2. Build the top 20 dashboards the NOC actually uses on the new stack.
  3. Build alerting rules in Alertmanager mirroring the legacy thresholds.
  4. Run shadow mode for 60 days. Both alert. Compare false positive/negative rates.
  5. Cut over alerting. Decommission legacy NMS in two phases: alerting first, dashboards second.

The failure mode is trying to replicate the legacy NMS dashboard library 1:1. Most of those dashboards were never used. Audit usage logs first.
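
For the shadow-mode comparison in step 4, it helps to agree up front on how the rates are computed. One possible sketch in Python, assuming each fired alert has been tagged with the real incident it corresponds to (or with nothing); the incident IDs and alert lists are made up for illustration.

```python
from typing import Optional


def alert_quality(alerts: list, incidents: set) -> dict:
    """alerts holds one entry per fired alert: the matching incident ID,
    or None if the alert matched no real incident."""
    false_positives = sum(1 for matched in alerts if matched is None)
    detected = {matched for matched in alerts if matched is not None}
    missed = incidents - detected
    return {
        "false_positive_rate": false_positives / len(alerts) if alerts else 0.0,
        "false_negative_rate": len(missed) / len(incidents) if incidents else 0.0,
    }


# Hypothetical results from one shadow period with four real incidents.
incidents = {"INC-1", "INC-2", "INC-3", "INC-4"}
legacy_alerts: list = ["INC-1", None, None, "INC-2", None, "INC-3"]
new_alerts: list = ["INC-1", "INC-2", "INC-3", "INC-4", None]

legacy_quality = alert_quality(legacy_alerts, incidents)
new_quality = alert_quality(new_alerts, incidents)
print("legacy:", legacy_quality)
print("new:   ", new_quality)
```

The tagging is the expensive part and is usually done from incident tickets. Without it, the two systems can only be compared against each other, not against reality.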

Alertmanager: routing for telecom

NOCs have escalation hierarchies. Alertmanager routes by labels:

route:
  receiver: noc-tier1
  group_by: [alertname, region, nf_type]
  routes:
    - matchers: [severity="critical", nf_type=~"upf|smf|amf"]
      receiver: noc-tier2-pager
      continue: true
    - matchers: [region="eu-west-1"]
      receiver: eu-noc

group_by collapses 50 alerts that share the same alertname, region, and nf_type, such as those raised by one failed UPF, into one notification. This is the difference between "actionable" and "page storm."
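
The grouping itself is simple to model: alerts whose group_by labels are identical land in the same bucket, and each bucket produces one notification. A Python sketch with hypothetical alert names and pods follows; real Alertmanager also applies timers such as group_wait and group_interval, which are left out here.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "region", "nf_type")  # same labels as the route above


def group_alerts(alerts: list) -> dict:
    """Bucket alerts by the values of their group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical page storm: 50 per-pod alerts from one failing UPF, plus one AMF alert.
alerts = [
    {"alertname": "UPFSessionDrop", "region": "eu-west-1",
     "nf_type": "upf", "pod": f"upf-edge-03-{i}"}
    for i in range(50)
]
alerts.append({"alertname": "AMFRegistrationFailure", "region": "eu-west-1",
               "nf_type": "amf", "pod": "amf-0"})

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, "->", len(members), "alerts, 1 notification")
```

Note that the pod label is deliberately not in group_by. Adding it would put every pod in its own group and bring the page storm back.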

> The LGTM stack is not a telecom product, but it has eaten telecom observability because it is honest about cost and because correlation across metrics, traces, and logs is the only way to debug a cloud-native 5G core.

If your NOC dashboard still opens in Internet Explorer, the cost of staying there is now higher than the cost of switching.