Legacy NMS — the FCAPS box that polled SNMP, ingested CSV PM files, and rendered PowerPoint dashboards — is finally dying. In 2026, every greenfield 5G core NOC is built on the same stack: Prometheus, Grafana, Tempo, Loki. Sometimes Mimir or VictoriaMetrics for long-term storage. The acronym is LGTM (Loki, Grafana, Tempo, Mimir).
If you are still running OSS Mediation 4.x, this is what your replacement looks like.
The three pillars, telecom edition
| Pillar | Tool | 5G use case |
|---|---|---|
| Metrics | Prometheus / Mimir | Session count, throughput, NRF registration health, RAN KPIs |
| Traces | Tempo | SBA call chains: AMF -> SMF -> UPF, latency per hop |
| Logs | Loki | NF logs, gNB syslogs, audit trails |
The magic is correlation. Click a latency spike in a Grafana panel, jump to the trace, jump from trace span to the logs of the pod that emitted it. One UI, three data stores, two clicks.
Prometheus for 5G NFs
3GPP TS 28.552 defines the PM measurements. Most cloud-native NFs expose them on a /metrics endpoint in Prometheus exposition format. The naming is not perfectly standardized — vendors map 3GPP measurement names to Prometheus metric names with their own prefixes — but the values are correct.
A reasonable scrape config:
```yaml
scrape_configs:
  - job_name: 5gc-nfs
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy the pod's nf_type label and its namespace onto every series
      - source_labels: [__meta_kubernetes_pod_label_nf_type]
        target_label: nf_type
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
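To make the relabeling concrete, here is a rough Python model of what those three rules do to one discovered pod. It is a sketch of the semantics only, not Prometheus code: the `relabel` function and the example pod metadata are hypothetical.

```python
import re
from typing import Optional


def relabel(meta: dict[str, str]) -> Optional[dict[str, str]]:
    """Model the three relabel rules: keep annotated pods, copy two labels.

    Returns the resulting target labels, or None if the pod is dropped.
    """
    # Rule 1: action=keep drops any target whose annotation is not "true".
    # Prometheus anchors relabel regexes, which re.fullmatch mirrors.
    scrape = meta.get("__meta_kubernetes_pod_annotation_prometheus_io_scrape", "")
    if not re.fullmatch("true", scrape):
        return None
    # Rules 2 and 3: the default action (replace) copies source into target_label.
    return {
        "nf_type": meta.get("__meta_kubernetes_pod_label_nf_type", ""),
        "namespace": meta.get("__meta_kubernetes_namespace", ""),
    }


# Hypothetical SMF pod as Kubernetes service discovery might present it.
smf_pod = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_label_nf_type": "smf",
    "__meta_kubernetes_namespace": "5gc",
}
print(relabel(smf_pod))
```

A pod without the annotation never becomes a scrape target, which is why the annotation is the cheapest way to opt an NF in or out.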
Key metrics to watch per NF:
- NRF: nrf_nf_instances_total, nrf_nf_status_notify_total
- AMF: amf_registered_subscribers, amf_n1_n2_message_transfer_total, amf_pdu_session_establishment_total
- SMF: smf_session_count, smf_pfcp_message_total{type="..."}
- UPF: upf_session_count, upf_throughput_bytes, upf_dropped_packets_total
On the RAN side, gNBs increasingly export Prometheus metrics directly or via an O1 -> Prometheus adapter. O-RAN SC's SMO already speaks Prometheus.
Recording rules for derived KPIs
3GPP KPIs like Registration Success Rate are ratios. Compute them once with recording rules:
```yaml
groups:
  - name: 5gc-kpis
    interval: 30s
    rules:
      - record: amf:registration_success_rate:5m
        expr: |
          sum(rate(amf_registration_total{result="success"}[5m]))
          /
          sum(rate(amf_registration_total[5m]))
      - record: smf:session_establishment_success_rate:5m
        expr: |
          sum(rate(smf_session_establishment_total{result="success"}[5m]))
          /
          sum(rate(smf_session_establishment_total[5m]))
```
Dashboards now query the recording rule, which is cheap, instead of recomputing the ratio every render.
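For intuition, the ratio those rules compute can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it takes one counter sample at each end of the window and tolerates a single reset, whereas the real `rate()` also extrapolates to the window edges. The function names are hypothetical. The per-second division cancels out in a ratio, so the sketch works directly with increases.

```python
from typing import Optional


def counter_increase(start: float, end: float) -> float:
    """Increase of a monotonic counter over a window, tolerating one reset."""
    # A counter that went backwards was reset (NF restart): count from zero.
    return end - start if end >= start else end


def success_rate(success: tuple[float, float],
                 total: tuple[float, float]) -> Optional[float]:
    """Successful events divided by all events over the same window."""
    total_inc = counter_increase(*total)
    if total_inc == 0:
        return None  # no attempts in the window, so the ratio is undefined
    return counter_increase(*success) / total_inc


# 1000 registration attempts in the window, 990 of them successful.
print(success_rate(success=(900, 1890), total=(1000, 2000)))
```

The undefined case matters in practice: a quiet region at 03:00 should show "no data", not a 0% success rate that pages someone.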
Tempo and distributed tracing
SBA is HTTP/2-based. Every NF that supports W3C Trace Context propagates traceparent headers across SBI calls. Without tracing, debugging "why did this PDU session establishment take 8 seconds" means correlating logs across 5 NFs by timestamp. With tracing, you get a flame graph.
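As a rough illustration of what propagation means on the wire, the sketch below parses a version-00 `traceparent` header and builds the header for an outgoing SBI call. It follows the W3C Trace Context field layout (version, trace ID, parent span ID, flags), but the helper names are hypothetical and a real NF would normally leave this to its OpenTelemetry SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (version 00 layout only).
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(child_traceparent(ctx))
```

The trace ID stays constant across every hop, which is what lets Tempo stitch the AMF, SMF, and UPF spans into a single trace.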
Typical PDU session establishment trace:
```text
amf.handle_pdu_session_establishment         8200ms
  amf.smf_select (NRF)                         45ms
  amf.create_sm_context (SMF)                7800ms  <-- here
    smf.udm_get_session_management_data       120ms
    smf.pcf_select_for_session                 80ms
    smf.pcf_create_session                   1500ms
    smf.upf_select                             30ms
    smf.n4_session_establishment             6000ms  <-- and here
  amf.n2_pdu_session_request_to_gnb           200ms
```
Now you know the bottleneck is N4 to UPF, not AMF or SMF logic. That investigation used to take an afternoon.
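The same reading of the trace can be done mechanically by computing each span's self time, meaning its duration minus the time spent in its children. The sketch below assumes the five smf.* spans are children of amf.create_sm_context and that sibling spans run one after another; the `Span` class and `self_times` helper are illustrative, not part of Tempo.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def self_times(span: Span) -> dict[str, int]:
    """Map each span name to its duration minus the sum of its children's."""
    child_total = sum(child.duration_ms for child in span.children)
    result = {span.name: span.duration_ms - child_total}
    for child in span.children:
        result.update(self_times(child))
    return result


trace = Span("amf.handle_pdu_session_establishment", 8200, [
    Span("amf.smf_select", 45),
    Span("amf.create_sm_context", 7800, [
        Span("smf.udm_get_session_management_data", 120),
        Span("smf.pcf_select_for_session", 80),
        Span("smf.pcf_create_session", 1500),
        Span("smf.upf_select", 30),
        Span("smf.n4_session_establishment", 6000),
    ]),
    Span("amf.n2_pdu_session_request_to_gnb", 200),
])

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

Under those assumptions amf.create_sm_context has only 70 ms of its own work even though it reports 7800 ms, so the wide bar in the flame graph is not where the time is spent.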
Tempo stores traces cheaply (object storage). Sample at 1-5% in steady state, 100% during incidents (use head-based sampling with overrides, or tail-based sampling with the OpenTelemetry Collector).
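A head-based sampler with an incident override can be sketched as below. The key property is that the decision depends only on the trace ID, so every NF that sees the same trace reaches the same verdict. This is an illustrative model, not the OpenTelemetry SDK's sampler; `should_sample` and its `incident` flag are hypothetical.

```python
def should_sample(trace_id: str, ratio: float, incident: bool = False) -> bool:
    """Head-based sampling decision that is stable for a given trace ID."""
    if incident:
        return True  # override: keep every trace while an incident is open
    # Treat the low 56 bits of the trace ID as a uniform random number
    # and keep the trace if it falls below ratio * 2^56.
    threshold = int(ratio * (1 << 56))
    return int(trace_id[-14:], 16) < threshold


low_id = "0" * 31 + "1"   # low bits near zero: kept at any non-trivial ratio
high_id = "f" * 32        # low bits at maximum: kept only at ratio 1.0
print(should_sample(low_id, 0.05), should_sample(high_id, 0.05))
print(should_sample(high_id, 0.05, incident=True))
```

Tail-based sampling makes the decision after the trace is complete, which lets you keep every slow or failed trace, at the cost of buffering spans in the collector.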
Loki for NF logs
Loki indexes labels, not log content. This is the right tradeoff for telecom: you query by nf_type="smf", region="eu-west-1", severity="error", then grep within that stream.
A useful LogQL query, finding PFCP failures correlated with a specific UPF:
```logql
{nf_type="smf", region="eu-west-1"}
  |~ "PFCP" |~ "failure|timeout"
  | json
  | upf_node_id="upf-edge-03"
```
NF log discipline matters more than the tooling. Structured JSON logs, with consistent fields (nf_instance_id, supi when present and policy allows, pdu_session_id, trace_id), make Loki useful. Free-text logs make it a slow grep.
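One way to enforce that discipline in a Python-based NF or sidecar is a formatter that emits one JSON object per line with a fixed field set. This is a minimal sketch using only the standard library; the `NfJsonFormatter` class and the output keys `ts`, `severity`, and `msg` are assumptions. SUPI is left out on purpose, since whether it may be logged is a policy decision.

```python
import json
import logging


class NfJsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with consistent fields."""

    # Correlation fields copied from the record when the caller supplied them.
    FIELDS = ("nf_instance_id", "pdu_session_id", "trace_id", "upf_node_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "severity": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(NfJsonFormatter())
log = logging.getLogger("smf")
log.addHandler(handler)
log.error("PFCP session establishment timeout",
          extra={"trace_id": "abc123", "upf_node_id": "upf-edge-03"})
```

With lines shaped like this, the `| json` stage in LogQL turns upf_node_id and trace_id into filterable fields with no regex needed.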
Correlating across pillars
What ties it together is Grafana data source config: derived fields on Loki, trace-to-logs on Tempo, exemplars on Prometheus. In practice:
- A log line in Loki containing trace_id=abc123 becomes a clickable link to that trace in Tempo.
- A trace span in Tempo links to logs filtered by trace_id=abc123 in Loki.
- A metric panel in Grafana with exemplars shows trace IDs at outlier points; click to jump to Tempo.
Set up exemplars on Prometheus (requires --enable-feature=exemplar-storage and OpenTelemetry instrumentation in NFs). That single feature pays for the whole observability migration.
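Under the hood, both log-to-trace and trace-to-log links are little more than a regex and a query template. The sketch below shows that idea in Python, assuming logs carry a `trace_id=<id>` token. The helper names are hypothetical and this is not Grafana's implementation.

```python
import re
from typing import Optional

# Same idea as a derived field: one capture group that isolates the trace ID.
TRACE_ID_RE = re.compile(r"trace_id=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Pull the trace ID out of a log line, if one is present."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


def logs_query_for_trace(trace_id: str, nf_type: str) -> str:
    """LogQL selecting the lines of one NF type that mention the trace."""
    return f'{{nf_type="{nf_type}"}} |= "trace_id={trace_id}"'


line = 'level=error trace_id=abc123 msg="PFCP timeout"'
trace_id = extract_trace_id(line)
print(trace_id)
print(logs_query_for_trace(trace_id, "smf"))
```

If the regex does not match your log format, the link silently never appears, so test it against real log lines before rollout.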
Storage sizing, roughly
For a mid-size operator (5M subscribers, 9 NFs, 3 regions):
| Pillar | Daily volume | 90-day retention |
|---|---|---|
| Metrics | 50-100 GB | 5-10 TB (Mimir, compressed) |
| Traces | 200-500 GB at 5% sampling | object storage, lifecycle to cold after 14d |
| Logs | 1-3 TB | 90-270 TB raw, ~30% with Loki compression |
Logs dominate. Get aggressive about log levels (INFO not DEBUG in production) and compress aggressively. Cold storage tiering after 14 days is standard.
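The table is simple arithmetic, and it is worth rerunning with your own numbers. The sketch below assumes decimal units (1 TB = 1000 GB) and reads "~30%" as the fraction of raw log volume that remains after compression. Both are assumptions, and `retained_tb` is a hypothetical helper.

```python
def retained_tb(daily_gb: float, retention_days: int,
                stored_fraction: float = 1.0) -> float:
    """Steady-state footprint in TB for a given daily volume and retention."""
    return daily_gb * retention_days * stored_fraction / 1000


RETENTION_DAYS = 90

# Logs at 1-3 TB/day, raw and at an assumed 30% stored fraction.
logs_raw_tb = [retained_tb(gb, RETENTION_DAYS) for gb in (1000, 3000)]
logs_stored_tb = [retained_tb(gb, RETENTION_DAYS, 0.3) for gb in (1000, 3000)]

# Metrics at 50-100 GB/day.
metrics_tb = [retained_tb(gb, RETENTION_DAYS) for gb in (50, 100)]

print(logs_raw_tb, logs_stored_tb, metrics_tb)
```

The metrics line works out to 4.5 to 9 TB, which the table rounds to 5-10 TB. Even compressed, logs remain several times larger than metrics, which is why log level policy matters more than any storage tuning.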
Replacing legacy NMS
The migration that works:
- Stand up Prometheus/Grafana alongside the legacy NMS. Both run.
- Build the top 20 dashboards the NOC actually uses on the new stack.
- Build alerting rules in Alertmanager mirroring the legacy thresholds.
- Run shadow mode for 60 days. Both alert. Compare false positive/negative rates.
- Cut over alerting. Decommission legacy NMS in two phases: alerting first, dashboards second.
The failure mode is trying to replicate the legacy NMS dashboard library 1:1. Most of those dashboards were never used. Audit usage logs first.
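Scoring the shadow period does not need tooling beyond a set comparison. The sketch below assumes you can export, for each stack, the set of incident keys it alerted on, plus a reviewed list of incidents that really happened. The function and the key format are hypothetical.

```python
def shadow_report(legacy_fired: set[str], new_fired: set[str],
                  real_incidents: set[str]) -> dict[str, dict[str, int]]:
    """Count false positives and false negatives for each alerting stack."""
    def score(fired: set[str]) -> dict[str, int]:
        return {
            # Alerted, but no real incident behind it.
            "false_positives": len(fired - real_incidents),
            # Real incident that the stack never alerted on.
            "false_negatives": len(real_incidents - fired),
        }
    return {"legacy": score(legacy_fired), "new": score(new_fired)}


real = {"upf-edge-03/down", "smf-1/pfcp-timeouts", "nrf-2/registration-drop"}
legacy = {"upf-edge-03/down", "smf-1/pfcp-timeouts", "amf-1/cpu", "amf-2/cpu"}
new = {"upf-edge-03/down", "smf-1/pfcp-timeouts", "nrf-2/registration-drop",
       "amf-1/cpu"}

report = shadow_report(legacy, new, real)
print(report)
```

Cut over only when the new stack is at least as good on both counts. A lower false positive rate that comes with missed incidents is not an improvement.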
Alertmanager: routing for telecom
NOCs have escalation hierarchies. Alertmanager routes by labels:
```yaml
route:
  receiver: noc-tier1
  group_by: [alertname, region, nf_type]
  routes:
    - matchers: [severity="critical", nf_type=~"upf|smf|amf"]
      receiver: noc-tier2-pager
      continue: true
    - matchers: [region="eu-west-1"]
      receiver: eu-noc
```
group_by collapses 50 alerts about one failed UPF into one notification. This is the difference between "actionable" and "page storm."
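To see both mechanisms at once, here is a small Python model of the route tree above: grouping by the three labels, and routing with `continue: true` on the first child route. It is a sketch of the semantics under simplifying assumptions (grouping is really applied per route), and the function names are hypothetical.

```python
import re
from collections import defaultdict

GROUP_BY = ("alertname", "region", "nf_type")


def group_alerts(alerts: list[dict[str, str]]) -> dict[tuple, list[dict[str, str]]]:
    """Bucket alerts by their group_by label values; one bucket, one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(label, "") for label in GROUP_BY)].append(alert)
    return dict(groups)


def receivers_for(labels: dict[str, str]) -> list[str]:
    """Receivers an alert reaches in the two-route tree."""
    matched = []
    # Alertmanager regex matchers are anchored, which re.fullmatch mirrors.
    if labels.get("severity") == "critical" and re.fullmatch(
            "upf|smf|amf", labels.get("nf_type", "")):
        matched.append("noc-tier2-pager")  # continue: true, keep evaluating
    if labels.get("region") == "eu-west-1":
        matched.append("eu-noc")
    # No child route matched: the root receiver handles the alert.
    return matched or ["noc-tier1"]


# 50 per-pod alerts caused by one failed UPF.
storm = [{"alertname": "UpfDown", "region": "eu-west-1", "nf_type": "upf",
          "severity": "critical", "pod": f"upf-edge-03-{i}"} for i in range(50)]

groups = group_alerts(storm)
print(len(groups), receivers_for(storm[0]))
```

Because pod is not in group_by, the 50 alerts collapse into a single group, and `continue: true` means the tier-2 pager and the EU NOC both receive it.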
> The LGTM stack is not a telecom product, but it has eaten telecom observability because it is honest about cost and because correlation across metrics, traces, and logs is the only way to debug a cloud-native 5G core.
If your NOC dashboard still opens in Internet Explorer, the cost of staying there is now higher than the cost of switching.