Proactive monitoring with Zabbix for ISPs: what your NOC needs to see before the subscriber calls
In the daily operation of an ISP, there are two types of problems: those your team detects first, and those the subscriber detects first. The difference between those two scenarios is the difference between a managed incident and a crisis. Between a customer who renews and one who cancels.
Reactive monitoring (wait for the ticket, handle the complaint, then investigate) was acceptable when subscribers had few alternatives. Today it is not. And the good news is that the tools for proactive monitoring are accessible, mature, and entirely viable for ISPs of any size in Latin America.
Zabbix is the central platform of that stack. This guide covers what to monitor, how to structure it, and how to integrate it with Grafana so your NOC has real visibility before the phone rings.
Why reactive monitoring destroys the subscriber experience
Reactive monitoring has a structural problem: the subscriber is always the first detector. When an OLT loses a card, when an uplink saturates at 98%, when a BGP session drops at 3 in the morning, the first visible signal for the ISP is the volume of support calls.
That model has cascading consequences. The time between the failure and its detection (MTTD) can be 15, 30, or 60 minutes. The time to resolution (MTTR) only starts running when the NOC receives the human alert. Meanwhile, the subscriber has already formed their opinion about service quality.
Beyond the experience, in environments with corporate SLAs every minute of undetected failure counts against the contractual uptime commitment. Late detection is not just a perception problem; it can have a direct economic cost.
Proactive monitoring inverts that cycle. The system detects degradation (not total failure, but degradation) and alerts the team before the subscriber perceives it as a problem.
What an ISP must monitor: the list that cannot be missing
The most common mistake when starting with Zabbix is monitoring only what is easy to add (ping to device management interfaces) and assuming that is enough. It is not. There are monitoring categories that must be active to have real visibility:
Network interfaces (status and utilization)
The operational status of every critical interface (transit uplinks, inter-POP links, interfaces toward OLTs) must be monitored with differentiated alerts: one alert when the interface goes down (ifOperStatus), and another when utilization exceeds thresholds (for example, alert at 80% capacity, critical at 95%). Link saturation that reaches 100% for hours degrades the experience without generating a visible outage: this is exactly the type of problem the subscriber feels and the NOC does not see without utilization monitoring.
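As a minimal sketch of the utilization check described above, the following Python computes link utilization from two octet-counter samples (for example, two readings of ifHCInOctets) and applies the 80% / 95% thresholds. The function names and sample values are illustrative assumptions, not part of any Zabbix template; in production Zabbix does this with preprocessing and trigger expressions.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps, counter_bits=64):
    """Return link utilization (%) from two octet-counter samples."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return 100.0 * bits_per_second / speed_bps


def classify_utilization(percent, warn=80.0, crit=95.0):
    """Map a utilization percentage to the alert levels used in the text."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


# Illustrative values: a 10 Gbps uplink polled every 60 seconds.
pct = utilization_percent(0, 63_750_000_000, 60, 10_000_000_000)
print(round(pct, 1), classify_utilization(pct))  # 85.0 warning
```

The wrap-around branch matters mostly for 32-bit counters, which can roll over within a single polling interval on fast links; that is one reason to prefer the 64-bit counters where the device exposes them.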
BGP sessions
Every eBGP and iBGP session must have an item in Zabbix with an immediate alert on state change. A downed BGP session can mean loss of transit, peers, or routing redundancy. The acceptable detection time for a BGP session is seconds, not minutes. (On the BGP configuration errors that amplify these problems, see our article on BGP for ISPs.)
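To illustrate the state-change alert, here is a small sketch that interprets the integer values of bgpPeerState from the standard BGP4-MIB and lists every peer that is not established. The peer addresses are documentation examples, and vendor-specific MIBs may encode state differently, so treat the mapping as applying to the standard MIB only.

```python
# bgpPeerState values as defined in the standard BGP4-MIB.
BGP_PEER_STATES = {
    1: "idle",
    2: "connect",
    3: "active",
    4: "opensent",
    5: "openconfirm",
    6: "established",
}


def sessions_needing_alert(peer_states):
    """Return (peer, state_name) for every peer that is not established.

    peer_states maps a peer address to the integer polled from bgpPeerState.
    """
    problems = []
    for peer, code in sorted(peer_states.items()):
        if code != 6:
            name = BGP_PEER_STATES.get(code, f"unknown({code})")
            problems.append((peer, name))
    return problems


# Hypothetical poll result for two peers.
polled = {"192.0.2.1": 6, "192.0.2.2": 3}
print(sessions_needing_alert(polled))  # [('192.0.2.2', 'active')]
```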
OSPF and IS-IS adjacencies
In networks with IGP running in the core or distribution, the loss of an OSPF or IS-IS adjacency can cause non-optimal convergence or routing gaps that are not evident until traffic starts dropping. Monitor the number of active adjacencies per router and alert on any reduction.
CPU and memory on network equipment
A router with CPU at 95% is about to start dropping control packets, degrading BGP, or simply locking up. Monitor CPU and memory on all core and distribution equipment, with thresholds that trigger preventive alerts before reaching the limit.
OLTs and PON port status
In GPON/EPON networks, the status of line cards and PON ports must be monitored in real time. The loss of an OLT card can affect dozens or hundreds of subscribers simultaneously. Error counters (FEC corrections, BIP errors) are early indicators of optical problems before the ONT loses signal completely.
RADIUS and authentication
If the RADIUS server is not responding, subscribers cannot authenticate when they restart their connection. In networks with many simultaneous outages and reconnections (a power cut in a neighborhood, for example), RADIUS saturation or failure can turn a minor incident into a massive problem. Monitor RADIUS availability, authentication response time, and authentication error rate.
Loop latency (internal round-trip time)
Internal latency between owned POPs is a network health indicator that rarely fails suddenly but degrades gradually. Monitoring RTT between key network points with historical thresholds allows detecting latency increases that have not yet manifested as a failure but anticipate a capacity or routing problem.
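One possible way to express a "historical threshold" is to compare the current RTT against a baseline built from recent history. The sketch below flags a measurement that exceeds the historical mean by more than three standard deviations; the three-sigma rule, the minimum spread, and the sample values are assumptions chosen for illustration, not a recommendation from Zabbix.

```python
from statistics import mean, stdev


def rtt_is_anomalous(history_ms, current_ms, sigmas=3.0, min_samples=10, min_spread_ms=0.5):
    """Return True when current_ms is well above the historical baseline."""
    if len(history_ms) < min_samples:
        # Not enough history to build a meaningful baseline.
        return False
    baseline = mean(history_ms)
    # A floor on the spread avoids alerting on tiny jitter over a very stable link.
    spread = max(stdev(history_ms), min_spread_ms)
    return current_ms > baseline + sigmas * spread


# Hypothetical RTT history between two POPs, in milliseconds.
history = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9]
print(rtt_is_anomalous(history, 10.4))  # False
print(rtt_is_anomalous(history, 14.0))  # True
```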
Transit uplinks and total utilization
Transit uplink saturation directly impacts the subscriber’s experience with internet destinations. Monitor utilization both in total and per direction (in/out), with growth projections if possible.
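A growth projection can be as simple as fitting a straight line to daily peak utilization and asking when it crosses the alert threshold. The sketch below does this with an ordinary least-squares fit; it assumes roughly linear growth, and the function name and sample data are hypothetical.

```python
def days_until_threshold(daily_peak_percent, threshold=80.0):
    """Estimate days from the last sample until the fitted trend hits threshold.

    Returns None when the trend is flat or decreasing.
    """
    n = len(daily_peak_percent)
    if n < 2:
        raise ValueError("at least two daily samples are required")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_peak_percent) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_peak_percent))
    slope = sxy / sxx  # percentage points per day
    if slope <= 0:
        return None
    fitted_last = y_mean + slope * ((n - 1) - x_mean)
    if fitted_last >= threshold:
        return 0.0
    return (threshold - fitted_last) / slope


# Hypothetical daily peaks growing one percentage point per day.
print(days_until_threshold([60, 61, 62, 63, 64]))  # 16.0
```

Real traffic usually has weekly seasonality, so feeding in one value per week (for example, the weekly busy-hour peak) tends to give a steadier trend than raw daily values.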
Zabbix as a platform: where to start
Zabbix is an open source monitoring platform with over 20 years of active development. For ISPs it is particularly well suited for three reasons: native SNMP support, the ability to scale to tens of thousands of items, and a growing collection of official templates for the most common vendors.
Official templates for ISP vendors
Zabbix maintains a template library covering the main vendors:
- Huawei: templates for VRP (router/switch), Huawei CE series (datacenter switches), SmartAX (MA5600/MA5800 OLTs). The official Huawei VRP template covers interfaces, BGP, OSPF, CPU, and memory via SNMP.
- MikroTik: official template covering RouterOS via SNMP and via API. Includes interfaces, BGP sessions (if RouterOS >= 7.x), CPU, memory, and routing tables.
- Cisco IOS/IOS-XE/IOS-XR: templates covering interfaces, BGP, OSPF, IS-IS, CPU, and memory via SNMP. IOS-XR devices additionally offer streaming telemetry over gRPC/gNMI, which needs an intermediary collector before the data can reach Zabbix.
These templates are a starting point, not a final configuration. In production you always need to adjust thresholds, disable irrelevant items, and add custom items for metrics specific to each network.
SNMP polling: the foundation of network monitoring
Most network metrics are obtained via SNMP v2c or v3. A basic SNMP v2c setup, showing the Huawei device side and the matching Zabbix host settings:
# On the Huawei device (VRP)
snmp-agent
snmp-agent community read COMUNIDAD-LECTURA
snmp-agent sys-info version v2c v3
snmp-agent target-host trap address udp-domain 10.0.0.100 params securityname COMUNIDAD-LECTURA
# In Zabbix (host configuration)
SNMP community: {$SNMP_COMMUNITY}
SNMP version: SNMPv2
Port: 161
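The same host settings can be created through the Zabbix JSON-RPC API rather than the web interface. The sketch below only builds the host.create request body; the field names follow the host interface object as documented for Zabbix 5.0 and later, while the host name, IP address, group ID, and template ID are hypothetical placeholders. Authentication and the HTTP call are deliberately left out, since they differ between Zabbix versions.

```python
import json


def build_host_create_request(host_name, ip, group_id, template_id, request_id=1):
    """Build a JSON-RPC body for host.create with one SNMPv2 interface."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": host_name,
            "interfaces": [
                {
                    "type": 2,   # 2 = SNMP interface
                    "main": 1,   # default interface of this type
                    "useip": 1,  # connect by IP rather than DNS name
                    "ip": ip,
                    "dns": "",
                    "port": "161",
                    "details": {
                        "version": 2,  # SNMPv2c
                        # Community resolved from a user macro, as in the text.
                        "community": "{$SNMP_COMMUNITY}",
                    },
                }
            ],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
        },
        "id": request_id,
    }


# Hypothetical identifiers; look up the real IDs in your own Zabbix instance.
body = build_host_create_request("core-rtr-01", "10.0.0.1", "2", "10001")
print(json.dumps(body, indent=2))
```

Keeping the community string in the {$SNMP_COMMUNITY} macro means it can be changed per host or per template without touching the interface definition.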
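For networks with stricter security requirements, SNMPv3 with SHA authentication and AES privacy is the right choice. The exact command syntax differs between VRP releases, so treat the following as an outline and check the command reference for your version: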
snmp-agent usm-user v3 ZABBIX-USER
authentication-mode sha PASSPHRASE-AUTH
privacy-mode aes128 PASSPHRASE-PRIV
ICMP for basic availability
In addition to SNMP polling, Zabbix should perform periodic ICMP checks to the management IP of each device. This detects total loss of access when SNMP is no longer responding. The recommended interval for critical equipment is 30 to 60 seconds.
Zabbix Agent for infrastructure servers
For servers that are part of the ISP infrastructure (RADIUS, authoritative DNS, captive portals, management servers), the Zabbix agent allows more granular monitoring: disk, specific processes, logs, services. Install the agent on these servers and use the corresponding Linux templates.
SNMP Traps vs active polling: when to use each
This is one of the questions that most frequently causes confusion in NOC teams structuring their monitoring.
SNMP Traps are messages that network equipment actively sends to the monitoring server when a specific event occurs: an interface goes down, a CPU threshold is exceeded, an OSPF adjacency drops. They are immediate: the time between the event and the notification can be milliseconds. Their disadvantage is that they depend on the device being alive and configured to send the trap, and they travel over UDP without acknowledgment, so a trap lost in transit is not retried. If the device fails catastrophically, it may not manage to send the trap at all.
Active polling is the inverse cycle: Zabbix periodically queries the device (every 1, 5, or 10 minutes depending on the item) and records the value. It detects both failures and gradual degradations. Its disadvantage is detection latency: if a link goes down and the next poll is in 3 minutes, detection is delayed until that moment.
The correct answer for an ISP is to use both in complement:
- SNMP Traps for discrete, high-criticality events: interface state change, BGP state change, role change in high-availability protocols (VRRP, MC-LAG). These events deserve immediate notification.
- Active polling for continuous metrics: CPU utilization, interface utilization, latency, memory. Polling builds the trend history that enables detecting gradual degradations.
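The detection latency of polling is easy to quantify. Assuming a failure happens at a random moment between two polls and the first poll after the failure detects it, the average delay is half the interval and the worst case is the full interval. The short sketch below tabulates this for the intervals mentioned above; it ignores any extra delay from trigger conditions that require several consecutive bad samples.

```python
def polling_detection_delay(interval_s):
    """Return (average, worst_case) detection delay in seconds for one poll interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return interval_s / 2, float(interval_s)


for interval in (60, 300, 600):
    avg, worst = polling_detection_delay(interval)
    print(f"poll every {interval:>3}s: average {avg:.0f}s, worst case {worst:.0f}s")
```

For a 5-minute item this means a worst case of 300 seconds before the failure is even visible, which is why discrete high-criticality events are better served by traps.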
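In Zabbix, trap reception is configured by enabling the SNMP trapper process on the server (StartSNMPTrapper=1 in zabbix_server.conf, with SNMPTrapperFile pointing at the trap file) and having snmptrapd hand incoming traps to the zabbix_trap_receiver.pl script shipped with the Zabbix sources. The script path below is an example and depends on where you installed it:
# /etc/snmp/snmptrapd.conf
authCommunity execute COMUNIDAD-TRAP
perl do "/usr/bin/zabbix_trap_receiver.pl";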
Integration with Grafana for operational NOC dashboards
Zabbix has its own visualizations, but for NOC dashboards designed to be viewed on large screens during 8-hour shifts, Grafana offers a significantly better experience.
The integration is done via the Zabbix plugin for Grafana (originally developed by Alexander Zobnin, which is where the plugin ID comes from):
grafana-cli plugins install alexanderzobnin-zabbix-app
Once the datasource is configured pointing to the Zabbix API, it is possible to build dashboards that combine:
- Real-time network map: status of each critical link with color by state (green/yellow/red based on utilization or availability).
- Active alerts table: sorted by severity and start time, with information on which device and which metric triggered the alert.
- Uplink utilization graphs: in real time and with a 24-hour historical window to identify patterns.
- BGP sessions panel: status of each session, uptime, prefixes received/announced.
- Availability by geographic zone: if devices are tagged by zone in Zabbix, a regional map quickly shows where the problem is.
The goal of the NOC dashboard is not to show all available information: it is to allow the on-duty operator to identify in seconds whether something requires action, and what that something is.
Escalated alerts: NOC, engineering, and night watch
Having visibility is a necessary but not sufficient condition. Visibility must translate into action, and for that the alerting scheme must be designed for the actual operational context.
An effective escalation model for ISPs has three levels:
Level 1 — NOC (dashboard alert and immediate notification)
All critical events reach the NOC dashboard and generate a notification via messaging channel (Telegram, Slack, email). The NOC operator has the first level of response: verify, attempt to resolve with documented procedures, or escalate.
Level 2 — Engineering (if NOC does not resolve in N minutes)
If the alert is not acknowledged or resolved within the defined time (for example, 15 minutes for a downed uplink), Zabbix automatically escalates the notification to the network engineering team. Zabbix supports native escalation via “escalation steps” in action configuration.
# Example action with escalation in Zabbix
Action: UPLINK-DOWN
Step 1 (0 min): Notify NOC-Team via Telegram
Step 2 (15 min, if unresolved): Notify Network-Engineering via Telegram + call
Step 3 (30 min, if unresolved): Notify Technical-Management
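To reason about who will have been notified at a given point in an incident, the escalation above can be modeled as a simple timeline. This is only an illustration of the logic using the group names and delays from the example; Zabbix evaluates the steps itself and stops escalating once the problem is resolved.

```python
# (minutes after the problem starts, group notified), taken from the example action.
ESCALATION_STEPS = [
    (0, "NOC-Team"),
    (15, "Network-Engineering"),
    (30, "Technical-Management"),
]


def groups_notified(minutes_unresolved, steps=ESCALATION_STEPS):
    """Return the groups notified so far for a problem still open after the given time."""
    if minutes_unresolved < 0:
        raise ValueError("minutes_unresolved cannot be negative")
    return [group for start, group in steps if minutes_unresolved >= start]


print(groups_notified(0))   # ['NOC-Team']
print(groups_notified(20))  # ['NOC-Team', 'Network-Engineering']
print(groups_notified(45))  # ['NOC-Team', 'Network-Engineering', 'Technical-Management']
```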
Level 3 — Night watch (outside office hours)
For high-severity incidents outside NOC hours, escalation must include phone contact or PagerDuty. Integrating Zabbix with on-call management services (PagerDuty, OpsGenie, or a custom scheme via webhook) ensures that a 3 AM incident does not wait until 9 AM to be addressed.
The severity criteria must be agreed upon with the team before configuring it: not every problem justifies waking someone up. A management interface down at 2 AM can probably wait. A main transit uplink down cannot.
Key metrics: the numbers that matter
A well-configured monitoring system produces metrics that enable both real-time detection and trend analysis. Every ISP should be able to report the following metrics with data:
Service availability
Percentage of time that critical links and services were operational during the period. Zabbix can calculate this automatically once those links and services are modeled as services and an SLA is configured for them (the Services and SLA sections in Zabbix 6.0 and later). The target should be defined based on contractual commitments with customers.
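The arithmetic behind an availability target is worth making explicit, because small percentages translate into very concrete downtime budgets. The sketch below assumes a 30-day reporting period and a 99.9% target purely as examples.

```python
def availability_percent(period_minutes, downtime_minutes):
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0 or not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("invalid period or downtime")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def downtime_budget_minutes(period_minutes, target_percent):
    """Maximum downtime allowed in the period while still meeting the target."""
    return period_minutes * (100.0 - target_percent) / 100.0


MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day period

print(round(downtime_budget_minutes(MONTH_MINUTES, 99.9), 1))  # 43.2
print(round(availability_percent(MONTH_MINUTES, 90), 3))       # 99.792
```

In other words, a single 90-minute outage in a month already breaks a 99.9% commitment, whose entire monthly budget is about 43 minutes.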
Average latency and 95th percentile
Average latency can hide spikes. The 95th percentile of latency (the value that 95% of measurements do not exceed) is a more honest indicator of the subscriber experience. Monitor both.
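A small worked example shows how far apart the two numbers can be. The sketch below uses the nearest-rank definition of a percentile on hypothetical latency samples; monitoring tools may interpolate slightly differently, so exact values can differ from what Zabbix or Grafana report.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical measurements: 18 samples at 10 ms and 2 spikes at 200 ms.
latencies_ms = [10] * 18 + [200] * 2

average = sum(latencies_ms) / len(latencies_ms)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(average, p95)  # 29.0 200
```

An average of 29 ms looks healthy, while the 95th percentile exposes the 200 ms spikes that subscribers actually notice.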
Packet loss
Zabbix measures packet loss in ICMP checks (the icmppingloss item key). A 1-2% loss on a link is an early indicator of congestion or an incipient physical problem. Define thresholds, for example: alert at 1%, critical at 5%.
Link saturation (peak and average)
Peak utilization (maximum in 5 minutes) and the average during peak traffic periods (busy hour) are the two numbers that determine whether capacity needs to be expanded. A link with a 40% average but 95% peaks for 2 hours a day has a capacity problem even though the average “looks fine.”
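The case described above can be reproduced with a day of 5-minute samples. In this sketch the sample data and the 90% "high water" mark are assumptions chosen to match the example: 22 hours at 35% and 2 hours at 95%.

```python
def link_saturation_summary(samples_percent, sample_minutes=5, high_water=90.0):
    """Summarize utilization samples: average, peak, and minutes at or above high_water."""
    if not samples_percent:
        raise ValueError("samples_percent must not be empty")
    minutes_high = sample_minutes * sum(1 for s in samples_percent if s >= high_water)
    return {
        "average": sum(samples_percent) / len(samples_percent),
        "peak": max(samples_percent),
        "minutes_above_high_water": minutes_high,
    }


# One day of 5-minute samples: 264 quiet samples and 24 saturated ones (2 hours).
day = [35.0] * 264 + [95.0] * 24
print(link_saturation_summary(day))
# {'average': 40.0, 'peak': 95.0, 'minutes_above_high_water': 120}
```

Reporting the minutes spent above the high-water mark alongside the average makes the capacity problem visible even when the daily average sits at 40%.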
Operational MTTD and MTTR
With Zabbix data it is possible to estimate the average detection time (the time between event start and first acknowledgment in the system is a workable proxy) and the average resolution time. These two KPIs should be in the NOC’s monthly operational report.
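As an illustration of how the two KPIs could be computed from exported event data, the sketch below averages detection and resolution times over a list of incidents. The record layout and timestamps are hypothetical, and MTTR is measured here from detection to resolution, which is one common convention among several.

```python
from datetime import datetime


def mean_minutes(intervals):
    """Average length in minutes of (start, end) datetime pairs."""
    if not intervals:
        raise ValueError("at least one interval is required")
    total_seconds = sum((end - start).total_seconds() for start, end in intervals)
    return total_seconds / 60 / len(intervals)


# Hypothetical incidents: when the fault began, when it was detected, when it was resolved.
incidents = [
    {
        "started": datetime(2024, 5, 1, 3, 0),
        "detected": datetime(2024, 5, 1, 3, 2),
        "resolved": datetime(2024, 5, 1, 3, 32),
    },
    {
        "started": datetime(2024, 5, 7, 14, 10),
        "detected": datetime(2024, 5, 7, 14, 14),
        "resolved": datetime(2024, 5, 7, 14, 34),
    },
]

mttd = mean_minutes([(i["started"], i["detected"]) for i in incidents])
mttr = mean_minutes([(i["detected"], i["resolved"]) for i in incidents])
print(f"MTTD {mttd:.1f} min, MTTR {mttr:.1f} min")  # MTTD 3.0 min, MTTR 25.0 min
```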
Notes on vendors: Huawei, MikroTik, and Cisco
Each vendor has particularities that affect how monitoring is implemented:
Huawei VRP: SNMP support is complete and stable. Huawei MIBs for BGP (hwBgp), OSPF, and IS-IS are well documented. The official Zabbix template for Huawei VRP covers most use cases. For Huawei SmartAX OLTs, PON port monitoring via SNMP (status, rx optical power, FEC errors) requires the series-specific MIBs, available on the Huawei support portal.
MikroTik RouterOS: for versions prior to 7.x, BGP monitoring via SNMP has limitations (the standard BGP4-MIB is not always fully implemented). The alternative is using the RouterOS API from Zabbix via external scripts, or upgrading to RouterOS 7.x where SNMP support for BGP is more complete. For interface and system resource monitoring, the official template works well from RouterOS 6.x.
Cisco IOS-XE / IOS-XR: SNMP support in Cisco is mature. IOS-XR adds the ability to use gRPC/gNMI for streaming telemetry, which is more efficient than polling for networks with thousands of metrics. Integrating gNMI telemetry with Zabbix requires an intermediary collector (such as Telegraf + InfluxDB, or a custom script), but it is the right direction for medium/large-scale Cisco networks.
Our field experience
At Ayuda.LA we work with NOC teams at ISPs of various sizes across Latin America. The pattern we see most frequently is not a lack of tools: it is Zabbix (or another system) installed but configured only for basic availability (ping), without coverage of the items that actually anticipate problems.
The jump from “availability monitoring” to “proactive operational monitoring” does not require replacing anything. It requires adding items, defining thresholds with operational criteria, building dashboards oriented toward the NOC operator, and agreeing on an escalation process that works at 3 in the morning.
In most cases, that work can be done in days or weeks (not months), and the impact on MTTD is immediate.
Learn more about our NOC and monitoring services for ISPs.
Want to review or structure your NOC’s monitoring?
We can evaluate your current Zabbix configuration, identify the most critical coverage gaps, and propose an alert and dashboard structure tailored to your operation. Without needing to replace what you already have working.
Have questions about Zabbix, SNMP, or ISP network monitoring? Write to us by email; we respond to every message.