Automation in high-risk environments: when the margin for error is almost zero
In the early 2010s, Netflix popularized what became known as Chaos Engineering with its Chaos Monkey tool: deliberately injecting failures into production systems to find weaknesses before real failures do. The practice became influential across the software industry and sparked enthusiasm among infrastructure operations teams.
A few years later, a NOC engineer at a Latin American ISP with 80,000 subscribers read about Chaos Engineering and got excited. They proposed running “controlled outage” exercises to test redundancy mechanisms.
The proposal was rejected. Not because of resistance to change, but because the context was radically different: at Netflix, a failure affects video playback. At an ISP, a failure can interrupt service for hospitals, alarm systems, and businesses that need connectivity to bill their customers.
That is the central point: automation and resilience philosophies that work in low-risk environments cannot be copied straight into production ISP networks. They must be adapted, and that adaptation requires changing how teams think about automation, not only which tools they use.
Risk asymmetry in ISP networks
In software, reversibility is high. A bad code deploy can usually be rolled back in minutes. The blast radius of a mistake in a microservice is often limited to a percentage of users, and with feature flags or circuit breakers, containment is quick.
In ISP networks, the asymmetry is different:
A misapplied configuration change can interrupt service for thousands of users in seconds, and rollback can take minutes that feel like hours.
A mistake in a routing policy can propagate incorrect routes to transit peers before the team notices, causing incidents involving third parties outside the organization’s control.
Network infrastructure is, in many cases, real critical infrastructure: hospitals, emergency services, logistics companies, and payment systems depend on connectivity the ISP provides.
This asymmetry does not mean you should not automate. It means the philosophy used to design automation systems must reflect that asymmetry from the start.
Philosophies that need adaptation
“Move fast and break things”
This startup software mantra is the exact opposite of what production ISP networks need. But the more sophisticated version —“iterate fast, fail fast, learn fast”— can apply, with one important adjustment:
“Fail fast” in networks cannot mean in production.
In software, production itself can serve as the feedback loop. In ISP networks, the feedback loop must run through these stages:
- Lab that mirrors production → fail here, learn here
- Maintenance window at low-impact hours → validate in production with controlled risk
- Gradual rollout with active monitoring → expand only if early segments succeed
The pace of iteration in network automation can be fast — but production iterations must be low-impact by design, not by luck.
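The gradual-rollout stage can be sketched as follows. This is a minimal illustration, not a reference implementation: `staged_rollout`, `apply_change`, and `segment_healthy` are hypothetical names, and the health check stands in for whatever monitoring the team already trusts.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    devices: Sequence[str],
    stage_sizes: Iterable[int],
    apply_change: Callable[[str], None],
    segment_healthy: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Apply a change in successive segments, stopping at the first unhealthy one.

    Returns the devices that were changed. Devices beyond the listed stages
    are left untouched, so expanding further is always an explicit decision.
    """
    changed: List[str] = []
    position = 0
    for size in stage_sizes:
        segment = devices[position:position + size]
        if not segment:
            break  # every device has already been covered
        for device in segment:
            apply_change(device)
            changed.append(device)
        if not segment_healthy(segment):
            break  # do not expand past a segment that failed its check
        position += len(segment)
    return changed


if __name__ == "__main__":
    fleet = [f"r{i}" for i in range(10)]
    # Hypothetical health check: the segment containing "r1" reports a problem.
    result = staged_rollout(fleet, [1, 2, 4, 8], lambda d: None,
                            lambda seg: "r1" not in seg)
    print(result)  # ['r0', 'r1', 'r2']
```

Starting with a segment of one device keeps the first production iteration low-impact by construction: a failure there never reaches the larger segments.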
“Automate first, validate later”
In low-risk environments, automating first and validating later can seem rational: ship something imperfect and fix it from real feedback.
In ISP networks, that inverts the correct order. Network automation must be validated before it reaches production, not in production.
That implies a level of testing rigor uncommon in early network automation projects:
- Functional lab testing: the agent does what it should under normal conditions
- Edge-case testing: what happens when the device does not respond, when the SSH session drops mid-change, when post-change validation sees an unexpected state
- Interoperability testing: the agent works with every firmware version and every device model in production
- Rollback testing: the rollback mechanism actually reverts the change under the conditions where it is needed
The temptation to skip testing because “the lab does not exactly mirror production” is real. The right answer is to improve the lab, not skip testing.
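As an illustration of the edge-case and rollback tests listed above, the sketch below simulates a session that drops mid-change and checks that the device returns to its previous configuration. `FakeDevice` and `apply_with_rollback` are hypothetical; a real test would run against lab hardware or an emulator, and a real rollback would need a fresh session or a device-side mechanism rather than a direct assignment.

```python
from typing import Dict, Optional


class FakeDevice:
    """In-memory stand-in for a lab device (hypothetical, for illustration)."""

    def __init__(self, config: Dict[str, str], fail_after: Optional[int] = None):
        self.config = dict(config)
        self.fail_after = fail_after  # successful writes before the session "drops"
        self._writes = 0

    def set(self, key: str, value: str) -> None:
        if self.fail_after is not None and self._writes >= self.fail_after:
            raise ConnectionError("session dropped mid-change")
        self.config[key] = value
        self._writes += 1


def apply_with_rollback(device: FakeDevice, changes: Dict[str, str]) -> bool:
    """Apply changes; on a dropped session, restore the pre-change snapshot."""
    snapshot = dict(device.config)
    try:
        for key, value in changes.items():
            device.set(key, value)
    except ConnectionError:
        device.config = snapshot  # simulated rollback to the saved state
        return False
    return True


def test_change_applies_under_normal_conditions() -> None:
    device = FakeDevice({"mtu": "1500", "description": "uplink"})
    assert apply_with_rollback(device, {"mtu": "9000"}) is True
    assert device.config["mtu"] == "9000"


def test_rollback_restores_config_when_session_drops() -> None:
    original = {"mtu": "1500", "description": "uplink"}
    device = FakeDevice(original, fail_after=1)  # second write fails
    changes = {"mtu": "9000", "description": "uplink-new"}
    assert apply_with_rollback(device, changes) is False
    assert device.config == original  # no partial change left behind


if __name__ == "__main__":
    test_change_applies_under_normal_conditions()
    test_rollback_restores_config_when_session_drops()
    print("all tests passed")
```

The second test is the important one: it exercises the rollback under the failure condition where it is needed, not only on the happy path.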
“Trust the system, not people”
This SRE/DevOps idea means replacing processes that depend on correct human intervention with systems that guarantee correct outcomes regardless of the operator.
In ISP networks, that is right in principle but needs a caveat: the automation system must keep a clear, always-available human override.
A well-designed network automation agent should:
- Have a pause or cancel mechanism any engineer can trigger immediately
- Require explicit human approval for high-impact changes (routing policy changes, core device configuration changes)
- Generate alerts a human operator sees before the agent completes critical changes
Automation reduces dependence on human intervention for routine tasks. It does not remove human judgment for high-impact decisions.
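A minimal sketch of those three properties, assuming a hypothetical `ChangeRunner` class: a pause switch any engineer can trigger, an approval gate for high-impact change types, and an alert sent before such a change is applied. The change categories and the `notify` callback are illustrative assumptions.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed categories for illustration; each team defines its own list.
HIGH_IMPACT_KINDS = {"routing_policy", "core_device_config"}


@dataclass
class Change:
    device: str
    kind: str
    description: str


class ChangeRunner:
    """Runs changes while honouring a human pause switch and approval gate."""

    def __init__(self, notify: Callable[[str], None]):
        self._paused = threading.Event()  # safe to set from another thread
        self._notify = notify

    def pause(self) -> None:
        self._paused.set()

    def resume(self) -> None:
        self._paused.clear()

    def run(
        self,
        change: Change,
        apply: Callable[[Change], None],
        approved_by: Optional[str] = None,
    ) -> str:
        if self._paused.is_set():
            return "paused"  # nothing is applied while paused
        if change.kind in HIGH_IMPACT_KINDS:
            # Alert a human before the change is applied, not after.
            self._notify(f"high-impact change on {change.device}: {change.description}")
            if approved_by is None:
                return "awaiting_approval"
        apply(change)
        return "applied"


if __name__ == "__main__":
    runner = ChangeRunner(notify=print)
    risky = Change("core-1", "routing_policy", "change export policy")
    print(runner.run(risky, apply=lambda c: None))  # awaiting_approval
```

Routine changes pass straight through, which keeps the automation useful, while the pause check runs first so the override always wins.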
Principles that do work in high-risk environments
Principle 1: Small, frequent, verifiable changes
The biggest risk in a network change is not the complexity of the operation itself. It is the size of the delta between the previous and the new state. A change that touches 80 parameters on 200 devices at once is far harder to diagnose if something goes wrong than one that touches 3 parameters on 10 devices.
Network automation makes small, frequent changes efficient. Using that capability to shrink each individual change is a risk-reduction strategy, not extra work.
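One rough way to compare the two changes is to count the parameter-device pairs an engineer might have to inspect after a failure. The helper below is only a back-of-the-envelope measure (the name `change_surface` is hypothetical), and it ignores interactions between parameters, which make large changes harder still.

```python
def change_surface(parameters: int, devices: int) -> int:
    """Number of (parameter, device) pairs touched by one change."""
    return parameters * devices


large = change_surface(80, 200)
small = change_surface(3, 10)
print(large, small, large // small)  # 16000 30 533
```

By this measure the large change leaves roughly 500 times more places to look when something breaks.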
Principle 2: Validation at multiple points in the change
A well-designed network change validates at three moments:
Pre-change: the network state before the change is as expected. Conditions to run the change are present. No active incidents affect the involved devices.
During change: the agent watches impact indicators while applying configuration. If indicators cross a threshold, the agent stops the rollout and waits for confirmation.
Post-change: the network state after the change is as expected. Affected services still work. No new incidents were caused by the change.
This three-point validation turns the change from “apply and pray” into a verifiable process with objective success criteria.
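The three validation points can be sketched as a single function. This is an assumption-laden outline: `run_validated_change` and its callbacks are hypothetical names, and each callback would wrap real checks (incident status, impact thresholds, service probes) in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ChangeResult:
    status: str          # "aborted_pre_check", "halted_awaiting_confirmation",
                         # "failed_post_check", or "ok"
    applied: List[str]   # devices the change reached before stopping


def run_validated_change(
    devices: Sequence[str],
    pre_check: Callable[[], bool],
    apply_change: Callable[[str], None],
    impact_ok: Callable[[], bool],
    post_check: Callable[[], bool],
) -> ChangeResult:
    # Pre-change: expected state, conditions present, no active incidents.
    if not pre_check():
        return ChangeResult("aborted_pre_check", [])

    applied: List[str] = []
    for device in devices:
        apply_change(device)
        applied.append(device)
        # During change: stop the rollout as soon as an indicator crosses
        # its threshold and hand control back to a human.
        if not impact_ok():
            return ChangeResult("halted_awaiting_confirmation", applied)

    # Post-change: expected state, services working, no new incidents.
    if not post_check():
        return ChangeResult("failed_post_check", applied)
    return ChangeResult("ok", applied)


if __name__ == "__main__":
    result = run_validated_change(
        ["edge-1", "edge-2"],
        pre_check=lambda: True,
        apply_change=lambda d: None,
        impact_ok=lambda: True,
        post_check=lambda: True,
    )
    print(result.status)  # ok
```

Returning an explicit status and the list of touched devices gives the operator an objective success criterion and a precise scope for any rollback.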
Principle 3: Blast radius limited by design
Every network automation system should have explicit limits on how many devices one run can affect, regardless of what is requested.
These limits are not user preferences — they are system constraints:
# Hard-coded limits in the automation agent
max_devices_per_run: 50
max_concurrent_changes: 5
max_percentage_of_fleet: 10
require_approval_above: 20
If an operator wants to apply a change to all 600 devices, the system must require multiple cycles with validation between each, not one massive run.
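A sketch of how an agent might enforce those limits is shown below. The values come from the configuration above; the function names are hypothetical. It assumes that `require_approval_above` is a device count and that the percentage limit applies per run. `max_concurrent_changes` would govern parallelism inside a run and is not exercised here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BlastRadiusLimits:
    max_devices_per_run: int = 50
    max_concurrent_changes: int = 5
    max_percentage_of_fleet: int = 10
    require_approval_above: int = 20  # assumed to be a device count


def effective_run_size(limits: BlastRadiusLimits, fleet_size: int) -> int:
    """Largest run allowed: the stricter of the absolute and percentage caps."""
    by_percentage = fleet_size * limits.max_percentage_of_fleet // 100
    return max(1, min(limits.max_devices_per_run, by_percentage))


def plan_runs(
    targets: Sequence[str],
    fleet_size: int,
    limits: BlastRadiusLimits = BlastRadiusLimits(),
) -> List[List[str]]:
    """Split a request into runs that each respect the limits."""
    size = effective_run_size(limits, fleet_size)
    return [list(targets[i:i + size]) for i in range(0, len(targets), size)]


def needs_approval(
    run: Sequence[str], limits: BlastRadiusLimits = BlastRadiusLimits()
) -> bool:
    return len(run) > limits.require_approval_above


if __name__ == "__main__":
    fleet = [f"dev{i}" for i in range(600)]
    runs = plan_runs(fleet, fleet_size=len(fleet))
    print(len(runs), len(runs[0]), needs_approval(runs[0]))  # 12 50 True
```

With these values, a request for all 600 devices becomes 12 runs of 50, and each run is large enough to require approval. Validation between runs would sit in the loop that executes the plan.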
Principle 4: Observability before automation
You cannot automate reliably what you cannot observe. Before automating any process on a high-risk ISP network, the team needs clear visibility into:
- Routing session state (BGP, IS-IS, OSPF) in real time
- Per-link traffic indicators (utilization, drops, errors)
- Configuration change logs from all devices
- Relevant protocol events (BGP state changes, interface flaps, IGP convergence)
If an automated change hurts any of these indicators, the team must know in seconds, not minutes. Detection latency is the critical factor in ISP networks: an incident that takes 10 minutes to detect has already had significant user impact.
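One way to turn that visibility into a fast signal is to compare a snapshot taken before the change with one taken right after it. The sketch below assumes a hypothetical `NetworkSnapshot` structure and an arbitrary 90% utilization threshold; real indicators would come from the team's monitoring stack.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NetworkSnapshot:
    bgp_sessions_established: int
    interface_error_count: int
    link_utilization_pct: Dict[str, float]


def find_regressions(
    before: NetworkSnapshot,
    after: NetworkSnapshot,
    max_utilization_pct: float = 90.0,  # assumed threshold, tune per network
) -> List[str]:
    """Return human-readable problems that appeared between two snapshots."""
    problems: List[str] = []
    lost = before.bgp_sessions_established - after.bgp_sessions_established
    if lost > 0:
        problems.append(f"{lost} BGP session(s) lost")
    if after.interface_error_count > before.interface_error_count:
        problems.append("interface errors increased")
    for link, pct in sorted(after.link_utilization_pct.items()):
        if pct > max_utilization_pct:
            problems.append(f"{link} utilization at {pct:.0f}%")
    return problems


if __name__ == "__main__":
    before = NetworkSnapshot(12, 40, {"core-a": 55.0, "core-b": 60.0})
    after = NetworkSnapshot(11, 40, {"core-a": 95.0, "core-b": 20.0})
    print(find_regressions(before, after))
    # ['1 BGP session(s) lost', 'core-a utilization at 95%']
```

If the snapshots cannot be collected within seconds, that gap is itself the finding: the observability work comes before the automation.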
The team profile that can run high-risk automation
Automation does not reduce the need for engineers with deep technical judgment. It raises it.
A high-risk ISP network automation system must be designed, maintained, and operated by engineers who deeply understand the network protocols the system manipulates. Automation does not hide network complexity — it exposes it differently.
The right profile combines:
- Networking knowledge: what each configuration change actually does and its control- and data-plane implications
- Software engineering judgment: systems that handle errors correctly, have tests, and are maintainable
- Operational safety culture: no shortcuts that trade reliability for convenience
That profile is uncommon, which is why high-risk ISP network automation often benefits from working with external specialists who have experience in this domain.
How Ayuda.LA can help
At Ayuda.LA we design and implement network automation for ISPs in Latin America, with high-risk criteria built in from the design stage. We do not believe in automation at any cost. We believe in automation that reduces operational risk instead of increasing it.
If you are evaluating a network automation project and want to ensure it fits your production environment, we can do a design review before you move forward.
Automation that fails in production is not better than no automation. It is worse: it creates the illusion of control without the reality.