From proof of concept to production: how to build a network automation agent that does not fail
The scenario repeats often in network teams: a motivated engineer builds a lab, constructs a network automation agent with Python and Ansible, runs changes on test devices, and the system works flawlessly. The team approves the PoC. They decide to take the agent to production.
Three weeks later, the agent is paused. It caused a minor incident by assuming a network state that was no longer true. The team went back to manual changes “for safety.”
This pattern is not a failure of the tool or the engineer. It is the result of not designing the agent for real production conditions from the start. The difference between a PoC that works in a lab and an agent that runs in production without failing is not in the script logic — it is in how the system handles uncertainty, errors, and unexpected states.
Why the PoC works and production does not
A network automation lab has conditions that rarely hold in production:
Known, controlled state. In the lab, the engineer knows each device’s exact state before running the agent. In production, that is not always true. A device may be mid-firmware upgrade, have an SSH session open from another operator, or carry configuration that was changed manually hours earlier without the source of truth being updated.
Stable topology. In the lab there are no active incidents, no interfaces down due to third-party maintenance, and no high-latency network segments that cause connections to time out at random.
No side effects. In the lab, if something goes wrong, you reset the virtual device and start over. In production, a half-applied command can leave a device in an inconsistent state that is hard to diagnose.
Single operator. In the lab, only the agent touches the devices. In production, engineers make manual changes, monitoring scripts open sessions, and backup systems compete for access.
The PoC does not fail because it is badly designed. It fails because it was designed for lab conditions, not production conditions.
The four problems that kill agents in production
1. No pre-validation of state
The most common error: the agent assumes the network state matches the source of truth (NetBox, a YAML file, a database) without verifying before execution.
The problem: The source of truth drifts. A device that NetBox says has an active BGP session may be down due to unrecorded maintenance. If the agent makes changes assuming that session is up, it can cause an outage.
The solution: Every agent operation must start with a pre-validation phase that reads current device state and compares it to expected state. On discrepancy, the agent must stop and alert — never continue blindly.
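def pre_validate(device, expected_state):
    # Read the live state from the device and diff it against what we expect.
    current_state = get_device_state(device)
    discrepancies = compare_states(current_state, expected_state)
    if discrepancies:
        # Stop before touching anything; the caller is responsible for alerting.
        raise PreValidationError(
            f"Unexpected state on {device}: {discrepancies}. Operation aborted."
        )
    return True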
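The pre_validate function above depends on a compare_states helper that the article does not define. A minimal sketch of one possible version follows, assuming device state is represented as a flat dictionary of key/value pairs (that representation, and the sample keys, are assumptions for illustration only). It checks only the keys the agent depends on and ignores everything else on the device.

```python
def compare_states(current_state: dict, expected_state: dict) -> dict:
    """Return {key: (expected, actual)} for every expected key that differs.

    Keys present only in current_state are ignored: the agent verifies
    what it depends on, not the full device state.
    """
    missing = object()  # sentinel, so a key that is absent is always a mismatch
    discrepancies = {}
    for key, expected in expected_state.items():
        actual = current_state.get(key, missing)
        if actual != expected:
            discrepancies[key] = (expected, None if actual is missing else actual)
    return discrepancies


# Hypothetical sample data: the source of truth expects an established BGP peer.
expected = {"bgp_peer_10.0.0.1": "Established", "Gi0/1": "up"}
current = {"bgp_peer_10.0.0.1": "Idle", "Gi0/1": "up", "Gi0/2": "down"}
print(compare_states(current, expected))
```

An empty dictionary means the device matches expectations, so the result can be used directly in the `if discrepancies:` check.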
2. Lack of idempotence
An idempotent agent leaves the network in the same end state whether it runs once or several times. A non-idempotent agent can cause problems when it runs twice, whether because of an automatic retry, an operator rerunning it, or a failure halfway through.
The problem: If the agent adds an ACL entry and runs twice, it adds the same entry twice. Depending on the vendor, that may be harmless or cause unexpected behavior.
The solution: Before applying any change, verify whether the desired state already exists. If it does, do not apply the change. This “verify before change” pattern is the basis of idempotence in network automation.
Many Ansible network modules are designed to be idempotent, although modules that simply send raw CLI commands do not guarantee it. If you write Python scripts with Netmiko, idempotence is explicitly your responsibility.
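The "verify before change" pattern can be sketched as follows. FakeDevice is a hypothetical in-memory stand-in for a real device session, and its method names (get_acl_entries, send_config) are assumptions for illustration, not the API of Netmiko or any other library. In a real agent those two methods would read and write the device over SSH or an API.

```python
class FakeDevice:
    """Hypothetical stand-in for a device session; keeps ACLs in memory."""

    def __init__(self):
        self.acls = {}          # ACL name -> list of entry strings
        self.commands_sent = 0  # counts how many changes were pushed

    def get_acl_entries(self, acl_name):
        return list(self.acls.get(acl_name, []))

    def send_config(self, acl_name, entry):
        self.commands_sent += 1
        self.acls.setdefault(acl_name, []).append(entry)


def ensure_acl_entry(device, acl_name, entry):
    """Add the entry only if it is missing. Return True if a change was made."""
    # Collapse whitespace so cosmetic differences do not look like drift.
    wanted = " ".join(entry.split())
    existing = {" ".join(e.split()) for e in device.get_acl_entries(acl_name)}
    if wanted in existing:
        return False  # desired state already present: do nothing
    device.send_config(acl_name, wanted)
    return True


device = FakeDevice()
first = ensure_acl_entry(device, "EDGE-IN", "permit tcp any host 192.0.2.10 eq 443")
second = ensure_acl_entry(device, "EDGE-IN", "permit tcp  any host 192.0.2.10 eq 443")
print(first, second, device.commands_sent)
```

Running the function twice pushes the entry only once, which is the property that makes automatic retries safe.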
3. No automated rollback
Every production change must have a rollback that can run in seconds, not minutes. In automation, manual rollback does not scale: if the agent applied changes to 50 devices and something went wrong, you cannot manually roll back 50 devices in time.
The problem: Typical PoC agents do not implement rollback. The engineer who built it assumed someone would revert manually if needed.
The solution: Before applying any change, capture current device configuration (or the fragment being modified). If post-change validation fails, the agent must automatically restore that captured state.
import logging

logger = logging.getLogger(__name__)

def apply_with_rollback(device, change):
    # Capture the configuration before touching the device.
    snapshot = capture_config_snapshot(device)
    try:
        apply_change(device, change)
        if not post_validate(device, change.expected_state):
            raise PostValidationError("Post-change validation failed")
    except Exception as e:
        # Any failure, including failed validation, triggers the restore.
        logger.error(f"Failure on {device}, starting rollback: {e}")
        restore_snapshot(device, snapshot)
        raise
4. Poor handling of partial failures
An agent that applies changes to multiple devices sequentially or in parallel can fail halfway. If the first 20 devices received the change and the last 30 did not, the network is in an inconsistent state.
The problem: The PoC agent exits abruptly on the first error and leaves work half done. There is no clear notification of how many devices were modified and how many were not.
The solution: Define explicit behavior on partial failures before writing the agent:
- Fail fast: on first error, stop everything and roll back changes already applied. Appropriate for interdependent changes.
- Fail slow: continue with remaining devices, log errors, and report full results at the end. Appropriate for independent per-device changes.
- Failure threshold: continue while the failure rate stays below a configured threshold (for example, abort if more than 10% of devices fail).
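The three policies above can be expressed as a single loop with a configurable stop condition. This is a minimal sequential sketch under stated assumptions: the names run_with_policy and RunReport are hypothetical, the failure rate is computed against the total number of planned devices, and rolling back devices that already succeeded under fail fast is left to the caller, who can read them from the report.

```python
from dataclasses import dataclass, field


@dataclass
class RunReport:
    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)   # device -> error message
    skipped: list = field(default_factory=list)  # never attempted
    aborted: bool = False


def run_with_policy(devices, apply_change, policy="fail_slow", max_failure_rate=0.10):
    """Apply a change per device and stop according to the chosen policy."""
    if policy not in ("fail_fast", "fail_slow", "threshold"):
        raise ValueError(f"unknown policy: {policy}")
    devices = list(devices)
    report = RunReport()
    for index, device in enumerate(devices):
        try:
            apply_change(device)
            report.succeeded.append(device)
        except Exception as exc:
            report.failed[device] = str(exc)
            failure_rate = len(report.failed) / len(devices)
            over_limit = policy == "threshold" and failure_rate > max_failure_rate
            if policy == "fail_fast" or over_limit:
                report.aborted = True
                report.skipped = devices[index + 1:]
                break
    return report


# Hypothetical run: ten devices, two of which reject the change.
def flaky_change(device):
    if device in ("d3", "d7"):
        raise RuntimeError("commit rejected")

all_devices = [f"d{n}" for n in range(1, 11)]
report = run_with_policy(all_devices, flaky_change, policy="threshold")
print(report.succeeded, list(report.failed), report.skipped, report.aborted)
```

Whichever policy is used, the report states exactly which devices were changed, which failed, and which were never attempted, which addresses the "no clear notification" problem described above.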
The production-ready agent model
A production-ready network automation agent has this structure:
1. INVENTORY → Get the device list from the source of truth
2. PRE-VALIDATION → Verify the current state of each device
3. SNAPSHOT → Capture the current configuration before changing it
4. EXECUTION → Apply changes with error handling and concurrency limits
5. POST-VALIDATION → Verify that the post-change state is the expected one
6. ROLLBACK → Revert automatically if validation fails
7. REPORT → Record a detailed per-device result and notify
Each phase is explicit, independently testable, and has defined failure behavior.
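One way to keep the phases explicit and independently testable is to pass each one in as a plain function. The sketch below is an illustration under assumptions, not a definitive implementation: run_pipeline and its parameter names are hypothetical, devices are processed sequentially (concurrency limits are omitted for brevity), and the dry_run flag is an assumed way to stop after the snapshot without applying changes.

```python
import logging

logger = logging.getLogger("agent")


def run_pipeline(inventory, pre_validate, snapshot, execute,
                 post_validate, rollback, dry_run=False):
    """Run the seven phases per device and return {device: outcome}."""
    results = {}
    for device in inventory():                     # 1. inventory
        try:
            pre_validate(device)                   # 2. pre-validation
        except Exception as exc:
            results[device] = f"skipped: {exc}"
            continue
        saved = snapshot(device)                   # 3. snapshot
        if dry_run:
            results[device] = "dry-run ok"
            continue
        try:
            execute(device)                        # 4. execution
            if not post_validate(device):          # 5. post-validation
                raise RuntimeError("post-validation failed")
            results[device] = "changed"
        except Exception as exc:
            rollback(device, saved)                # 6. rollback
            results[device] = f"rolled back: {exc}"
    for device, outcome in results.items():        # 7. report
        logger.info("%s: %s", device, outcome)
    return results


# Hypothetical in-memory "network" used to exercise the pipeline.
configs = {"r1": "old", "r2": "old", "r3": "old"}

def fake_pre_validate(device):
    if device == "r2":
        raise RuntimeError("BGP session down")

def fake_execute(device):
    configs[device] = "new"

def fake_rollback(device, saved):
    configs[device] = saved

results = run_pipeline(
    inventory=lambda: ["r1", "r2", "r3"],
    pre_validate=fake_pre_validate,
    snapshot=lambda device: configs[device],
    execute=fake_execute,
    post_validate=lambda device: device != "r3",
    rollback=fake_rollback,
)
print(results)
print(configs)
```

Because every phase is an argument, each one can be unit tested with fakes like these, and a failure in one device never leaves that device without a recorded outcome.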
Taking the agent to production: the rollout plan
Rolling a network automation agent into production should be gradual:
Week 1: Read-only. The agent runs against production devices without applying changes. It pre-validates, generates snapshots, verifies state, and reports. This confirms that the agent can reach all devices and that pre-validation works against real network state.
Weeks 2–3: One low-risk segment. The agent applies changes to a small set (5–10) of lower-criticality devices, with an engineer monitoring execution in real time.
Week 4 onward: Gradual expansion. Increase agent scope based on results. Define clear success metrics: failure rate, execution time, number of rollbacks triggered.
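The success metrics named above can be computed from per-device run records and used as a gate before widening scope. This is a small sketch under assumptions: the DeviceRun record, the summarize and ready_to_expand functions, and the threshold values are all hypothetical and should be replaced by whatever the team agrees on.

```python
from dataclasses import dataclass


@dataclass
class DeviceRun:
    device: str
    ok: bool
    rolled_back: bool
    seconds: float


def summarize(runs):
    """Compute failure rate, rollback count, and mean execution time."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    failures = sum(1 for r in runs if not r.ok)
    return {
        "devices": total,
        "failure_rate": failures / total,
        "rollbacks": sum(1 for r in runs if r.rolled_back),
        "mean_seconds": sum(r.seconds for r in runs) / total,
    }


def ready_to_expand(summary, max_failure_rate=0.05, max_rollbacks=0):
    """Illustrative gate: expand only if the last stage stayed within limits."""
    return (summary["failure_rate"] <= max_failure_rate
            and summary["rollbacks"] <= max_rollbacks)


# Hypothetical results from a four-device pilot segment.
pilot = [
    DeviceRun("sw1", ok=True, rolled_back=False, seconds=2.0),
    DeviceRun("sw2", ok=True, rolled_back=False, seconds=4.0),
    DeviceRun("sw3", ok=False, rolled_back=True, seconds=6.0),
    DeviceRun("sw4", ok=True, rolled_back=False, seconds=8.0),
]
summary = summarize(pilot)
print(summary, ready_to_expand(summary))
```

Writing the gate down as code makes the expansion decision repeatable instead of a judgment call made after each run.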
From tool to infrastructure
A network automation agent that moves to production stops being one engineer’s tool and becomes operational infrastructure. That means it needs:
- Versioned code and review. All agent changes go through pull requests and review, like application code.
- Automated tests. A test suite that validates the agent before each deployment.
- Monitoring and alerts. The agent itself is monitored: executions are logged, failures generate alerts, and performance is tracked over time.
- Operational documentation. The agent has a runbook any team member can follow to operate it, diagnose failures, and run manual rollbacks if needed.
Network automation does not fail because the technology is bad. It fails when it is treated as a personal script instead of production infrastructure.
How Ayuda.LA can help
At Ayuda.LA we design and implement network automation agents for ISPs and enterprises in Latin America that are built for production from day one. We do not deliver PoCs — we deliver operational systems with validation, rollback, and monitoring built in.
If you have a stalled automation project because the PoC does not scale, or you want to build network automation from the ground up with production criteria, let’s talk.
The difference between a PoC and a production system is not how much code it has — it is how many ways it can fail that it accounts for.