HLD: Automated Fault Detection & Prevention System (Router Monitoring Platform) (Asked at Microsoft)
1. Problem Statement
Design a scalable platform that continuously monitors routers and other network devices, detects anomalies and failures in real time, predicts potential issues, and automatically triggers preventive or remediation actions.
Examples:
Detect high CPU spikes on routers
Predict link failures
Detect packet drops or abnormal latency
Restart interfaces automatically
Shift traffic to healthy routes
2. Functional Requirements (FR)
1. The system should continuously collect router telemetry such as CPU utilization, memory usage, packet drops, interface statistics, link status, temperature, and logs/SNMP traps at configurable intervals.
2. The system should detect failures and anomalies like router downtime, link degradation, interface flapping, traffic anomalies, high latency, and packet loss within a few seconds.
3. The system should automatically execute preventive or corrective actions such as restarting interfaces, switching routes, triggering failover, throttling traffic, or rebooting unhealthy devices based on predefined policies.
4. The system should generate alerts and incidents through channels like email, Slack, PagerDuty, and SMS for critical network issues and remediation events.
5. The system should store historical telemetry and fault data to support trend analysis, root cause analysis, capacity planning, and ML/rule-based failure prediction.
3. Non-Functional Requirements (NFR)
NFR1. High Availability
The monitoring system should remain operational even during node failures.
Target: 99.99% availability.
NFR2. Low Detection Latency
Faults should be detected within:
2–5 seconds for critical failures
<1 second for heartbeat failures
NFR3. Massive Scalability
Support:
Millions of telemetry events/sec
100K+ routers
Multi-region deployments
NFR4. Fault Tolerance
No data loss during:
Broker failures
Service crashes
Network partitions
NFR5. Extensibility
System should support:
New device types
New telemetry protocols
Custom remediation plugins
without major redesign.
4. High Level Architecture
+----------------+
| Admin UI |
+--------+-------+
|
v
+----------------+
| API Gateway |
+--------+-------+
|
------------------------------------------------
| | | |
v v v v
+------------+ +-------------+ +-------------+ +--------------+
| Device Mgmt| | Alert Svc | | Policy Svc | | Analytics Svc|
+------------+ +-------------+ +-------------+ +--------------+
|
v
+------------------+
| Kafka / Pulsar |
+--------+---------+
|
---------------------------------------------------
| | | |
v v v v
+------------+ +--------------+ +-------------+ +--------------+
| Collectors | | Detection Svc| | ML Engine | | Remediation |
+------------+ +--------------+ +-------------+ +--------------+
|
v
+---------------+
| Routers |
+---------------+
5. Core Components
A. Telemetry Collectors
Responsible for:
SNMP polling
Streaming telemetry
Syslog ingestion
NetFlow/sFlow collection
Protocols:
SNMP
gRPC telemetry
NETCONF
SSH
Syslog
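A minimal collector loop sketch (Python, assuming the kafka-python client; poll_snmp is a hypothetical placeholder for a real SNMP/gNMI call) showing how polled metrics could be normalized and pushed to the streaming layer:

import json
import time

from kafka import KafkaProducer  # kafka-python client (assumed available)

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def poll_snmp(router_ip):
    # Placeholder: a real collector would issue SNMP GET/BULK requests (e.g. via pysnmp)
    # or subscribe to gRPC streaming telemetry.
    return {"cpu": 42, "memory": 60}

def collect(routers, interval_sec=30):
    """Poll each router at a configurable interval and publish one event per metric."""
    while True:
        for router_id, ip in routers.items():
            ts = int(time.time())
            for name, value in poll_snmp(ip).items():
                event = {"routerId": router_id, "timestamp": ts, "metric": name, "value": value}
                producer.send("telemetry." + name, value=event)  # telemetry.cpu, telemetry.memory
        producer.flush()
        time.sleep(interval_sec)

# collect({"RTR-1001": "10.0.0.5"})  # runs forever; one collector process handles many routers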
B. Event Streaming Layer
Use:
Apache Kafka
or Apache Pulsar
Purpose:
Buffer telemetry
Decouple services
Enable replay
Handle burst traffic
Topics:
telemetry.cpu
telemetry.memory
telemetry.interface
alerts
remediation.tasks
C. Detection Engine
Performs:
Threshold detection
Pattern detection
Correlation
Stateful analysis
Examples:
CPU > 90% for 5 mins
Packet loss > 20%
Interface flapping > 5 times/min
Techniques:
Rule engine
CEP (Complex Event Processing)
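A minimal sketch of the stateful "CPU > 90% for 5 minutes" rule (Python; in production this per-router state would live in a stream processor such as Flink or Kafka Streams rather than in-process memory):

from collections import defaultdict

CPU_THRESHOLD = 90
DURATION_SEC = 5 * 60

# routerId -> timestamp when CPU first crossed the threshold (None = no active breach)
breach_started = defaultdict(lambda: None)

def evaluate_cpu_rule(router_id, cpu, ts):
    """Emit an alert only when CPU stays above the threshold for the full duration."""
    if cpu <= CPU_THRESHOLD:
        breach_started[router_id] = None      # condition cleared, reset state
        return None
    if breach_started[router_id] is None:
        breach_started[router_id] = ts        # breach just started, keep watching
        return None
    if ts - breach_started[router_id] >= DURATION_SEC:
        return {"routerId": router_id, "rule": "cpu > 90 for 5m",
                "severity": "HIGH", "timestamp": ts}
    return None

# Example: the alert fires on the sample taken 5 minutes after the breach began.
print(evaluate_cpu_rule("RTR-1001", 95, 0))    # None
print(evaluate_cpu_rule("RTR-1001", 96, 150))  # None
print(evaluate_cpu_rule("RTR-1001", 97, 300))  # alert dict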
D. ML Prediction Engine
Predicts:
Link failure probability
Hardware degradation
Congestion patterns
Models:
Time-series forecasting
Isolation Forest
LSTM
ARIMA
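As one concrete option, an Isolation Forest can be trained offline on historical metric vectors and used to score incoming samples. A minimal scikit-learn sketch with made-up numbers (feature engineering and retraining cadence omitted):

import numpy as np
from sklearn.ensemble import IsolationForest

# Historical feature vectors per sample: [cpu, memory, packet_loss, latency_ms]
history = np.array([
    [40, 55, 0, 12], [45, 60, 1, 15], [38, 52, 0, 11], [50, 58, 0, 14],
    [42, 57, 1, 13], [47, 61, 0, 16], [44, 56, 0, 12], [41, 54, 1, 13],
])

model = IsolationForest(contamination=0.1, random_state=42).fit(history)

# Score a new sample; predict() returns -1 for anomalies, 1 for normal points.
sample = np.array([[95, 72, 15, 240]])
print(model.predict(sample))             # likely [-1] -> hand off to the Policy Engine
print(model.decision_function(sample))   # lower scores = more anomalous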
E. Remediation Service
Executes automated actions:
Route failover
Interface reset
Restart BGP sessions
Traffic rerouting
Must support:
Rollback
Audit logs
Approval workflows
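A simplified execution wrapper illustrating the rollback and audit-log requirements (Python; the device operations are hypothetical stand-ins for NETCONF/SSH commands, and any approval workflow is assumed to complete before this runs):

import logging
import uuid

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("remediation.audit")

def run_remediation(router_id, action, execute_fn, rollback_fn):
    """Execute an action, audit every step, and roll back if the action fails."""
    execution_id = "REM-" + uuid.uuid4().hex[:8]
    audit_log.info("%s STARTED router=%s action=%s", execution_id, router_id, action)
    try:
        execute_fn(router_id)
        audit_log.info("%s COMPLETED", execution_id)
        return {"executionId": execution_id, "status": "COMPLETED"}
    except Exception as exc:
        audit_log.error("%s FAILED (%s), rolling back", execution_id, exc)
        rollback_fn(router_id)
        audit_log.info("%s ROLLED_BACK", execution_id)
        return {"executionId": execution_id, "status": "ROLLED_BACK"}

# Hypothetical device operations; a real service would push config via NETCONF/SSH.
restart_interface = lambda router_id: None
restore_interface = lambda router_id: None
print(run_remediation("RTR-1001", "restart_interface", restart_interface, restore_interface))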
F. Alerting Service
Integrations:
PagerDuty
Slack
Email/SMS systems
Supports:
Deduplication
Escalation
Suppression
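A minimal deduplication sketch (Python; the window length and in-memory store are illustrative, a real service would back this with Redis or a database):

import time

DEDUP_WINDOW_SEC = 300   # suppress identical alerts for 5 minutes
last_sent = {}           # (routerId, message) -> last notification time

def should_notify(alert, now=None):
    """Drop an alert if the same router/message pair already fired within the window."""
    now = now or time.time()
    key = (alert["routerId"], alert["message"])
    if key in last_sent and now - last_sent[key] < DEDUP_WINDOW_SEC:
        return False     # duplicate inside the window -> suppress
    last_sent[key] = now
    return True

alert = {"routerId": "RTR-1001", "message": "Packet loss exceeded threshold"}
print(should_notify(alert, now=1000))   # True  -> send to PagerDuty/Slack
print(should_notify(alert, now=1100))   # False -> suppressed duplicate
print(should_notify(alert, now=1400))   # True  -> window expired, send again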
6. APIs
A. Register Router API
Request
POST /api/v1/routers
{
  "routerId": "RTR-1001",
  "hostname": "blr-edge-1",
  "ip": "10.0.0.5",
  "location": "Bangalore-DC",
  "vendor": "Cisco",
  "model": "ASR-9000"
}
Response
{
  "status": "REGISTERED",
  "routerId": "RTR-1001"
}
B. Push Telemetry API
POST /api/v1/telemetry
Request
{
  "routerId": "RTR-1001",
  "timestamp": 1710001231,
  "metrics": {
    "cpu": 91,
    "memory": 72,
    "packetLoss": 15,
    "latency": 240
  }
}
Response
{
  "status": "INGESTED"
}
C. Create Remediation Policy
POST /api/v1/policies
Request
{
  "policyName": "HighCPURecovery",
  "condition": "cpu > 90 for 5m",
  "action": "restart_interface",
  "severity": "HIGH"
}
Response
{
  "policyId": "POL-101",
  "status": "ACTIVE"
}
D. Get Active Alerts
GET /api/v1/alerts?severity=CRITICAL
Response
[
  {
    "alertId": "ALT-1001",
    "routerId": "RTR-1001",
    "message": "Packet loss exceeded threshold",
    "severity": "CRITICAL",
    "timestamp": 1710009999
  }
]
E. Execute Manual Remediation
POST /api/v1/remediation/execute
Request
{
  "routerId": "RTR-1001",
  "action": "restart_bgp"
}
Response
{
  "executionId": "REM-1009",
  "status": "IN_PROGRESS"
}
7. Database Design
A. Router Metadata Table
CREATE TABLE routers (
  router_id VARCHAR(50) PRIMARY KEY,
  hostname VARCHAR(255),
  ip_address VARCHAR(50),
  vendor VARCHAR(50),
  model VARCHAR(50),
  location VARCHAR(255),
  status VARCHAR(20),
  created_at TIMESTAMP
);
B. Telemetry Table (Time-Series DB)
CREATE TABLE telemetry_metrics (
  router_id VARCHAR(50),
  metric_name VARCHAR(50),
  metric_value DOUBLE,
  ts TIMESTAMP,
  PRIMARY KEY (router_id, ts, metric_name)
);
Recommended DB:
InfluxDB
OpenTSDB
C. Alert Table
CREATE TABLE alerts (
  alert_id VARCHAR(50) PRIMARY KEY,
  router_id VARCHAR(50),
  severity VARCHAR(20),
  message TEXT,
  status VARCHAR(20),
  created_at TIMESTAMP
);
D. Policy Table
CREATE TABLE remediation_policies (
  policy_id VARCHAR(50) PRIMARY KEY,
  policy_name VARCHAR(255),
  condition_expression TEXT,
  action VARCHAR(100),
  severity VARCHAR(20),
  enabled BOOLEAN
);
E. Remediation Execution Table
CREATE TABLE remediation_execution (
  execution_id VARCHAR(50) PRIMARY KEY,
  router_id VARCHAR(50),
  action VARCHAR(100),
  status VARCHAR(50),
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  rollback_status VARCHAR(50)
);
8. Storage Choices
| Data Type | Database |
| ----------------- | -------------------- |
| Router Metadata | PostgreSQL |
| Telemetry Metrics | InfluxDB / Cassandra |
| Logs | Elasticsearch |
| Alerts | PostgreSQL |
| Event Streaming | Kafka |
| ML Features | S3/HDFS |
9. Microservices
| Service | Responsibility |
| --------------------------- | ----------------------------- |
| API Gateway | Authentication, routing |
| Device Service | Router metadata management |
| Telemetry Ingestion Service | Accept telemetry |
| Detection Engine | Fault detection |
| Policy Engine | Evaluate remediation policies |
| Remediation Service | Execute preventive actions |
| Alert Service | Notifications |
| Analytics Service | Dashboards and trends |
| ML Prediction Service | Predict future failures |
10. Microservices Interaction Flow
Scenario:
CPU spike detected on router.
1. Router sends telemetry
2. Collector publishes to Kafka
3. Detection Engine consumes telemetry
4. Rule matched: CPU > 90%
5. Alert generated
6. Policy Engine checks remediation rules
7. Remediation Service restarts interface
8. Result published back to Kafka
9. Alert Service sends PagerDuty alert
10. Analytics Service updates dashboard
11. Sequence Flow Diagram
Router
|
| Telemetry
v
Collector Service
|
| Publish Event
v
Kafka
|
| Consume
v
Detection Engine
|
| Fault Detected
v
Policy Engine
|
| Execute Action
v
Remediation Service
|
| SSH/NETCONF
v
Router
Detection Engine
|
| Create Alert
v
Alert Service
|
| Notify
v
PagerDuty / Slack / Email
12. Scaling Strategy
Horizontal Scaling
Multiple collectors
Partitioned Kafka topics
Stateless detection workers
Kafka Partitioning
Partition by:
hash(routerId)
Ensures:
Ordering per router
Balanced processing
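Keying each message by routerId is enough to get this behaviour, since the default Kafka partitioner hashes the key. A small producer sketch (Python, kafka-python assumed):

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_telemetry(event):
    # routerId as the message key: all events for a router hash to the same
    # partition (ordering preserved), while different routers spread across
    # partitions (balanced processing).
    producer.send("telemetry.cpu", key=event["routerId"], value=event)

publish_telemetry({"routerId": "RTR-1001", "metric": "cpu", "value": 91})
producer.flush()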
Telemetry Optimization
Use:
Compression
Batch ingestion
Sampling
Aggregation windows
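For example, aggregation windows can collapse many raw samples into a single data point per router per window before ingestion. A minimal tumbling-window sketch (window size and metric are illustrative):

from collections import defaultdict

WINDOW_SEC = 30
buckets = defaultdict(list)   # (routerId, window_start) -> CPU samples in that window

def add_sample(router_id, cpu, ts):
    window_start = ts - (ts % WINDOW_SEC)
    buckets[(router_id, window_start)].append(cpu)

def flush_window(router_id, window_start):
    """Emit one averaged data point instead of every raw sample."""
    samples = buckets.pop((router_id, window_start), [])
    if samples:
        return {"routerId": router_id, "windowStart": window_start,
                "cpuAvg": sum(samples) / len(samples), "samples": len(samples)}

for ts, cpu in [(1, 40), (10, 55), (25, 90)]:
    add_sample("RTR-1001", cpu, ts)
print(flush_window("RTR-1001", 0))   # one aggregated event for the 0-30s window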
13. Reliability & Fault Tolerance
Retry Mechanisms
Exponential backoff
Dead-letter queues
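A consumer-side sketch of retries with exponential backoff and a dead-letter hand-off (Python; retry count, delays, and the DLQ topic name are illustrative):

import time

MAX_RETRIES = 3

def process_with_retry(event, handler, dead_letter_fn):
    """Retry the handler with exponential backoff, then park the event in a DLQ."""
    for attempt in range(MAX_RETRIES):
        try:
            return handler(event)
        except Exception:
            time.sleep(min(2 ** attempt, 30))   # 1s, 2s, 4s ... capped backoff
    dead_letter_fn(event)                       # preserved for later inspection and replay

def flaky_handler(event):
    raise RuntimeError("device unreachable")

process_with_retry({"routerId": "RTR-1001"}, flaky_handler,
                   lambda e: print("sent to remediation.tasks.dlq:", e))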
Exactly Once Processing
Use:
Kafka transactions
Idempotent consumers
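Kafka transactions handle the broker side; consumers still need to be idempotent so redeliveries after a retry or rebalance have no side effects. A minimal idempotency sketch (the dedup key and in-memory set are illustrative; a durable store would be used in production):

processed_ids = set()   # in production: Redis or a database table keyed by event id

def handle_once(event):
    """Skip events that were already processed, so redelivery has no side effects."""
    event_id = (event["routerId"], event["timestamp"], event.get("action"))
    if event_id in processed_ids:
        return "SKIPPED"       # duplicate delivery after a retry/rebalance
    processed_ids.add(event_id)
    # ... apply the remediation / write the alert exactly once ...
    return "PROCESSED"

evt = {"routerId": "RTR-1001", "timestamp": 1710001231, "action": "restart_interface"}
print(handle_once(evt))   # PROCESSED
print(handle_once(evt))   # SKIPPED on redelivery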
Multi-Region Deployment
Deploy:
Active-active clusters
Geo-replication
14. Security Considerations
mTLS between routers and collectors
RBAC for remediation APIs
Encrypted telemetry
Audit logs for all actions
Secret management using:
HashiCorp Vault
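For example, a collector can authenticate to the brokers over mutual TLS. A minimal kafka-python configuration sketch (certificate paths are assumptions, with key material issued and rotated via Vault):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/pki/ca.pem",           # CA that signed the broker certificates
    ssl_certfile="/etc/pki/collector.pem",  # this collector's client certificate
    ssl_keyfile="/etc/pki/collector.key",   # client private key (pulled from Vault)
    ssl_check_hostname=True,
)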
15. Bottlenecks & Optimizations
| Bottleneck | Optimization |
| --------------------- | ------------------- |
| High telemetry volume | Kafka batching |
| Detection lag | Stream processing |
| Alert storms | Alert deduplication |
| Expensive queries | Pre-aggregation |
| Large storage | Tiered retention |
16. Interview Deep Dive Questions
How will you detect interface flapping efficiently?
Why choose Kafka over RabbitMQ?
How would you avoid alert storms?
How will rollback work during failed remediation?
How do you guarantee ordering per router?
How would ML-based prediction integrate?
How would the system behave during Kafka outages?
How do you prevent false positives?
How would you support millions of routers?
How would you design multi-region failover?

