HLD: Design Container Management System (Manage / Move / Deploy Containers Across Networks)
Think of this as a simplified version of Kubernetes plus cluster federation: containers can be deployed, migrated, monitored, and managed across multiple data centers, cloud providers, and networks.
Functional Requirements
Deploy containers on available compute nodes using Docker/OCI compliant images.
Migrate running containers across nodes or clusters while preserving application state and network connectivity.
Manage container lifecycle operations including start, stop, restart, update, and scaling.
Create and manage virtual networks, service discovery, and cross-network communication between containers.
Continuously monitor container and node health, collect metrics, detect failures, and trigger automatic recovery actions.
Non-Functional Requirements
Ensure high availability with a target uptime of at least 99.99%.
Scale to support up to 100,000 nodes and 1 million running containers.
Achieve container deployment and scheduling latency of less than 5 seconds.
Provide fault tolerance such that node, network, or service failures do not cause application downtime.
Enforce enterprise-grade security through RBAC, TLS-encrypted communication, image signing and verification, and audit logging.
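To make the availability and scale targets concrete, the arithmetic behind them can be sketched as follows. The function names are illustrative only, and the calculation assumes a 365-day year and an even spread of containers across nodes, which a real fleet will not have.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed by an availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


def average_containers_per_node(containers: int, nodes: int) -> float:
    """Return the mean container density if load were spread evenly."""
    return containers / nodes


if __name__ == "__main__":
    # 99.99% uptime leaves roughly 52.56 minutes of downtime per year.
    print(f"{downtime_budget_minutes(99.99):.2f} min/year")
    # 1 million containers on 100,000 nodes averages 10 per node.
    print(average_containers_per_node(1_000_000, 100_000))
```

A budget of under an hour per year is the reason the later sections lean on replicated cluster state and automatic rescheduling rather than manual recovery.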
Core APIs
Deploy Container
POST /v1/containers
Request
{
  "image": "nginx:latest",
  "cpu": "2",
  "memory": "4GB",
  "network_id": "net-123",
  "replicas": 3
}
Response
{
  "container_id": "ctr-123",
  "status": "DEPLOYING"
}
Get Container
GET /v1/containers/{containerId}
Response
{
  "container_id": "ctr-123",
  "status": "RUNNING",
  "node_id": "node-45",
  "ip": "10.0.1.20"
}
Migrate Container
POST /v1/containers/{containerId}/migrate
Request
{
  "target_node": "node-99"
}
Response
{
  "migration_id": "mig-111",
  "status": "IN_PROGRESS"
}
Scale Container
POST /v1/containers/{containerId}/scale
Request
{
  "replicas": 20
}
Response
{
  "status": "SCALING"
}
Create Network
POST /v1/networks
Request
{
  "name": "payment-network",
  "cidr": "10.10.0.0/16"
}
Response
{
  "network_id": "net-123"
}
Database Choice
| Component | DB |
| ------------------ | ------------- |
| Container Metadata | PostgreSQL |
| Cluster State | etcd |
| Metrics | Cassandra |
| Logs | Elasticsearch |
| Cache | Redis |
Database Schemas
Container
{
  "container_id": "ctr-123",
  "image": "nginx:latest",
  "status": "RUNNING",
  "node_id": "node-45",
  "network_id": "net-123",
  "cpu": "2",
  "memory": "4GB",
  "created_at": "timestamp"
}
Node
{
  "node_id": "node-45",
  "hostname": "worker-1",
  "ip": "192.168.1.10",
  "cpu_total": "64",
  "cpu_available": "20",
  "memory_total": "256GB",
  "memory_available": "120GB",
  "status": "HEALTHY"
}
Cluster
{
  "cluster_id": "cluster-1",
  "name": "us-east-cluster",
  "region": "us-east",
  "status": "ACTIVE"
}
Network
{
  "network_id": "net-123",
  "name": "payment-network",
  "cidr": "10.10.0.0/16",
  "gateway": "10.10.0.1"
}
Migration Job
{
  "migration_id": "mig-111",
  "container_id": "ctr-123",
  "source_node": "node-45",
  "target_node": "node-99",
  "status": "IN_PROGRESS",
  "created_at": "timestamp"
}
High Level Components
                +------------------+
                |       User       |
                +---------+--------+
                          |
                          v
                +------------------+
                |   API Gateway    |
                +---------+--------+
                          |
        +-----------------+-----------------+
        |                 |                 |
        v                 v                 v
+----------------+ +--------------+ +---------------+
|   Deployment   | |  Migration   | |    Network    |
|    Service     | |   Service    | |    Service    |
+-------+--------+ +------+-------+ +-------+-------+
        |                 |                 |
        v                 v                 v
+---------------------------------------------------+
|                 Scheduler Service                 |
+---------------------------------------------------+
                          |
                          v
+---------------------------------------------------+
|                  Cluster Manager                  |
+---------------------------------------------------+
                          |
        +-----------------+-----------------+
        |                 |                 |
        v                 v                 v
 +------------+    +-------------+   +-------------+
 | Node Agent |    | Node Agent  |   | Node Agent  |
 |   Node-1   |    |   Node-2    |   |   Node-N    |
 +------------+    +-------------+   +-------------+
Container Deployment Flow
User
  |
  | Deploy Container
  v
API Gateway
  |
  v
Deployment Service
  |
  v
Scheduler
  |
  | Find best node
  v
Cluster Manager
  |
  v
Node Agent
  |
  | Pull Image
  | Create Container
  v
Container Runtime
  |
  v
Container Running
Container Migration Flow
User
  |
  | Migrate Container
  v
Migration Service
  |
  v
Scheduler
  |
  | Select Target Node
  v
Checkpoint Service
  |
  | Save Container State
  v
Target Node Agent
  |
  | Restore State
  v
Network Service
  |
  | Update Routing
  v
Container Running on New Node
Microservices
1. API Gateway
Authentication
Rate limiting
Routing
2. Deployment Service
Container deployment
Scaling requests
3. Scheduler Service
Node selection
Resource allocation
4. Migration Service
Container relocation
State transfer
5. Network Service
Overlay networking
Service discovery
6. Cluster Manager
Cluster state management
Node registration
7. Health Monitoring Service
Metrics collection
Alerting
8. Auto-Healing Service
Restart failed containers
Reschedule workloads
9. Image Registry Service
Image storage
Image versioning
10. Audit & Security Service
RBAC
Secrets management
Compliance logs
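As an illustration of the Scheduler Service's node selection and resource allocation, here is a minimal filter-then-score sketch. It is one possible policy, not the design's prescribed algorithm. The `Node` class borrows field names from the Node schema but stores them as numbers (an assumption, since the schema shows strings), and `select_node`, `score`, and `allocate` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    cpu_total: float
    cpu_available: float
    memory_total_gb: float
    memory_available_gb: float
    status: str = "HEALTHY"


def score(node: Node, cpu: float, memory_gb: float) -> float:
    """Fraction of the scarcer resource that stays free after placement."""
    cpu_free = (node.cpu_available - cpu) / node.cpu_total
    mem_free = (node.memory_available_gb - memory_gb) / node.memory_total_gb
    return min(cpu_free, mem_free)


def select_node(nodes: List[Node], cpu: float, memory_gb: float) -> Optional[Node]:
    """Filter out nodes that cannot fit the request, then pick the best score."""
    feasible = [
        n for n in nodes
        if n.status == "HEALTHY"
        and n.cpu_available >= cpu
        and n.memory_available_gb >= memory_gb
    ]
    if not feasible:
        return None  # caller should queue the request or report failure
    return max(feasible, key=lambda n: score(n, cpu, memory_gb))


def allocate(node: Node, cpu: float, memory_gb: float) -> None:
    """Reserve resources on the chosen node so later requests see the new totals."""
    node.cpu_available -= cpu
    node.memory_available_gb -= memory_gb


if __name__ == "__main__":
    nodes = [
        Node("node-45", 64, 20, 256, 120),
        Node("node-99", 64, 60, 256, 200),
        Node("node-7", 64, 64, 256, 256, status="UNHEALTHY"),
    ]
    chosen = select_node(nodes, cpu=2, memory_gb=4)
    if chosen is not None:
        allocate(chosen, 2, 4)
        print(chosen.node_id, chosen.cpu_available)
```

Scoring by the scarcer resource spreads load across nodes; a bin-packing policy would invert the comparison to fill nodes before opening new ones.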
Microservice Interaction
                   +----------------+
                   |  API Gateway   |
                   +-------+--------+
                           |
      +-------------+------+------+-------------+
      |             |             |             |
      v             v             v             v
 Deployment     Migration      Network     Monitoring
   Service       Service       Service       Service
      |             |             |             |
      +-------------+------+------+-------------+
                           |
                           v
                       Scheduler
                           |
                           v
                    Cluster Manager
                           |
             +-------------+-------------+
             |             |             |
             v             v             v
        Node Agent    Node Agent    Node Agent
             |             |             |
             +-------------+-------------+
                           |
                           v
                   Container Runtime
                 (containerd / Docker)
Bottlenecks & Scaling
Scheduler Bottleneck
Use sharded schedulers.
Leader election via etcd.
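One way to shard schedulers is to assign each cluster to a shard by hashing its ID, as in the sketch below. The function name `shard_for` and the choice of SHA-256 with modulo are assumptions for illustration. Leader election within each shard is not shown, since it depends on the etcd client in use.

```python
import hashlib


def shard_for(cluster_id: str, shard_count: int) -> int:
    """Map a cluster ID to a scheduler shard using a process-independent hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # hashlib is used instead of built-in hash(), which is randomized per process.
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for cluster_id in ("cluster-1", "cluster-2", "cluster-3"):
        print(cluster_id, "-> shard", shard_for(cluster_id, 4))
```

Plain modulo remaps most clusters whenever the shard count changes; a consistent-hash ring is a common alternative when shards are added or removed often.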
Metrics Explosion
Store metrics in Cassandra.
Aggregate before persistence.
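Aggregating before persistence can be as simple as rolling raw samples up into fixed time windows and writing one row per window. The sketch below assumes samples arrive as `(unix_timestamp, value)` pairs and uses a 60-second window. The function name and output fields are hypothetical, and the write to Cassandra is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(samples: Iterable[Tuple[int, float]],
              window_seconds: int = 60) -> List[Dict[str, float]]:
    """Roll raw samples up into one summary row per time window."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        window_start = timestamp - timestamp % window_seconds
        buckets[window_start].append(value)
    return [
        {
            "window_start": start,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [(0, 10.0), (30, 20.0), (61, 40.0)]
    for row in aggregate(raw):
        print(row)
```

With one sample per second, a 60-second window cuts the number of stored rows by roughly a factor of 60 while keeping min, max, and average for alerting.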
Container Migration
Use incremental checkpoints.
Transfer only changed memory pages.
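The idea of transferring only changed pages can be sketched with a small simulation: the first round sends every page, and each later round sends only pages whose content differs from the previous round. This is a toy model that detects changes by hashing; a real checkpoint tool would rely on kernel-level dirty-page tracking instead. All names here are illustrative.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # bytes per page, a common default


def page_digests(memory: bytes) -> List[str]:
    """Hash each page so two snapshots can be compared cheaply."""
    return [
        hashlib.sha256(memory[i:i + PAGE_SIZE]).hexdigest()
        for i in range(0, len(memory), PAGE_SIZE)
    ]


def dirty_pages(previous: List[str], memory: bytes) -> Tuple[Dict[int, bytes], List[str]]:
    """Return pages that changed since the previous round, plus the new digests."""
    current = page_digests(memory)
    changed = {
        index: memory[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
        for index, digest in enumerate(current)
        if index >= len(previous) or previous[index] != digest
    }
    return changed, current


def apply_pages(target: bytearray, pages: Dict[int, bytes]) -> None:
    """Write transferred pages into the target node's copy of memory."""
    for index, data in pages.items():
        start = index * PAGE_SIZE
        target[start:start + len(data)] = data


if __name__ == "__main__":
    source = bytearray(4 * PAGE_SIZE)
    target = bytearray(len(source))
    full, digests = dirty_pages([], bytes(source))      # round 1: all pages
    apply_pages(target, full)
    source[2 * PAGE_SIZE] = 0xFF                        # container keeps running
    delta, digests = dirty_pages(digests, bytes(source))
    apply_pages(target, delta)                          # round 2: one page
    print(len(full), len(delta), target == source)
```

Repeating the delta round until the dirty set is small keeps the final pause, when the container is frozen for the last transfer, short.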
Multi-Region Networking
Overlay network (VXLAN/WireGuard).
Global service discovery.
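An overlay that spans regions only routes cleanly if the virtual networks' address ranges do not collide. A minimal sketch of that check, using Python's standard `ipaddress` module, is shown below. The network names other than `payment-network` are made up for the example, and treating the first host address as the gateway is an assumption that happens to match the Network schema.

```python
import ipaddress
from typing import Dict, List, Tuple


def find_overlaps(networks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of network names whose CIDR ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    names = sorted(parsed)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if parsed[a].overlaps(parsed[b])
    ]


def default_gateway(cidr: str) -> str:
    """Return the first usable host address in the range."""
    network = ipaddress.ip_network(cidr)
    return str(next(iter(network.hosts())))


if __name__ == "__main__":
    networks = {
        "payment-network": "10.10.0.0/16",
        "orders-network": "10.10.5.0/24",   # hypothetical, collides with payment
        "search-network": "10.20.0.0/16",   # hypothetical, no collision
    }
    print(find_overlaps(networks))
    print(default_gateway("10.10.0.0/16"))
```

Running this check in the Create Network path would reject a conflicting CIDR before any routes are programmed.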
Cluster State
Store in etcd with Raft consensus.
Multiple replicas for HA.
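How many replicas "multiple" should mean follows from Raft's majority rule: a write commits once a majority of members acknowledge it, so the cluster survives as long as a majority is still reachable. The helper names below are illustrative.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of members that forms a majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Members that can fail while a majority remains available."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

Going from 3 to 4 members raises the quorum without tolerating any additional failure, which is why odd cluster sizes such as 3 or 5 are the usual choice.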
At the control-plane level (API layer, scheduler, replicated cluster state, and per-node agents), this design resembles how orchestration platforms such as Kubernetes, Nomad, and Docker Swarm manage large-scale container deployments across clusters and networks.

