HLD: Design Container Management System (Manage / Move / Deploy Containers Across Networks)

Think of this as a simplified version of Kubernetes + Cluster Federation, where containers can be deployed, migrated, monitored, and managed across multiple data centers, cloud providers, and networks.

Shashank Mishra

Jun 08, 2026

Functional Requirements

Deploy containers on available compute nodes using Docker/OCI compliant images.
Migrate running containers across nodes or clusters while preserving application state and network connectivity.
Manage container lifecycle operations including start, stop, restart, update, and scaling.
Create and manage virtual networks, service discovery, and cross-network communication between containers.
Continuously monitor container and node health, collect metrics, detect failures, and trigger automatic recovery actions.

Non-Functional Requirements

Ensure high availability with a target uptime of at least 99.99%.
Scale to support up to 100,000 nodes and 1 million running containers.
Achieve container deployment and scheduling latency of less than 5 seconds.
Provide fault tolerance such that node, network, or service failures do not cause application downtime.
Enforce enterprise-grade security through RBAC, TLS-encrypted communication, image signing, verification, and audit logging.
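
To make the availability and scale targets concrete, here is a rough back-of-the-envelope sketch. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


budget = downtime_budget_minutes(99.99)
# Average density if all 1 million containers run on all 100,000 nodes.
avg_containers_per_node = 1_000_000 / 100_000

print(f"99.99% uptime allows about {budget:.1f} minutes of downtime per year")
print(f"Average density at full scale: {avg_containers_per_node:.0f} containers per node")
```

A budget of under an hour per year leaves little room for manual recovery, which is why the design leans on automatic failure detection and rescheduling.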

Core APIs

Deploy Container

POST /v1/containers

Request

{
  "image":"nginx:latest",
  "cpu":"2",
  "memory":"4GB",
  "network_id":"net-123",
  "replicas":3
}

Response

{
  "container_id":"ctr-123",
  "status":"DEPLOYING"
}
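
Before a request like this reaches the scheduler, the Deployment Service would likely validate and normalise it. The following is a minimal sketch of that step; `DeployRequest`, `parse_memory`, the unit table, and the default of one replica are all assumptions, since the post only shows the JSON shape.

```python
from dataclasses import dataclass

# Unit multipliers are an assumption; the post only shows "4GB".
_UNITS = {"MB": 1024**2, "GB": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a string such as "4GB" into bytes."""
    for suffix, factor in _UNITS.items():
        if value.upper().endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory value: {value!r}")


@dataclass(frozen=True)
class DeployRequest:
    image: str
    cpu: float
    memory_bytes: int
    network_id: str
    replicas: int


def parse_deploy_request(body: dict) -> DeployRequest:
    """Validate the POST /v1/containers body and normalise its fields."""
    replicas = int(body.get("replicas", 1))  # default of 1 is an assumption
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    cpu = float(body["cpu"])
    if cpu <= 0:
        raise ValueError("cpu must be positive")
    return DeployRequest(
        image=body["image"],
        cpu=cpu,
        memory_bytes=parse_memory(body["memory"]),
        network_id=body["network_id"],
        replicas=replicas,
    )


request = parse_deploy_request(
    {"image": "nginx:latest", "cpu": "2", "memory": "4GB",
     "network_id": "net-123", "replicas": 3}
)
print(request)
```

Converting `cpu` and `memory` to numbers once, at the edge, keeps string parsing out of the scheduler's hot path.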

Get Container

GET /v1/containers/{containerId}

Response

{
  "container_id":"ctr-123",
  "status":"RUNNING",
  "node_id":"node-45",
  "ip":"10.0.1.20"
}

Migrate Container

POST /v1/containers/{containerId}/migrate

Request

{
  "target_node":"node-99"
}

Response

{
  "migration_id":"mig-111",
  "status":"IN_PROGRESS"
}

Scale Container

POST /v1/containers/{containerId}/scale

Request

{
  "replicas":20
}

Response

{
  "status":"SCALING"
}
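
A scale request ultimately becomes a difference between the desired and the current replica count. The sketch below shows one way to compute that plan; `plan_scaling`, the replica ID format, and the "stop the last replicas first" policy are assumptions, not something the post specifies.

```python
def plan_scaling(current_ids: list[str], desired: int) -> tuple[int, list[str]]:
    """Return (replicas_to_start, replica_ids_to_stop) for a scale request."""
    if desired < 0:
        raise ValueError("replicas must not be negative")
    if desired >= len(current_ids):
        return desired - len(current_ids), []
    # Scaling down: stop the surplus replicas at the end of the list (arbitrary policy).
    return 0, current_ids[desired:]


running = ["ctr-123-0", "ctr-123-1", "ctr-123-2"]
scale_up = plan_scaling(running, 20)   # from 3 replicas to 20
scale_down = plan_scaling(running, 1)  # from 3 replicas to 1
print(scale_up)
print(scale_down)
```

Returning a plan instead of acting immediately lets the service answer with `SCALING` right away and apply the changes asynchronously.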

Create Network

POST /v1/networks

Request

{
  "name":"payment-network",
  "cidr":"10.10.0.0/16"
}

Response

{
  "network_id":"net-123"
}

Database Choice

| Component          | DB            |
| ------------------ | ------------- |
| Container Metadata | PostgreSQL    |
| Cluster State      | etcd          |
| Metrics            | Cassandra     |
| Logs               | Elasticsearch |
| Cache              | Redis         |

Database Schemas

Container

{
  "container_id":"ctr-123",
  "image":"nginx:latest",
  "status":"RUNNING",
  "node_id":"node-45",
  "network_id":"net-123",
  "cpu":"2",
  "memory":"4GB",
  "created_at":"timestamp"
}

Node

{
  "node_id":"node-45",
  "hostname":"worker-1",
  "ip":"192.168.1.10",
  "cpu_total":"64",
  "cpu_available":"20",
  "memory_total":"256GB",
  "memory_available":"120GB",
  "status":"HEALTHY"
}
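
The node record carries enough information for a basic placement decision. As a hedged sketch, a scheduler could filter out nodes that are unhealthy or too small and then pick the tightest fit. `pick_node`, `to_gb`, the best-fit policy, the extra nodes, and the `UNHEALTHY` status are hypothetical; only `node-45` and `HEALTHY` come from the schema above.

```python
from typing import Optional


def to_gb(value: str) -> int:
    """Parse "120GB" into 120; only the GB suffix used in the schema is handled."""
    if not value.endswith("GB"):
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(value[:-2])


def pick_node(nodes: list[dict], cpu: int, memory_gb: int) -> Optional[dict]:
    """Return the healthy node that fits the request with the least CPU left over."""
    candidates = [
        node for node in nodes
        if node["status"] == "HEALTHY"
        and int(node["cpu_available"]) >= cpu
        and to_gb(node["memory_available"]) >= memory_gb
    ]
    # Best-fit policy (an assumption): prefer the tightest fit so that
    # roomy nodes stay free for large requests.
    return min(candidates, key=lambda node: int(node["cpu_available"]), default=None)


nodes = [
    {"node_id": "node-45", "cpu_available": "20", "memory_available": "120GB", "status": "HEALTHY"},
    {"node_id": "node-46", "cpu_available": "4", "memory_available": "16GB", "status": "HEALTHY"},
    {"node_id": "node-47", "cpu_available": "60", "memory_available": "200GB", "status": "UNHEALTHY"},
]
small = pick_node(nodes, cpu=2, memory_gb=4)
large = pick_node(nodes, cpu=8, memory_gb=64)
too_big = pick_node(nodes, cpu=100, memory_gb=1)
print(small["node_id"], large["node_id"], too_big)
```

A production scheduler would also weigh affinity, spreading across failure domains, and network locality; this sketch covers the resource check only.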

Cluster

{
  "cluster_id":"cluster-1",
  "name":"us-east-cluster",
  "region":"us-east",
  "status":"ACTIVE"
}

Network

{
  "network_id":"net-123",
  "name":"payment-network",
  "cidr":"10.10.0.0/16",
  "gateway":"10.10.0.1"
}
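
The `cidr` and `gateway` fields can be checked with Python's standard `ipaddress` module. This sketch assumes the gateway is the first usable address in the block, which matches the example record but is a convention rather than a rule; `describe_network` and `overlaps` are illustrative names.

```python
import ipaddress


def describe_network(cidr: str) -> dict:
    """Derive the gateway and usable host count for an IPv4 CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)  # rejects host bits, e.g. 10.10.0.5/16
    gateway = net.network_address + 1  # convention assumed: first usable address
    return {
        "cidr": str(net),
        "gateway": str(gateway),
        "usable_hosts": net.num_addresses - 2,  # minus network and broadcast addresses
    }


def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """True when two networks share addresses and so cannot be routed side by side."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


info = describe_network("10.10.0.0/16")
print(info)
print(overlaps("10.10.0.0/16", "10.10.4.0/24"))
```

An overlap check of this kind is worth running when a network is created, because two overlapping CIDRs are hard to connect across clusters later.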

Migration Job

{
  "migration_id":"mig-111",
  "container_id":"ctr-123",
  "source_node":"node-45",
  "target_node":"node-99",
  "status":"IN_PROGRESS",
  "created_at":"timestamp"
}
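
The `status` field of a migration job implies a small state machine. The post only shows `IN_PROGRESS`; the other states and the transition table below are assumptions added to illustrate how illegal updates could be rejected.

```python
# Only IN_PROGRESS appears in the post; the remaining states are hypothetical.
ALLOWED = {
    "PENDING": {"IN_PROGRESS", "FAILED"},
    "IN_PROGRESS": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}


def advance(job: dict, new_status: str) -> dict:
    """Return a copy of the job moved to new_status, or raise if the move is illegal."""
    if new_status not in ALLOWED[job["status"]]:
        raise ValueError(f"illegal transition {job['status']} -> {new_status}")
    return {**job, "status": new_status}


job = {"migration_id": "mig-111", "container_id": "ctr-123", "status": "IN_PROGRESS"}
done = advance(job, "COMPLETED")
print(done["status"])
```

Treating `COMPLETED` and `FAILED` as terminal makes retries explicit: a retry becomes a new job rather than a silent reset of an old one.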

High Level Components

                    +------------------+
                    |      User        |
                    +---------+--------+
                              |
                              v
                    +------------------+
                    |    API Gateway   |
                    +---------+--------+
                              |
        -------------------------------------
        |                 |                 |
        v                 v                 v

+----------------+ +--------------+ +---------------+
| Deployment     | | Migration    | | Network       |
| Service        | | Service      | | Service       |
+-------+--------+ +------+-------+ +-------+-------+
        |                 |                 |
        |                 |                 |
        v                 v                 v

+--------------------------------------------------+
|              Scheduler Service                   |
+--------------------------------------------------+
                        |
                        v

+--------------------------------------------------+
|              Cluster Manager                     |
+--------------------------------------------------+
                        |
        -----------------------------------
        |                |               |
        v                v               v

+------------+ +-------------+ +-------------+
| Node Agent | | Node Agent  | | Node Agent  |
| Node-1     | | Node-2      | | Node-N      |
+------------+ +-------------+ +-------------+

Container Deployment Flow

User
 |
 | Deploy Container
 v

API Gateway
 |
 v

Deployment Service
 |
 v

Scheduler
 |
 | Find best node
 v

Cluster Manager
 |
 v

Node Agent
 |
 | Pull Image
 | Create Container
 v

Container Runtime
 |
 v

Container Running

Container Migration Flow

User
 |
 | Migrate Container
 v

API Gateway
 |
 v

Migration Service
 |
 v

Scheduler
 |
 | Select Target Node
 v

Checkpoint Service
 |
 | Save Container State
 v

Target Node Agent
 |
 | Restore State
 v

Network Service
 |
 | Update Routing
 v

Container Running on New Node
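
The ordering in this flow matters: routing should switch only after the restore succeeds, and the source copy should be released last. The sketch below models that ordering with in-memory stand-ins. The `Node` class and `migrate` function are hypothetical and do not perform real checkpointing.

```python
class Node:
    """Hypothetical in-memory stand-in for a node agent."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.containers: dict[str, dict] = {}

    def checkpoint(self, container_id: str) -> dict:
        return dict(self.containers[container_id])  # snapshot of the container state

    def restore(self, container_id: str, state: dict) -> None:
        self.containers[container_id] = dict(state)

    def remove(self, container_id: str) -> None:
        self.containers.pop(container_id, None)


def migrate(container_id: str, source: Node, target: Node, routes: dict) -> str:
    """Checkpoint on the source, restore on the target, then switch routing."""
    state = source.checkpoint(container_id)
    try:
        target.restore(container_id, state)
    except Exception:
        target.remove(container_id)  # roll back; the source keeps serving traffic
        raise
    routes[container_id] = target.node_id  # Network Service step: update routing
    source.remove(container_id)            # release the old copy only at the end
    return routes[container_id]


source, target = Node("node-45"), Node("node-99")
source.containers["ctr-123"] = {"counter": 7}
routes = {"ctr-123": "node-45"}
print(migrate("ctr-123", source, target, routes), target.containers["ctr-123"])
```

If the restore fails, the route still points at the source node, so the failure costs a retry rather than an outage.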

Microservices

1. API Gateway

Authentication
Rate limiting
Routing
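
Rate limiting at the gateway is commonly done with a token bucket. The post does not name an algorithm, so treat this as one reasonable option; the `TokenBucket` class and its parameters are illustrative, and the injectable clock exists only to make the example deterministic.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_now = [0.0]  # fake clock so the example does not depend on real time
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
fake_now[0] = 1.0                           # one second later, one token is back
after_wait = bucket.allow()
print(burst, after_wait)
```

In a multi-instance gateway the bucket state would need to live in shared storage such as the Redis cache listed earlier, which this single-process sketch does not cover.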

2. Deployment Service

Container deployment
Scaling requests

3. Scheduler Service

Node selection
Resource allocation

4. Migration Service

Container relocation
State transfer

5. Network Service

Overlay networking
Service discovery

6. Cluster Manager

Cluster state management
Node registration

7. Health Monitoring Service

Metrics collection
Alerting

8. Auto-Healing Service

Restart failed containers
Reschedule workloads
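
As a rough sketch of the auto-healing loop, the service can treat a node as failed once its heartbeat is older than a timeout and then move that node's containers elsewhere. The function names, the 30-second timeout, and the round-robin placement are assumptions for illustration.

```python
def find_dead_nodes(last_heartbeat: dict[str, float], now: float, timeout: float) -> set[str]:
    """Return nodes whose most recent heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def reschedule(placements: dict[str, str], dead: set[str], healthy: list[str]) -> dict[str, str]:
    """Move containers off dead nodes, spreading them round-robin over healthy ones."""
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    new_placements = dict(placements)
    orphans = sorted(c for c, node in placements.items() if node in dead)
    for i, container_id in enumerate(orphans):
        new_placements[container_id] = healthy[i % len(healthy)]
    return new_placements


heartbeats = {"node-1": 100.0, "node-2": 70.0, "node-3": 98.0}
placements = {"ctr-1": "node-1", "ctr-2": "node-2", "ctr-3": "node-2"}

dead = find_dead_nodes(heartbeats, now=105.0, timeout=30.0)
healthy = sorted(set(heartbeats) - dead)
healed = reschedule(placements, dead, healthy)
print(dead, healed)
```

A real implementation would hand the orphaned containers back to the scheduler instead of placing them itself, so that resource checks are applied consistently.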

9. Image Registry Service

Image storage
Image versioning

10. Audit & Security Service

RBAC
Secrets management
Compliance logs

Microservice Interaction

                 +----------------+
                 | API Gateway    |
                 +-------+--------+
                         |
     ------------------------------------------------
     |              |              |                |
     v              v              v                v

Deployment   Migration      Network         Monitoring
 Service      Service       Service           Service
     |              |             |                |
     ------------------------------------------------
                         |
                         v

                  Scheduler
                         |
                         v

                 Cluster Manager
                         |
         ---------------------------------
         |               |               |
         v               v               v

      Node Agent     Node Agent      Node Agent
         |               |               |
         ---------------------------------
                         |
                         v

                Container Runtime
             (containerd / Docker)

Bottlenecks & Scaling

Scheduler Bottleneck

Use sharded schedulers.
Leader election via etcd.
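
One simple way to shard scheduling work is to hash a stable key to a shard number. The sketch below assumes sharding by container ID and a fixed shard count, neither of which the post specifies; the etcd leader election itself is not shown.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a scheduler shard using a hash that is stable across processes."""
    # Python's built-in hash() is randomised per process for strings,
    # so a cryptographic digest is used to get a repeatable result.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARDS = 4  # hypothetical number of scheduler shards
assignments = {cid: shard_for(cid, SHARDS) for cid in ("ctr-123", "ctr-124", "ctr-125")}
print(assignments)
```

Plain modulo hashing remaps most keys when the shard count changes. If shards are added or removed often, consistent hashing or sharding by cluster would limit that churn.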

Metrics Explosion

Store metrics in Cassandra.
Aggregate before persistence.
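
Aggregating before persistence means writing one summary row per container per time window instead of every raw sample. Here is a minimal sketch, assuming samples arrive as `(container_id, unix_timestamp, value)` tuples and a 60-second window; both are illustrative choices.

```python
from collections import defaultdict


def aggregate(samples, window_seconds=60):
    """Collapse (container_id, unix_ts, value) samples into per-window summaries."""
    buckets = defaultdict(list)
    for container_id, ts, value in samples:
        window_start = int(ts // window_seconds) * window_seconds
        buckets[(container_id, window_start)].append(value)
    return [
        {"container_id": cid, "window_start": start, "count": len(vals),
         "min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for (cid, start), vals in sorted(buckets.items())
    ]


samples = [
    ("ctr-123", 0, 10.0),
    ("ctr-123", 15, 20.0),
    ("ctr-123", 45, 30.0),
    ("ctr-123", 61, 40.0),
]
rows = aggregate(samples)
for row in rows:
    print(row)
```

Four samples become two rows here. With samples every few seconds across a million containers, the same reduction is what keeps the write volume manageable.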

Container Migration

Use incremental checkpoints.
Transfer only changed memory pages.
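
The idea of sending only changed pages can be illustrated by comparing per-page hashes between two checkpoints. This is a simplified sketch: real checkpoint tools typically rely on kernel dirty-page tracking rather than hashing, and the page size and function names are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; a common page size, assumed here


def page_hashes(memory: bytes) -> list[bytes]:
    """Hash each fixed-size page of a memory image."""
    return [hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(memory), PAGE_SIZE)]


def dirty_pages(previous: list[bytes], memory: bytes) -> dict[int, bytes]:
    """Return {page_index: page_bytes} for pages that changed since `previous`."""
    changed = {}
    for index, digest in enumerate(page_hashes(memory)):
        if index >= len(previous) or previous[index] != digest:
            changed[index] = memory[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
    return changed


def apply_pages(memory: bytes, pages: dict[int, bytes]) -> bytes:
    """Patch the target's copy of memory with the transferred pages."""
    buf = bytearray(memory)
    for index, data in pages.items():
        buf[index * PAGE_SIZE:index * PAGE_SIZE + len(data)] = data
    return bytes(buf)


source = bytearray(4 * PAGE_SIZE)        # four zeroed pages on the source node
target = bytes(source)                   # first checkpoint: full copy
baseline = page_hashes(bytes(source))

source[PAGE_SIZE + 10] = 0xFF            # the container writes to page 1
delta = dirty_pages(baseline, bytes(source))
target = apply_pages(target, delta)
print(sorted(delta), target == bytes(source))
```

Only one of the four pages crosses the network in the second round. Repeating the round while the container keeps running shrinks the final pause needed to copy the last dirty pages.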

Multi-Region Networking

Overlay network (VXLAN/WireGuard).
Global service discovery.

Cluster State

Store in etcd with Raft consensus.
Multiple replicas for HA.
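
Raft commits a write only when a majority of replicas accept it, so the replica count determines how many failures the state store survives. The helper names below are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum(replicas)


for n in (3, 4, 5, 7):
    print(f"{n} replicas: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Four replicas tolerate no more failures than three, which is why odd cluster sizes such as 3 or 5 are the usual choice.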

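This design follows the same broad pattern that orchestration platforms such as Kubernetes, Nomad, and Docker Swarm use to manage large-scale container deployments across clusters and networks: a control plane (API layer, scheduler, replicated cluster state) that drives an agent on every node.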
