Design a system for sorting large data sets
Designing a system for sorting large datasets requires breaking the problem down into distributed computing, I/O efficiency, fault tolerance, and scalability concerns. Let’s begin with the functional requirements (FR).
1. Users should be able to upload large datasets (possibly in GBs or TBs).
2. Allow users to specify:
Dataset to sort
Sort key(s)
Sort order (ascending / descending)
3. Support sorting based on one or multiple fields (e.g., sort by age, then name).
4. Should work even when data is too big to fit in memory.
5. Should use external/disk-based or distributed sorting (see the external merge sort sketch after this list).
6. If the dataset is too large, it should be partitioned and sorted in parallel (e.g., using MapReduce, Spark).
7. Users should be able to view the status of their sort jobs (queued, processing, completed, failed).
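To make requirements 4 and 5 concrete, here is a minimal single-node external merge sort sketch in Python: sort fixed-size chunks in memory, spill each sorted run to disk, then k-way merge the runs. File names, chunk size, and the single key column are illustrative assumptions (the real system sorts on multiple fields, and keys here are compared as strings for brevity):

import csv, heapq, os, tempfile

def external_sort_csv(input_path, output_path, key_field, chunk_size=1_000_000):
    """Sort a CSV that may not fit in memory: sort chunks, spill runs, k-way merge."""
    run_files = []
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                run_files.append(_spill_run(chunk, fieldnames, key_field))
                chunk = []
        if chunk:
            run_files.append(_spill_run(chunk, fieldnames, key_field))

    # k-way merge of the sorted runs; heapq.merge streams lazily, so memory stays bounded.
    readers = [csv.DictReader(open(p, newline="")) for p in run_files]
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for row in heapq.merge(*readers, key=lambda r: r[key_field]):
            writer.writerow(row)
    for p in run_files:
        os.remove(p)

def _spill_run(chunk, fieldnames, key_field):
    """Sort one in-memory chunk and write it to a temporary 'run' file on disk."""
    chunk.sort(key=lambda r: r[key_field])
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(chunk)
    return path

Requirement 6 is essentially the distributed version of the same idea: each worker produces sorted partitions, and a final merge (or range partitioning) yields the globally sorted output.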
Non-functional requirements (NFR)
1. System should be highly scalable to handle increasing data sizes (e.g., from GBs to TBs or more).
2. Sorting time should be optimized for large files (e.g., use external sort, in-memory sort when feasible, distributed sort using Spark/MapReduce).
3. Must support concurrent sort jobs from multiple users.
4. High availability is preferred, especially if it's a service used in data pipelines or by multiple users.
5. System should be fault tolerant. If a node crashes during a sort, the system should retry or reassign the sort task.
6. System should have strong data consistency. Ensure sorted output is always consistent and deterministic for the same input and sort config.
Estimations
Dataset size per job: 1 TB
Average record size: 1 KB
Total records: 1 TB / 1 KB = 1 billion records
Sort key: a 10-byte field in each record
Output: Full sorted file in the same format, written back to disk/cloud
1. I/O Cost: Read + Write
Reading 1 TB of input:
1 TB / (10 nodes × 100 MB/s ≈ 1 GB/s aggregate) ≈ 1,000 seconds (~17 minutes)
Writing 1 TB of output:
Same as read: ~1,000 seconds
Total disk I/O time ≈ 2,000 seconds (~33 minutes)
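A quick back-of-the-envelope check of these numbers (assuming 10 nodes at 100 MB/s each, decimal units):

# I/O estimate: 1 TB read + 1 TB written across 10 nodes at 100 MB/s each.
DATA_BYTES = 1_000_000_000_000         # 1 TB
NODE_THROUGHPUT = 100 * 1_000_000      # 100 MB/s per node
NODES = 10

aggregate = NODES * NODE_THROUGHPUT    # 1 GB/s aggregate bandwidth
read_seconds = DATA_BYTES / aggregate  # ~1,000 s to read 1 TB
total_seconds = 2 * read_seconds       # read + write ≈ 2,000 s (~33 minutes)
print(read_seconds, total_seconds)     # 1000.0 2000.0

This ignores CPU time for comparisons and any intermediate spill I/O from the external sort, so it is a lower bound on end-to-end job time.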
APIs
1. POST /datasets/upload
Request:
Headers:
Authorization, Content-Type: multipart/form-data
Body:
File: dataset file (CSV, JSON, etc.)
Metadata:
request {
"name": "orders_dataset.csv",
"format": "csv",
"description": "Raw order data"
}
response {
"datasetId": "ds_12345",
"status": "uploaded"
}
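A client-side sketch of the upload call using Python requests (the base URL, token, and multipart field names are placeholders/assumptions, not fixed by this API):

import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}  # placeholder credential

with open("orders_dataset.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/datasets/upload",
        headers=headers,
        files={"file": f},                     # assumed multipart field name
        data={"name": "orders_dataset.csv", "format": "csv",
              "description": "Raw order data"},
    )
dataset_id = resp.json()["datasetId"]          # e.g. "ds_12345"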
2. POST /sort-jobs
request {
"datasetId": "ds_12345",
"sortFields": [
{ "field": "amount", "order": "desc" },
{ "field": "timestamp", "order": "asc" }
],
"outputFormat": "csv"
}
response {
"jobId": "job_7890",
"status": "queued"
}
3. GET /sort-jobs/{jobId}
{
"jobId": "job_7890",
"datasetId": "ds_12345",
"status": "in_progress", // or: queued, failed, completed
"progress": 65,
"startedAt": "2025-04-04T10:15:00Z",
"estimatedCompletion": "2025-04-04T10:30:00Z"
}
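Putting the two calls together, a client might submit a sort job and then poll its status until it reaches a terminal state. A sketch (host, token, and polling interval are illustrative assumptions):

import time
import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}

# Submit the sort job for a previously uploaded dataset.
job = requests.post(f"{BASE_URL}/sort-jobs", headers=headers, json={
    "datasetId": "ds_12345",
    "sortFields": [
        {"field": "amount", "order": "desc"},
        {"field": "timestamp", "order": "asc"},
    ],
    "outputFormat": "csv",
}).json()

# Poll until the job completes or fails.
while True:
    status = requests.get(f"{BASE_URL}/sort-jobs/{job['jobId']}",
                          headers=headers).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)   # illustrative polling interval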
4. GET /sort-jobs
Description: List all sort jobs for a user.
Query Params:
status (optional): queued, in_progress, completed, failed
[
{
"jobId": "job_7890",
"datasetId": "ds_12345",
"status": "completed",
"createdAt": "2025-04-04T10:00:00Z"
},
...
]
5. GET /sort-jobs/{jobId}/download
Description: Download the result of a completed sort job.
Response:
Returns the sorted file in the specified format.
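Because the sorted output can itself be huge, the client should stream the download rather than load it into memory. A sketch (host and token are placeholders):

import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}

with requests.get(f"{BASE_URL}/sort-jobs/job_7890/download",
                  headers=headers, stream=True) as resp:
    with open("sorted.csv", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)                   # write 1 MB chunks to disk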
6. DELETE /datasets/{datasetId}
{
"message": "Dataset deleted successfully"
}
Databases
For a large dataset sorting system, we'll need multiple databases (or logical storage components) optimized for different workloads:
1. Metadata and Job Tracking → Relational DB (like PostgreSQL)
2. Large File Storage → Object Store (like S3/HDFS)
3. Job Queue/Execution Management → Distributed Task Queue (like Redis/RabbitMQ)
4. Optional: Logs and metrics → Time-series DB or log system (like Prometheus, ELK, etc.)
2. Object Store – For Dataset and Output Files
Could use:
AWS S3
GCP Cloud Storage
HDFS / MinIO / Ceph (for on-prem)
Naming Convention
Uploaded dataset path:
s3://bucket-name/datasets/{user_id}/{dataset_id}.csv
Sorted output path:
s3://bucket-name/outputs/{job_id}/sorted.csv
You store just the URI in the relational DB for mapping.
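A small helper that encodes this naming convention (the bucket name is a placeholder):

BUCKET = "bucket-name"   # placeholder bucket

def dataset_uri(user_id: str, dataset_id: str, fmt: str = "csv") -> str:
    """S3 URI for an uploaded dataset, following the convention above."""
    return f"s3://{BUCKET}/datasets/{user_id}/{dataset_id}.{fmt}"

def output_uri(job_id: str, fmt: str = "csv") -> str:
    """S3 URI for a sort job's output file."""
    return f"s3://{BUCKET}/outputs/{job_id}/sorted.{fmt}"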
3. Redis / RabbitMQ / Kafka – Job Queue
Used for dispatching sort jobs to the worker nodes.
A message format could be:
{
"jobId": "job_7890",
"datasetPath": "s3://.../dataset.csv",
"sortFields": [{ "field": "amount", "order": "desc" }],
"outputPath": "s3://.../sorted.csv"
}
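With Redis as the queue, a worker loop might look like the following sketch. The queue name, the run_sort_job helper, and the requeue-on-failure policy are assumptions; a production setup would likely use acknowledgements, visibility timeouts, or a dead-letter queue instead of a bare requeue:

import json
import redis   # redis-py

r = redis.Redis(host="localhost", port=6379)   # placeholder connection
QUEUE = "sort-jobs"                            # assumed Redis list used as the queue

while True:
    # Blocking pop: wait for the next job message published by the Sort Job Service.
    _, raw = r.blpop(QUEUE)
    msg = json.loads(raw)
    try:
        run_sort_job(msg)    # hypothetical: fetch datasetPath, sort, write outputPath
    except Exception:
        # Fault tolerance (NFR 5): push the message back so another worker can retry it.
        r.rpush(QUEUE, raw)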
4. (Optional) Logs / Monitoring DB
If you want to track performance, system health, and per-job metrics:
Time-series DB: Prometheus, InfluxDB
Log Store: ELK Stack (Elasticsearch), CloudWatch, etc.
1. Relational DB – Metadata and Job Tracking (PostgreSQL schema)
CREATE TABLE users (
id UUID PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE datasets (
id UUID PRIMARY KEY,
user_id UUID REFERENCES users(id),
name TEXT NOT NULL,
format TEXT NOT NULL, -- e.g., csv, json
size_bytes BIGINT NOT NULL,
storage_path TEXT NOT NULL, -- e.g., S3 or HDFS URI
status TEXT NOT NULL DEFAULT 'uploaded', -- uploaded, deleted
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE sort_jobs (
id UUID PRIMARY KEY,
dataset_id UUID REFERENCES datasets(id),
user_id UUID REFERENCES users(id),
status TEXT NOT NULL DEFAULT 'queued', -- queued, in_progress, completed, failed
sort_fields JSONB NOT NULL, -- [{"field": "amount", "order": "desc"}]
output_format TEXT NOT NULL, -- csv, json
output_path TEXT, -- URI to sorted result (S3/HDFS)
progress INTEGER DEFAULT 0, -- 0 to 100
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_user_id ON datasets(user_id);
CREATE INDEX idx_dataset_id ON sort_jobs(dataset_id);
CREATE INDEX idx_sort_jobs_user_id ON sort_jobs(user_id);
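For example, a worker updating its row in this schema so that GET /sort-jobs/{jobId} can report progress might look like the following sketch (psycopg2 shown; the connection DSN is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=sortsvc user=worker")   # placeholder DSN

def update_progress(job_id, progress, status="in_progress"):
    """Record job progress; sets started_at the first time it is called."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE sort_jobs
               SET progress = %s, status = %s,
                   started_at = COALESCE(started_at, CURRENT_TIMESTAMP)
               WHERE id = %s""",
            (progress, status, job_id),
        )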
Microservices Interaction Flow
Let’s walk through a typical end-to-end flow:
🧵 Step 1: Upload Dataset
Client → API Gateway → Dataset Service
Dataset Service uploads file via Storage Service to S3/HDFS and stores metadata in DB.
Responds with datasetId.
🧵 Step 2: Create Sort Job
Client → API Gateway → Sort Job Service
Sort Job Service:
Validates dataset exists (via Dataset Service or shared DB)
Saves job metadata to DB
Publishes job to Scheduler/Queue
Returns jobId.
🧵 Step 3: Job Execution
Worker Service subscribes to queue (e.g., Redis, Kafka).
Pulls sort job → fetches file from Storage Service (S3/HDFS).
Performs sorting (can use Spark, local external sort, etc. — see the sketch after this step).
Stores result to S3/HDFS via Storage Service.
Updates job status in Sort Job Service DB.
Sends a notification to user via Notification Service.
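If the worker delegates the distributed sort to Spark, the core of the job is short. A PySpark sketch (paths, app name, and schema inference are placeholders/assumptions; the sort fields match the earlier example sortFields):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sort-job_7890").getOrCreate()

# Read the dataset from object storage (path comes from the queue message).
df = spark.read.csv("s3a://bucket-name/datasets/.../ds_12345.csv",
                    header=True, inferSchema=True)

# Global sort across the cluster on the requested fields.
sorted_df = df.orderBy(F.col("amount").desc(), F.col("timestamp").asc())

# Write the sorted result back to object storage as the job output.
sorted_df.write.csv("s3a://bucket-name/outputs/job_7890/", header=True)

Spark handles the partitioning, per-partition sorting, and shuffle needed for a global order, which is what FR 6 asks for without the worker implementing the merge itself.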
🧵 Step 4: Check Job Status / Download
Client → API Gateway → Sort Job Service
Sort Job Service returns status, progress, and download link.
[ Client ]
|
v
[ API Gateway ]
|
┌───┼────────────┬─────────────┬──────────────┬────────────┐
│ v v v v │
│ [UserSvc] [DatasetSvc] [SortJobSvc] [StorageSvc] │
└───┬────────────┬─────────────┴──────────────┬────────────┘
v v v
[ Scheduler/Queue ] ←→ [ WorkerSvc ] ←→ [StorageSvc (S3/HDFS)]
|
v
[NotificationSvc]