Design a system for sorting large data sets
Designing a system for sorting large datasets requires breaking the problem down into distributed computing, I/O efficiency, fault tolerance, and scalability concerns. Let’s begin with the functional requirements (FR).
1. Users should be able to upload large datasets (possibly in GBs or TBs).
2. Allow users to specify:
Dataset to sort
Sort key(s)
Sort order (ascending / descending)
3. Support sorting based on one or multiple fields (e.g., sort by age, then name).
4. Should work even when data is too big to fit in memory.
5. Should use external/disk-based or distributed sorting (see the external merge sort sketch after this list).
6. If the dataset is too large, it should be partitioned and sorted in parallel (e.g., using MapReduce, Spark).
7. Users should be able to view the status of their sort jobs (queued, processing, completed, failed).
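To make requirements 4 and 5 concrete, here is a minimal single-node external merge sort sketch in Python: sort fixed-size chunks in memory, spill each sorted run to disk, then k-way merge the runs. File names, chunk size, and the single key column are illustrative assumptions (the real system sorts on multiple fields, and keys here are compared as strings for brevity):

import csv, heapq, os, tempfile

def external_sort_csv(input_path, output_path, key_field, chunk_size=1_000_000):
    """Sort a CSV that may not fit in memory: sort chunks, spill runs, k-way merge."""
    run_files = []
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                run_files.append(_spill_run(chunk, fieldnames, key_field))
                chunk = []
        if chunk:
            run_files.append(_spill_run(chunk, fieldnames, key_field))

    # k-way merge of the sorted runs; heapq.merge streams lazily, so memory stays bounded.
    readers = [csv.DictReader(open(p, newline="")) for p in run_files]
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for row in heapq.merge(*readers, key=lambda r: r[key_field]):
            writer.writerow(row)
    for p in run_files:
        os.remove(p)

def _spill_run(chunk, fieldnames, key_field):
    """Sort one in-memory chunk and write it to a temporary 'run' file on disk."""
    chunk.sort(key=lambda r: r[key_field])
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(chunk)
    return path

Requirement 6 is essentially the distributed version of the same idea: each worker produces sorted partitions, and a final merge (or range partitioning) yields the globally sorted output.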
Non-functional requirements (NFR)
1. System should be highly scalable to handle increasing data sizes (e.g., from GBs to TBs or more).
2. Sorting time should be optimized for large files (e.g., use external sort, in-memory sort when feasible, distributed sort using Spark/MapReduce).
3. Must support concurrent sort jobs from multiple users.
4. High availability is preferred, especially if it's a service used in data pipelines or by multiple users.
5. System should be fault tolerant. If a node crashes during a sort, the system should retry or reassign the sort task.
6. System should have strong data consistency. Ensure sorted output is always consistent and deterministic for the same input and sort config.
Estimations
Dataset size per job: 1 TB
Average record size: 1 KB
Total records: 1 TB / 1 KB = 1 billion records
Sort key: a 10-byte field in each record
Output: Full sorted file in the same format, written back to disk/cloud
1. I/O Cost: Read + Write
Reading 1 TB of input:
1 TB / (10 nodes × 100 MB/s ≈ 1 GB/s aggregate) ≈ 1,000 seconds (~17 minutes)
Writing 1 TB of output:
Same as read: ~1,000 seconds
Total disk I/O time ≈ 2,000 seconds (~33 minutes)
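A quick back-of-the-envelope check of these numbers (assuming 10 nodes at 100 MB/s each, decimal units):

# I/O estimate: 1 TB read + 1 TB written across 10 nodes at 100 MB/s each.
DATA_BYTES = 1_000_000_000_000         # 1 TB
NODE_THROUGHPUT = 100 * 1_000_000      # 100 MB/s per node
NODES = 10

aggregate = NODES * NODE_THROUGHPUT    # 1 GB/s aggregate bandwidth
read_seconds = DATA_BYTES / aggregate  # ~1,000 s to read 1 TB
total_seconds = 2 * read_seconds       # read + write ≈ 2,000 s (~33 minutes)
print(read_seconds, total_seconds)     # 1000.0 2000.0

This ignores CPU time for comparisons and any intermediate spill I/O from the external sort, so it is a lower bound on end-to-end job time.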
APIs
1. POST /datasets/upload
Request:
Headers:
Authorization, Content-Type: multipart/form-data
Body:
File: dataset file (CSV, JSON, etc.)
Metadata:
request {
"name": "orders_dataset.csv",
"format": "csv",
"description": "Raw order data"
}
response {
"datasetId": "ds_12345",
"status": "uploaded"
}
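A client-side sketch of the upload call using Python requests (the base URL, token, and multipart field names are placeholders/assumptions, not fixed by this API):

import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}  # placeholder credential

with open("orders_dataset.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/datasets/upload",
        headers=headers,
        files={"file": f},                     # assumed multipart field name
        data={"name": "orders_dataset.csv", "format": "csv",
              "description": "Raw order data"},
    )
dataset_id = resp.json()["datasetId"]          # e.g. "ds_12345"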
2. POST /sort-jobs
request {
"datasetId": "ds_12345",
"sortFields": [
{ "field": "amount", "order": "desc" },
{ "field": "timestamp", "order": "asc" }
],
"outputFormat": "csv"
}
response {
"jobId": "job_7890",
"status": "queued"
}
3. GET /sort-jobs/{jobId}
{
"jobId": "job_7890",
"datasetId": "ds_12345",
"status": "in_progress", // or: queued, failed, completed
"progress": 65,
"startedAt": "2025-04-04T10:15:00Z",
"estimatedCompletion": "2025-04-04T10:30:00Z"
}
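Putting the two calls together, a client might submit a sort job and then poll its status until it reaches a terminal state. A sketch (host, token, and polling interval are illustrative assumptions):

import time
import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}

# Submit the sort job for a previously uploaded dataset.
job = requests.post(f"{BASE_URL}/sort-jobs", headers=headers, json={
    "datasetId": "ds_12345",
    "sortFields": [
        {"field": "amount", "order": "desc"},
        {"field": "timestamp", "order": "asc"},
    ],
    "outputFormat": "csv",
}).json()

# Poll until the job completes or fails.
while True:
    status = requests.get(f"{BASE_URL}/sort-jobs/{job['jobId']}",
                          headers=headers).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)   # illustrative polling interval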
4. GET /sort-jobs
Description: List all sort jobs for a user.
Query Params:
status (optional): queued, in_progress, completed, failed
[
{
"jobId": "job_7890",
"datasetId": "ds_12345",
"status": "completed",
"createdAt": "2025-04-04T10:00:00Z"
},
...
]
5. GET /sort-jobs/{jobId}/download
Description: Download the result of a completed sort job.
Response:
Returns the sorted file in the specified format.
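Because the sorted output can itself be huge, the client should stream the download rather than load it into memory. A sketch (host and token are placeholders):

import requests

BASE_URL = "https://sort.example.com"          # hypothetical host
headers = {"Authorization": "Bearer <token>"}

with requests.get(f"{BASE_URL}/sort-jobs/job_7890/download",
                  headers=headers, stream=True) as resp:
    with open("sorted.csv", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)                   # write 1 MB chunks to disk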
6. DELETE /datasets/{datasetId}
{
"message": "Dataset deleted successfully"
}
Databases
For a large dataset sorting system, we'll need multiple databases (or logical storage components) optimized for different workloads:
1. Metadata and Job Tracking → Relational DB (like PostgreSQL)
2. Large File Storage → Object Store (like S3/HDFS)
3. Job Queue/Execution Management → Distributed Task Queue (like Redis/RabbitMQ)
4. Optional: Logs and metrics → Time-series DB or log system (like Prometheus, ELK, etc.)
2. Object Store – For Dataset and Output Files
Could use:
AWS S3
GCP Cloud Storage
HDFS / MinIO / Ceph (for on-prem)
Naming Convention
Uploaded dataset path:
s3://bucket-name/datasets/{user_id}/{dataset_id}.csv
Sorted output path:
s3://bucket-name/outputs/{job_id}/sorted.csv
You store just the URI in the relational DB for mapping.
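A small helper that encodes this naming convention (the bucket name is a placeholder):

BUCKET = "bucket-name"   # placeholder bucket

def dataset_uri(user_id: str, dataset_id: str, fmt: str = "csv") -> str:
    """S3 URI for an uploaded dataset, following the convention above."""
    return f"s3://{BUCKET}/datasets/{user_id}/{dataset_id}.{fmt}"

def output_uri(job_id: str, fmt: str = "csv") -> str:
    """S3 URI for a sort job's output file."""
    return f"s3://{BUCKET}/outputs/{job_id}/sorted.{fmt}"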
3. Redis / RabbitMQ / Kafka – Job Queue
Used for dispatching sort jobs to the worker nodes.
A message format could be:
{
"jobId": "job_7890",
"datasetPath": "s3://.../dataset.csv",
"sortFields": [{ "field": "amount", "order": "desc" }],
"outputPath": "s3://.../sorted.csv"
}
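With Redis as the queue, a worker loop might look like the following sketch. The queue name, the run_sort_job helper, and the requeue-on-failure policy are assumptions; a production setup would likely use acknowledgements, visibility timeouts, or a dead-letter queue instead of a bare requeue:

import json
import redis   # redis-py

r = redis.Redis(host="localhost", port=6379)   # placeholder connection
QUEUE = "sort-jobs"                            # assumed Redis list used as the queue

while True:
    # Blocking pop: wait for the next job message published by the Sort Job Service.
    _, raw = r.blpop(QUEUE)
    msg = json.loads(raw)
    try:
        run_sort_job(msg)    # hypothetical: fetch datasetPath, sort, write outputPath
    except Exception:
        # Fault tolerance (NFR 5): push the message back so another worker can retry it.
        r.rpush(QUEUE, raw)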
4. (Optional) Logs / Monitoring DB
If you want to track performance, system health, and per-job metrics:
Time-series DB: Prometheus, InfluxDB
Log Store: ELK Stack (Elasticsearch), CloudWatch, etc.
1. Relational DB – Metadata and Job Tracking (PostgreSQL schema)
CREATE TABLE users (
id UUID PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE datasets (
id UUID PRIMARY KEY,
user_id UUID REFERENCES users(id),
name TEXT NOT NULL,
format TEXT NOT NULL, -- e.g., csv, json
size_bytes BIGINT NOT NULL,
storage_path TEXT NOT NULL, -- e.g., S3 or HDFS URI
status TEXT NOT NULL DEFAULT 'uploaded', -- uploaded, deleted
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE sort_jobs (
id UUID PRIMARY KEY,
dataset_id UUID REFERENCES datasets(id),
user_id UUID REFERENCES users(id),
status TEXT NOT NULL DEFAULT 'queued', -- queued, in_progress, completed, failed
sort_fields JSONB NOT NULL, -- [{"field": "amount", "order": "desc"}]
output_format TEXT NOT NULL, -- csv, json
output_path TEXT, -- URI to sorted result (S3/HDFS)
progress INTEGER DEFAULT 0, -- 0 to 100
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_user_id ON datasets(user_id);
CREATE INDEX idx_dataset_id ON sort_jobs(dataset_id);
CREATE INDEX idx_sort_jobs_user_id ON sort_jobs(user_id);
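For example, a worker updating its row in this schema so that GET /sort-jobs/{jobId} can report progress might look like the following sketch (psycopg2 shown; the connection DSN is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=sortsvc user=worker")   # placeholder DSN

def update_progress(job_id, progress, status="in_progress"):
    """Record job progress; sets started_at the first time it is called."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE sort_jobs
               SET progress = %s, status = %s,
                   started_at = COALESCE(started_at, CURRENT_TIMESTAMP)
               WHERE id = %s""",
            (progress, status, job_id),
        )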
Microservices Interaction Flow
Let’s walk through a typical end-to-end flow:
🧵 Step 1: Upload Dataset
Client → API Gateway → Dataset Service
Dataset Service uploads file via Storage Service to S3/HDFS and stores metadata in DB.
Responds with datasetId.
🧵 Step 2: Create Sort Job
Client → API Gateway → Sort Job Service
Sort Job Service:
Validates dataset exists (via Dataset Service or shared DB)
Saves job metadata to DB
Publishes job to Scheduler/Queue
Returns jobId.
🧵 Step 3: Job Execution
Worker Service subscribes to queue (e.g., Redis, Kafka).
Pulls sort job → fetches file from Storage Service (S3/HDFS).
Performs sorting (can use Spark, local external sort, etc. — see the sketch after this step).
Stores result to S3/HDFS via Storage Service.
Updates job status in Sort Job Service DB.
Sends a notification to user via Notification Service.
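If the worker delegates the distributed sort to Spark, the core of the job is short. A PySpark sketch (paths, app name, and schema inference are placeholders/assumptions; the sort fields match the earlier example sortFields):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sort-job_7890").getOrCreate()

# Read the dataset from object storage (path comes from the queue message).
df = spark.read.csv("s3a://bucket-name/datasets/.../ds_12345.csv",
                    header=True, inferSchema=True)

# Global sort across the cluster on the requested fields.
sorted_df = df.orderBy(F.col("amount").desc(), F.col("timestamp").asc())

# Write the sorted result back to object storage as the job output.
sorted_df.write.csv("s3a://bucket-name/outputs/job_7890/", header=True)

Spark handles the partitioning, per-partition sorting, and shuffle needed for a global order, which is what FR 6 asks for without the worker implementing the merge itself.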
🧵 Step 4: Check Job Status / Download
Client → API Gateway → Sort Job Service
Sort Job Service returns status, progress, and download link.
[ Client ]
|
v
[ API Gateway ]
|
┌───┼────────────┬─────────────┬──────────────┬────────────┐
│ v v v v │
│ [UserSvc] [DatasetSvc] [SortJobSvc] [StorageSvc] │
└───┬────────────┬─────────────┴──────────────┬────────────┘
v v v
[ Scheduler/Queue ] ←→ [ WorkerSvc ] ←→ [StorageSvc (S3/HDFS)]
|
v
[NotificationSvc]