Design a Trending Topics System with focus on: count-min sketch, sliding window, ranking

Breakdown of the functional requirements for a Trending Topics System with focus on count-min sketch, sliding window, and ranking:

Shashank Mishra

Oct 04, 2025

Let’s look into FR

1.System should ingest large-scale event streams (e.g., tweets, posts, hashtags, keywords).

2.Maintain approximate counts of occurrences of each topic using count-min sketch to reduce memory overhead.

3.Support queries like:

“How many times has topic X appeared in the last N minutes?”
“What are the top-K topics in the current window?”

4.Support time-based sliding windows (e.g., last 5 minutes, 1 hour) to determine currently trending rather than historically popular topics.

5.Rank topics based on frequency in the current sliding window.

6.Allow queries for trends at different time granularities:

e.g., top hashtags in last 10 min, 1 hour, 1 day.

NFR

1.New events should be reflected in trending results within a few seconds (sub-second ideal).

2.Handle high write throughput (millions of events per second globally).

3.Since count-min sketch is approximate, system must guarantee:

Upper bound on overestimation error (ε).
High probability of accuracy (1 − δ).

4.System should have high availability (99.9%+ SLA) even during peak traffic.

5.Eventual consistency acceptable (eventual consistency) since trending topics are approximate.

Back of envelope

Writes (Ingestion QPS)

Each incoming event = 1 write (update CMS + Top-K).

If ingestion rate = 1M events/sec, → Write QPS ≈ 1M.

Per shard (S = 10): 100k writes/sec.

Reads (Query QPS)

Typical queries:

Get top-K trending topics → reads Top-K heap (O(log K) or just cached list).

Get count of a topic → CMS lookup (O(d), ~7 reads).

Query traffic is usually much lower than ingestion (<< writes).

Assume ~10k queries/sec globally (common for API consumers, dashboards,

Storage

API’s

🔹 1. Ingest Event

Purpose: Push a new event (topic/keyword/hashtag) into the system.

POST /api/v1/events

{
  “topic”: “worldcup2025”,
  “userId”: “u12345”,
  “timestamp”: “2025-09-28T10:05:32Z”,
  “metadata”: {
    “lang”: “en”,
    “region”: “IN”,
    “source”: “twitter”
  }
}

response

{
  “status”: “success”,
  “message”: “Event ingested”,
  “topic”: “worldcup2025”
}

🔹 2. Get Count of a Topic (in window)

Purpose: Retrieve approximate frequency of a topic in a sliding window.

GET /api/v1/topics/{topic}/count?window=5m

Response

{
  “topic”: “worldcup2025”,
  “window”: “5m”,
  “count_estimate”: 128947,
  “error_bound”: “±0.1%”
}

🔹 3. Get Top-K Trending Topics

Purpose: Return top-K topics in a given time window.

GET /api/v1/trending?window=10m&top=5&region=IN

Response

{
  “window”: “10m”,
  “region”: “IN”,
  “top”: 5,
  “topics”: [
    { “topic”: “worldcup2025”, “count_estimate”: 1823947 },
    { “topic”: “newiphone”, “count_estimate”: 1593821 },
    { “topic”: “stockmarket”, “count_estimate”: 1293847 },
    { “topic”: “bigboss”, “count_estimate”: 1002834 },
    { “topic”: “climatechange”, “count_estimate”: 934827 }
  ],
  “error_bound”: “±0.1%”
}

🔹 4. Get Trend History

Purpose: Retrieve historical trend snapshots for a topic or for all topics.

GET /api/v1/topics/{topic}/history?window=1h&granularity=5m

Response

{
  “topic”: “worldcup2025”,
  “window”: “1h”,
  “granularity”: “5m”,
  “trend”: [
    { “timestamp”: “2025-09-28T09:00:00Z”, “count_estimate”: 18234 },
    { “timestamp”: “2025-09-28T09:05:00Z”, “count_estimate”: 20193 },
    { “timestamp”: “2025-09-28T09:10:00Z”, “count_estimate”: 21593 },
    { “timestamp”: “2025-09-28T09:15:00Z”, “count_estimate”: 22894 }
  ],
  “error_bound”: “±0.1%”
}

🔹 5. Admin/Monitoring APIs

Health Check

GET /api/v1/health

{
  “status”: “healthy”,
  “uptime_sec”: 102349,
  “ingestion_rate_eps”: 987654,
  “query_rate_qps”: 123
}

Metrics

GET /api/v1/metrics

{
  “cms_memory_mb”: 9.2,
  “topk_memory_mb”: 1.6,
  “window_size”: “60m”,
  “active_buckets”: 60,
  “error_bound”: “±0.1%”
}

Databases

🔹 Database Choices

We need to handle 3 different data types:

Fast-moving counters (CMS + Top-K)
- High write QPS (millions/sec).
- Memory-first data structure, backed by lightweight persistence.
- DB Role: Checkpoint & recovery.
- ✅ Recommended:
  - Redis / RocksDB for CMS + Top-K state persistence.
  - Or Kafka log + periodic snapshots to S3 (durability).
Raw Event Stream (optional retention)
- If you need to reprocess or audit raw events.
- ✅ Recommended:
  - Kafka / Pulsar for streaming ingestion.
  - S3 / HDFS for long-term storage.
Trend History (aggregated results)
- Lower write rate (per minute/hour).
- Queried frequently for dashboards & APIs.
- ✅ Recommended:
  - Time-series DB (TimescaleDB, ClickHouse, InfluxDB) → optimized for window queries.
  - Or Cassandra/DynamoDB for wide-column storage.

📌 SQL Schemas

1. Raw Events (optional, for audit/replay)

CREATE TABLE events (
    topic_id      VARCHAR(255) NOT NULL,
    event_time    TIMESTAMP NOT NULL,
    user_id       VARCHAR(255),
    region        VARCHAR(50),
    lang          VARCHAR(10),
    source        VARCHAR(50),
    payload       JSONB,
    PRIMARY KEY (topic_id, event_time)
);

2. Count-Min Sketch Buckets (serialized state)

CREATE TABLE cms_buckets (
    bucket_id     BIGINT NOT NULL,          -- epoch_minute
    shard_id      INT NOT NULL,
    width         INT NOT NULL,
    depth         INT NOT NULL,
    counters      BYTEA NOT NULL,           -- serialized 2D array
    updated_at    TIMESTAMP NOT NULL,
    PRIMARY KEY (shard_id, bucket_id)
);

3. Top-K Topics per Window

CREATE TABLE topk_topics (
    window_start   TIMESTAMP NOT NULL,
    window_end     TIMESTAMP NOT NULL,
    region         VARCHAR(50) NOT NULL,
    rank           INT NOT NULL,
    topic          VARCHAR(255) NOT NULL,
    count_estimate BIGINT NOT NULL,
    PRIMARY KEY (region, window_start, rank)
);

4. Topic Trend History (aggregated)

CREATE TABLE topic_trend_history (
    topic          VARCHAR(255) NOT NULL,
    window_start   TIMESTAMP NOT NULL,
    window_end     TIMESTAMP NOT NULL,
    region         VARCHAR(50) NOT NULL,
    count_estimate BIGINT NOT NULL,
    PRIMARY KEY (topic, window_start, region)
);

📌 Notes

events → optional, only if you want replay/audit.
cms_buckets → stores sliding window CMS snapshots (for recovery & merging).
topk_topics → serves /trending API.
topic_trend_history → serves /history API.

📌 Cassandra CQL Schemas

1. Raw Events (optional)

CREATE TABLE events (
    topic_id    TEXT,
    event_time  TIMESTAMP,
    user_id     TEXT,
    region      TEXT,
    lang        TEXT,
    source      TEXT,
    payload     TEXT,             -- store JSON string if needed
    PRIMARY KEY ((topic_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

🔹 Partitioned by topic_id → efficient per-topic queries.
🔹 Ordered by event_time → efficient time-range scans.

2. Count-Min Sketch Buckets

CREATE TABLE cms_buckets (
    shard_id    INT,
    bucket_id   BIGINT,           -- epoch_minute
    width       INT,
    depth       INT,
    counters    BLOB,             -- serialized CMS 2D array
    updated_at  TIMESTAMP,
    PRIMARY KEY ((shard_id), bucket_id)
) WITH CLUSTERING ORDER BY (bucket_id DESC);

🔹 Partitioned by shard_id.
🔹 Buckets ordered by time for sliding window eviction.

3. Top-K Topics per Window

CREATE TABLE topk_topics (
    region        TEXT,
    window_start  TIMESTAMP,
    rank          INT,
    topic         TEXT,
    count_estimate BIGINT,
    PRIMARY KEY ((region, window_start), rank)
);

🔹 Partitioned by (region, window_start) → efficient query:

SELECT * FROM topk_topics WHERE region=’IN’ AND window_start=’2025-09-28T10:00:00Z’;

4. Topic Trend History (Aggregates)

CREATE TABLE topic_trend_history (
    topic         TEXT,
    region        TEXT,
    window_start  TIMESTAMP,
    count_estimate BIGINT,
    PRIMARY KEY ((topic, region), window_start)
) WITH CLUSTERING ORDER BY (window_start DESC);

🔹 Partitioned by (topic, region).
🔹 Sorted by window_start for fast history scans.

HLD

📌 Microservices Architecture

1. Ingestion Service

Role: Receives raw events (tweets, posts, searches, etc.).
Tasks:
- Parse event.
- Publish to Kafka (or other event bus).
- Optional write to events table (Cassandra) for replay/audit.
Interfaces:
- POST /event → push single/bulk events.
Scale: Horizontally scalable, stateless.

2. Stream Processor

Role: Real-time consumer of Kafka events.
Tasks:
- Update Count-Min Sketch (Redis / in-memory CMS).
- Maintain sliding window buckets.
- Periodically compute Top-K per window.
- Write results into:
  - cms_buckets (Cassandra) for durability.
  - topk_topics (Cassandra) for serving queries.
Scale: Partitioned by topic_id / region to parallelize.

3. Trending Service

Role: API layer for clients (dashboards, apps).
Tasks:
- Fetch Top-K trending topics from Cassandra (topk_topics) or cache.
- Fetch topic history (topic_trend_history).
- Query current CMS for approximate count of a topic.
Interfaces:
- GET /trending?region=IN&window=1h
- GET /topics/{id}/history?region=IN&period=24h
- GET /topics/{id}/count?window=1h

4. Scheduler / Window Manager

Role: Orchestrates window sliding.
Tasks:
- Every X minutes: expire old buckets, trigger top-K recomputation.
- Insert new row in topk_topics.
Interfaces: Internal (cron / job).

5. Storage Layer

Cassandra → cms_buckets, topk_topics, topic_trend_history, events.
Redis / Memcached → Hot CMS state + caching trending queries.

6. Analytics / Dashboard Service (optional)

Role: Used by ops/analysts.
Tasks:
- Show trending topics globally or per region.
- Provide insights (growth rate, anomalies).
Interfaces: Internal dashboards.

📌 Interaction Flow (Step-by-step)

Client → Ingestion Service
- Sends new event (topic=”Elections”, user_id=123).
Ingestion Service → Kafka
- Publishes normalized event.
Stream Processor → Kafka
- Consumes events, updates CMS in Redis (fast path).
- Periodically flushes CMS snapshot → Cassandra (cms_buckets).
- Maintains Top-K heap → writes to topk_topics.
Trending Service → Cassandra / Redis
- Client queries /trending.
- Service fetches precomputed top-K from Cassandra or cached results in Redis.
Scheduler → Stream Processor + Storage
- Triggers eviction of expired buckets.
- Maintains sliding window freshness.

👉 Think of it like this:

Fast Path (real-time): Redis CMS + Top-K heap.
Durable Path (historical): Cassandra for queries and recovery.

HLD

Shashank’s Substack

Discussion about this post