Questions
Where do we store driver info (Driver → Vehicle)? Relational DB or NoSQL?
Driver location changes every 5 seconds, making this a write-heavy system. What is the best option for storing it?
A customer is shown a price for a ride but does not book immediately. By the time they click "Confirm ride", the price may have changed, and the ride can no longer be given at the old price. How do we detect and handle this?
Where do we store booking information? How do we handle UberPool? In some places and at some times there may be a surge in traffic, e.g. people leaving offices in the evening. Can we cache any data?
What data would we like to cache?
Answers
📦 1. Where to store the driver info? (Driver → Vehicle)
✅ Use: Relational Database
Why?
Driver and vehicle info are structured and relational (one-to-one or one-to-many).
Not updated frequently.
Strong consistency needed (e.g., license info, vehicle RC, insurance).
Table: Driver
- driver_id (PK)
- name
- license_number
- phone_number
Table: Vehicle
- vehicle_id (PK)
- driver_id (FK)
- registration_number
- model
- type (Sedan, SUV, etc.)
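The two tables above can be sketched in SQL. A minimal SQLite version for illustration (column types and constraints are assumptions; a production system would use MySQL/PostgreSQL):

```python
import sqlite3

# In-memory DB purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE Driver (
        driver_id      INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        license_number TEXT UNIQUE NOT NULL,
        phone_number   TEXT
    )
""")
conn.execute("""
    CREATE TABLE Vehicle (
        vehicle_id          INTEGER PRIMARY KEY,
        driver_id           INTEGER NOT NULL REFERENCES Driver(driver_id),
        registration_number TEXT UNIQUE NOT NULL,
        model               TEXT,
        type                TEXT  -- Sedan, SUV, etc.
    )
""")

conn.execute("INSERT INTO Driver VALUES (1, 'Asha', 'KA-123', '999')")
conn.execute("INSERT INTO Vehicle VALUES (10, 1, 'KA01AB1234', 'Swift', 'Sedan')")
```

The foreign key from Vehicle to Driver captures the one-to-many relationship directly, which is the main argument for a relational store here.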
Alternative: if the business grows to multiple regions with varying requirements, a document store (like MongoDB) can be considered for schema flexibility.
🛰️ 2. Driver Location updates every 5 seconds — Where to store?
✅ Use: In-memory store or Fast NoSQL DB
Why?
High write throughput: drivers send GPS data every few seconds.
Long-term durability is not needed; the data is short-lived.
Low latency reads for nearby driver lookup.
Recommended Options:
Redis / Memcached: for fast read/writes (TTL based).
Apache Cassandra / DynamoDB: scalable NoSQL DB with TTL and wide-column support for querying latest locations.
Key: driver_id
Value: {lat, lng, last_updated_timestamp}
TTL: 30s
For geo queries:
Redis Geo API (GEOADD, GEORADIUS or the newer GEOSEARCH)
MongoDB geospatial indexes (2dsphere)
PostGIS for a geo-aware RDBMS
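Whatever the store, the nearby-driver lookup reduces to a radius query over the latest locations. A pure-Python sketch of that query using the haversine distance (in a real deployment the geo store itself would do this, e.g. via Redis GEOSEARCH):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points."""
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def nearby_drivers(locations, lat, lng, radius_km):
    """locations: {driver_id: (lat, lng)} — the latest point per driver."""
    return [d for d, (dlat, dlng) in locations.items()
            if haversine_km(lat, lng, dlat, dlng) <= radius_km]
```

A linear scan like this only works for small fleets; geohash bucketing or the store's native geo index keeps the lookup sub-linear at scale.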
💰 3. Customer sees a price, delays booking — what if price changes later?
✅ Problem: Stale fare quote
Solution Options:
Quote Expiry Mechanism
When a quote is shown, it is stamped with a short expiry (e.g., 30–60s).
On "Confirm Ride", backend checks whether the quote is still valid.
If expired, backend recalculates the fare and prompts user to confirm again.
Fare Token
When generating the fare, a signed token is returned with fare + expiry:
{
  "fare": 210.25,
  "expires_at": "2025-06-29T15:10:00Z",
  "token": "signed_fare_token"
}
On booking, backend validates the token's integrity and expiry.
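One way to make the fare-token idea concrete is an HMAC-signed payload. A minimal sketch, where the secret key, field names, and TTL are illustrative assumptions:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"  # hypothetical key, kept server-side

def issue_fare_token(fare, ttl_seconds=60, now=None):
    """Return (token, signature) for a quoted fare with a short expiry."""
    now = time.time() if now is None else now
    payload = json.dumps({"fare": fare, "expires_at": now + ttl_seconds},
                         sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload.encode()).decode(), sig

def validate_fare_token(token, sig, now=None):
    """Return the quoted fare if the token is untampered and unexpired, else None."""
    now = time.time() if now is None else now
    payload = base64.urlsafe_b64decode(token).decode()
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered
    data = json.loads(payload)
    if now > data["expires_at"]:
        return None  # expired — recalculate and re-confirm
    return data["fare"]
```

Because the fare is inside the signed payload, the client cannot lower it, and the server needs no quote storage; expiry falls out of the timestamp check.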
Eventual Adjustment
Some systems allow small deviations but may auto-adjust fare post-trip if price was misaligned. This is less ideal for user trust.
🧾 4. Where do we store the booking information?
✅ Use: Relational DB or Partitioned NoSQL
Why?
Bookings are transactional and need ACID properties.
Relationships to users, drivers, pricing, trips.
Frequently queried for analytics, history, receipts.
Relational DB (e.g., PostgreSQL/MySQL):
Table: Booking
- booking_id (PK)
- customer_id (FK)
- driver_id (FK)
- pickup_location
- dropoff_location
- fare
- status (CREATED, ASSIGNED, STARTED, COMPLETED)
- timestamp
Scale tip: shard bookings by region or customer_id. For very high scale, use a hybrid model:
Metadata in RDBMS
Events/logs in Kafka
Full trip details archived to distributed stores like S3, HDFS
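The customer_id sharding mentioned above can be sketched as a stable hash-based router (the shard count is an illustrative assumption):

```python
import hashlib

def shard_for(customer_id, num_shards=8):
    """Map a customer_id to a shard index, stably across processes."""
    # sha256 rather than hash(): Python's hash() is salted per process,
    # so it would route the same customer to different shards on restart.
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Note that plain modulo sharding reshuffles most keys when `num_shards` changes; consistent hashing is the usual fix if resharding is expected.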
👥 5. How will you handle UberPool?
✅ Key Challenges:
Match riders with overlapping routes.
Optimize pickup/dropoff sequence.
Real-time routing and ETA recalculation.
Core Components:
Route Matching Engine:
Calculates candidate pairs with similar pickup/dropoff zones and time windows.
Uses geohashing + graph algorithms.
Trip Planner Service:
Determines optimal pickup/dropoff order (like TSP with constraints).
May use APIs like Google OR-Tools or custom Dijkstra/A*.
Pooling Scheduler:
Runs matching in intervals (e.g., every 5s).
Prioritizes matches based on max fill, shortest detour, SLA window.
Booking Schema Additions:
Table: PoolGroup
- group_id
- driver_id
- route
- capacity
Table: PoolBooking
- booking_id
- group_id
- pickup_slot
- dropoff_slot
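The Route Matching Engine's candidate-pairing step can be sketched naively. The distance thresholds and greedy pairing below are illustrative assumptions; a real engine would bucket by geohash and enforce detour/SLA constraints:

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

def _km(a, b):
    """Haversine distance in km between two (lat, lng) points."""
    (lat1, lng1), (lat2, lng2) = a, b
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def match_pool_candidates(requests, max_pickup_km=1.0, max_drop_km=1.5):
    """requests: {rider_id: (pickup, dropoff)}. Greedily pair riders whose
    pickups and dropoffs are both within the thresholds."""
    pairs, used = [], set()
    for (r1, (p1, d1)), (r2, (p2, d2)) in combinations(requests.items(), 2):
        if r1 in used or r2 in used:
            continue
        if _km(p1, p2) <= max_pickup_km and _km(d1, d2) <= max_drop_km:
            pairs.append((r1, r2))
            used.update((r1, r2))
    return pairs
```

The pooling scheduler would run something like this every few seconds over the unmatched request queue, then hand matched groups to the Trip Planner for pickup/dropoff ordering.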
📈 6. Traffic Surge (e.g., 6pm) — What can we cache?
✅ Yes, cache aggressively to reduce load and latency.
🔹 What to Cache?
Popular Routes' Fares:
E.g., Office to Metro Station at peak time.
Cache based on pickup-drop pair + time slot bucket.
Demand Heatmaps:
Geospatial driver/passenger density
Stored in Redis or similar (for surge decision logic)
Driver Availability Count per Zone:
Useful to detect surge or load balance.
Geohash → Zone Info:
Geohash mapping to regions or surge multipliers.
Surge Pricing Factors:
Precomputed surge for each zone, TTL-based cache.
E.g., a {zone_id → surge_factor} map, updated every few minutes.
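The TTL-based surge cache can be sketched as follows; the default multiplier and TTL are assumptions (in production this would typically live in Redis with per-key TTLs):

```python
import time

class SurgeCache:
    """TTL cache mapping zone_id -> surge_factor."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # zone_id -> (factor, stored_at)

    def set(self, zone_id, factor, now=None):
        self._store[zone_id] = (factor, time.time() if now is None else now)

    def get(self, zone_id, now=None):
        """Return the cached factor, or 1.0 (no surge) if missing or stale."""
        now = time.time() if now is None else now
        entry = self._store.get(zone_id)
        if entry is None or now - entry[1] > self.ttl:
            return 1.0
        return entry[0]
```

Falling back to 1.0 on a cache miss is a deliberate fail-safe: a stale or missing entry degrades to normal pricing rather than an arbitrary surge.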
| Requirement | Tech / Storage Recommendation |
| ---------------------------------- | --------------------------------------------- |
| Driver Info (Profile, Vehicle) | Relational DB (e.g., MySQL, PostgreSQL) |
| Driver Location (frequent updates) | Redis / Cassandra / DynamoDB (with TTL) |
| Fare Quote Validity | Fare Token with Expiry or Timestamp check |
| Booking Information | Relational DB (sharded), Kafka for events |
| UberPool | Route Matching + Trip Optimizer + Pool Engine |
| Traffic Surge Data Caching | Redis (Popular Fares, Heatmaps, Surge Zones) |
What is polyline and how does uber use it to store the driver location data points?
Yes. Storing the delta (the difference between consecutive coordinates) is a commonly used optimization, especially when dealing with polylines in Google Maps or GPS paths. It reduces the size of data for storage and transmission (e.g., over the network or in URL parameters).
✅ Where and Why We Use Deltas
1. Google Encoded Polyline Algorithm:
Google Maps uses delta encoding combined with variable-length encoding (Base64-like) to compress a series of coordinates.
Instead of storing absolute latitude and longitude, it stores the difference (delta) from the previous coordinate.
This works well because GPS coordinates typically don’t change drastically from one point to the next, so deltas are small numbers — ideal for compression.
Coordinates:
(38.5, -120.2)
(40.7, -120.95)
(43.252, -126.453)
Converted to:
Delta-encoded:
(38.5, -120.2)
(+2.2, -0.75)
(+2.552, -5.503)
These deltas are then:
Scaled by 1e5
Converted to integers
Transformed using variable-length encoding
Output as a single compact string like:
"_p~iF~ps|U_ulLnnqC_mqNvxq`@"
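The scale-shift-chunk steps above can be sketched in Python, following Google's published Encoded Polyline Algorithm Format:

```python
def encode_polyline(points):
    """Encode (lat, lng) pairs as a Google encoded polyline string."""
    out = []
    prev_lat = prev_lng = 0
    for lat, lng in points:
        # Scale by 1e5 and work with integer deltas from the previous point.
        ilat, ilng = round(lat * 1e5), round(lng * 1e5)
        for value, prev in ((ilat, prev_lat), (ilng, prev_lng)):
            delta = value - prev
            # Left-shift; invert if negative so the sign lives in the low bit.
            v = ~(delta << 1) if delta < 0 else (delta << 1)
            # Emit 5-bit chunks, LSB first; 0x20 marks "more chunks follow".
            while v >= 0x20:
                out.append(chr((0x20 | (v & 0x1F)) + 63))
                v >>= 5
            out.append(chr(v + 63))
        prev_lat, prev_lng = ilat, ilng
    return "".join(out)
```

Running this on the three coordinates above reproduces Google's documented example string.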
| Benefit | Description |
| -------------------------- | -------------------------------------------------------------------- |
| Compression | Deltas tend to be smaller numbers → better compression |
| Efficient for transmission | Especially useful in mobile or web apps where bandwidth is a concern |
| Used by Google APIs | Required when using Google Maps Directions API encoded polylines |
⚙️ Encoding/Decoding Tools
Google provides libraries or documentation for this:
Polyline Algorithm Description (Google Docs)
JavaScript/Java/Python libraries available to encode/decode polyline strings
🔁 Use-case Example
If you’re storing or transmitting lots of GPS paths (e.g., delivery routes, fitness tracking, fleet management), use:
Encoded polyline (compact string)
Store only the encoded string in DB
Decode it on client/UI when needed
For a vehicle tracking system, storing and handling polylines efficiently is critical, especially if you're tracking many vehicles in real time.
Let’s break down the best practices and design decisions around using deltas, polylines, and coordinate storage:
✅ 1. What You Need to Store
For each vehicle:
Vehicle ID
Timestamped location points:
(lat, lng, timestamp)
Optionally: speed, direction, trip/session ID
✅ 2. Why Delta-Encoding / Encoded Polylines Help
If you're storing GPS trails every few seconds, you’ll generate a lot of data.
Instead of storing each absolute (lat, lng), store the delta from the previous point, which can then be:
Compressed and encoded (e.g., Google Encoded Polyline)
Stored as a single string per route/trip
This can save roughly 50–80% of storage space.
✅ 3. Storage Strategy
🚗 Real-time Location Table (latest point)
Store current location in a fast, read-optimized DB:
vehicle_id | lat | lng | last_updated
-----------|---------|---------|--------------
V123 | 12.9716 | 77.5946 | 2025-06-30T23:40:00Z
Use: Redis / Cassandra / DynamoDB / PostgreSQL (if low scale)
📍 Historical Trail Table (polyline / raw points)
Option A: Encoded Polyline
trip_id | vehicle_id | start_time | encoded_polyline | path_size
-------- |------------|--------------------|-------------------------------|----------
T789 | V123 | 2025-06-30T10:00Z | "_p~iF~ps|U_ulLnnqC_mqNvxq`@" | 300 points
Use: Compress entire trip path using Google’s encoded polyline format
Use a decode function in frontend (or backend) to convert it back into points
Option B: Raw Delta Points
[
{ "dlat": 250, "dlng": 100 }, // in 1e-5 degrees
{ "dlat": -30, "dlng": 20 },
...
]
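Rebuilding absolute coordinates from raw deltas like those above is a running sum; a minimal sketch, assuming deltas are in 1e-5-degree units as shown:

```python
def decode_deltas(start, deltas):
    """Rebuild absolute (lat, lng) points from a start coordinate
    and a list of {"dlat": ..., "dlng": ...} deltas in 1e-5 degrees."""
    lat, lng = round(start[0] * 1e5), round(start[1] * 1e5)
    points = [start]
    for d in deltas:
        lat += d["dlat"]
        lng += d["dlng"]
        points.append((lat / 1e5, lng / 1e5))
    return points
```

Keeping the arithmetic in scaled integers and dividing only at the end avoids accumulating floating-point error over long trails.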
Good if you want fine-grained control (like partial decoding)
But still better to encode and compress the full path for long trips
✅ 4. When to Use Encoded Polyline vs Raw Coordinates
| Use-case | Recommended Format |
| --------------------------- | ------------------------- |
| Realtime location update | Raw `(lat, lng)` in Redis |
| Short trip logs (~1–5 min)  | Raw coordinates / deltas  |
| Full trip replay or archive | Encoded polyline string |
| Sending route to frontend | Encoded polyline |
✅ 5. Backend/Frontend Flow Example
Backend:
Receive streaming GPS from vehicle every 5s
Append to current trip (DB or in-memory buffer)
After every N points, compress and store as polyline
Frontend:
Fetch the encoded polyline via API (GET /vehicle/{id}/trip)
Decode it using google.maps.geometry.encoding.decodePath()
Draw the polyline on the map
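The backend's "append, then flush every N points" step can be sketched as a small buffer; the flush size and the encoder callable are illustrative assumptions:

```python
class TripBuffer:
    """Buffers streaming GPS points for one trip and flushes every n points.
    `encode` is any callable turning a list of (lat, lng) points into a
    string (e.g. a polyline encoder); `store` collects the flushed strings."""

    def __init__(self, n, encode, store):
        self.n, self.encode, self.store = n, encode, store
        self.points = []

    def add(self, lat, lng):
        self.points.append((lat, lng))
        if len(self.points) >= self.n:
            self.store.append(self.encode(self.points))
            self.points = []  # start the next segment
```

Buffering like this trades a little read-side complexity (a trip becomes several encoded segments) for far fewer writes than persisting every 5-second point individually.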