How does a global index work in databases?
A global index is an index that spans all the partitions or shards of a database table. It allows efficient querying across the entire dataset, regardless of how the data is partitioned.
What is a Global Index?
A global index is a database index that exists independently of the partitioning or sharding of the table it indexes.
It contains entries for data across all partitions or shards, making it possible to perform queries that are not constrained to a single partition or shard.
How Global Index Works
Centralized or Distributed Index Structure:
The global index may reside in a centralized node or be distributed across multiple nodes, depending on the database design.
Key Mapping:
The index maps the indexed column(s) to the primary key or row identifier of the data, allowing the database to quickly locate the desired records.
For example, if you have a table partitioned by region but frequently query by customer_id, a global index on customer_id allows you to find the record without scanning all partitions.
Maintaining the Index:
The database must keep the global index updated for every INSERT, UPDATE, or DELETE operation. This often involves additional overhead, as changes to data in one partition may require changes to the global index.
Query Execution:
When a query references the indexed column, the global index is consulted first to locate the matching records. The database then retrieves the data from the appropriate partitions or shards.
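The four steps above can be sketched with a toy in-memory model. This is only an illustration under simple assumptions, not how any particular database implements it: the class name PartitionedTable, its methods, and the sample rows are all hypothetical, and a real system would store the index on disk and usually spread it across nodes.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table partitioned by `region`, with a global index on `customer_id`."""

    def __init__(self):
        # Partition key (region) -> {primary key -> row}.
        self.partitions = defaultdict(dict)
        # Indexed value (customer_id) -> set of (region, primary key) locators.
        self.global_index = defaultdict(set)

    def insert(self, pk, row):
        region = row["region"]
        self.partitions[region][pk] = row
        # Index maintenance: every insert also writes an index entry.
        self.global_index[row["customer_id"]].add((region, pk))

    def delete(self, region, pk):
        row = self.partitions[region].pop(pk)
        locators = self.global_index[row["customer_id"]]
        # Index maintenance: the matching index entry must be removed too.
        locators.discard((region, pk))
        if not locators:
            del self.global_index[row["customer_id"]]

    def find_by_customer(self, customer_id):
        # Query execution: consult the index first, then fetch rows only
        # from the partitions the index points at.
        locators = self.global_index.get(customer_id, set())
        return [self.partitions[region][pk] for region, pk in sorted(locators)]


table = PartitionedTable()
table.insert(1, {"region": "eu", "customer_id": "c42", "total": 10})
table.insert(2, {"region": "us", "customer_id": "c42", "total": 25})
table.insert(3, {"region": "us", "customer_id": "c7", "total": 5})

matches = table.find_by_customer("c42")       # rows from both "eu" and "us"
table.delete("eu", 1)
after_delete = table.find_by_customer("c42")  # only the "us" row remains
```

The lookup never iterates over the partitions themselves; it touches only the rows the index names, which is the property that makes queries on a non-partition column cheap.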
Benefits of Global Index
Cross-Partition Query Optimization:
Allows efficient queries across all partitions, avoiding the need for full table or partition scans.
Improved Performance:
Speeds up queries on columns that are not part of the partition key.
Simplifies Application Logic:
Applications do not need to be aware of the underlying data partitioning when querying.
Use in Secondary Indexing:
Ideal for secondary indexes on columns frequently queried but not used as the primary partition key.
Challenges of Global Index
Write Overhead:
A write must update both the row's own partition and the corresponding index entries, which may live on different nodes. Keeping the two in step requires cross-partition coordination and increases write latency.
For example, adding or deleting a record in one partition also requires updating the global index.
Consistency Issues:
Ensuring the consistency of the global index with the underlying data can be challenging, especially in distributed databases.
If a partition fails or becomes unavailable, the global index may become inconsistent.
Single Point of Failure:
If the global index is centralized, it may become a bottleneck or single point of failure.
Distributed global indexes can mitigate this but add complexity to the system.
Storage Overhead:
Maintaining a global index requires additional storage for the index entries.
Complexity in Sharded Environments:
In sharded databases, global indexes may require cross-shard communication, increasing latency for writes and some queries.
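The write overhead and consistency challenges above can be seen in a small sketch. It assumes a hypothetical design in which the index is updated asynchronously after the data write; the class AsyncIndexedStore and its methods are invented for illustration, and systems that update the index synchronously trade this staleness for higher write latency instead.

```python
from collections import deque


class AsyncIndexedStore:
    """Toy store whose global index lags behind the data (hypothetical design)."""

    def __init__(self):
        self.rows = {}          # primary key -> row (stands in for the data shards)
        self.index = {}         # email -> primary key (stands in for the index shard)
        self.pending = deque()  # index updates that have not been applied yet

    def insert(self, pk, email):
        self.rows[pk] = {"email": email}
        # The index write is a second, separate step that may cross shards.
        self.pending.append((email, pk))

    def apply_pending(self):
        """Drain the queue, as a background index updater would."""
        applied = 0
        while self.pending:
            email, pk = self.pending.popleft()
            self.index[email] = pk
            applied += 1
        return applied

    def find_by_email(self, email):
        pk = self.index.get(email)
        return None if pk is None else self.rows[pk]


store = AsyncIndexedStore()
store.insert(1, "a@example.com")
stale = store.find_by_email("a@example.com")  # index not yet updated
store.apply_pending()
fresh = store.find_by_email("a@example.com")  # index has caught up
```

Between the insert and apply_pending, the row exists but is invisible through the index. That window is the inconsistency a distributed global index has to either tolerate or close with coordination.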
Global Index in Distributed Databases
Global indexes are particularly important in distributed or sharded databases, but they come with unique challenges. Here's how they are implemented in some popular distributed databases:
1. Apache Cassandra
Partitioned Indexing:
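Cassandra does not natively support global indexes due to its partitioned architecture. Instead, it supports local secondary indexes: each node indexes only the data it stores, so a query on the indexed column is efficient when it also restricts the partition key and otherwise has to fan out across nodes.
However, some external solutions (e.g., Elasticsearch) can provide global indexing functionality.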
2. Google Spanner
Spanner supports global secondary indexes, allowing efficient cross-partition queries.
The global index is managed as a separate table that is automatically updated with changes to the underlying table.
3. Amazon DynamoDB
Global Secondary Index (GSI):
DynamoDB offers global secondary indexes, which allow queries on attributes other than the primary partition key.
GSIs are partitioned and stored separately from the base table and are updated asynchronously, so reads from a GSI are eventually consistent and may briefly return stale results.
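As a rough sketch, a query against a GSI names the index explicitly. The helper below only builds the request parameters and does not call AWS; the table name Orders, the index name customer_id-index, and the helper build_gsi_query are assumptions made for illustration. The resulting dictionary is in the shape expected by the low-level boto3 client, for example boto3.client("dynamodb").query(**params).

```python
def build_gsi_query(table_name, index_name, key_attr, key_value):
    """Build low-level DynamoDB Query parameters that target a GSI.

    All names passed in are hypothetical; this helper makes no network call.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,  # routes the query to the GSI, not the base table
        "KeyConditionExpression": "#k = :v",
        "ExpressionAttributeNames": {"#k": key_attr},
        # The low-level API takes typed values; "S" marks a string.
        "ExpressionAttributeValues": {":v": {"S": key_value}},
        # ConsistentRead is left out on purpose: GSI reads are eventually consistent.
    }


params = build_gsi_query("Orders", "customer_id-index", "customer_id", "c42")
```

Without IndexName, the same key condition would be evaluated against the base table's primary key, so the attribute would have to be the table's own partition key.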
4. CockroachDB
CockroachDB supports global indexes, but they add a cost to writes, since a single write may need to update index entries stored on multiple nodes.
5. MongoDB (with Sharding)
MongoDB does not support true global indexes across shards. Queries must either include the shard key, so they can be routed to the relevant shards, or fall back to a scatter-gather query against all shards.
Use Cases for Global Indexes
Non-Partition Key Queries:
When queries involve columns that are not part of the partition key.
Example: Querying a user_email column in a table partitioned by region.
Multi-Tenant Applications:
When data is partitioned by tenant, but global queries are needed across tenants.
Analytics Queries:
For ad-hoc queries across large datasets in data warehouses.
Secondary Indexes in Distributed Databases:
When secondary attributes (other than the primary key) need to be indexed and queried efficiently.
Alternatives to Global Indexes
Local Indexes:
Use local secondary indexes for queries within a single partition.
Denormalization:
Store duplicate data in multiple partitions to avoid cross-partition queries.
Materialized Views:
Precompute query results and store them in a separate table or view for faster access.
Search Systems:
Use search engines like Elasticsearch or Solr for global search capabilities instead of relying on the database.
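To make the trade-off of the first alternative concrete, here is a minimal sketch of local indexes with a scatter-gather fallback. The class LocallyIndexedTable and the sample data are hypothetical. Each partition indexes only its own rows, so writes stay within one partition, but a query that does not name a partition has to probe all of them.

```python
from collections import defaultdict


class LocallyIndexedTable:
    """Toy table partitioned by `region`, with one local index per partition."""

    def __init__(self, regions):
        self.rows = {r: {} for r in regions}
        # One independent index per partition: user_email -> list of primary keys.
        self.local_index = {r: defaultdict(list) for r in regions}

    def insert(self, pk, row):
        region = row["region"]
        self.rows[region][pk] = row
        # The index write stays inside the same partition as the row.
        self.local_index[region][row["user_email"]].append(pk)

    def find_by_email(self, email, region=None):
        """Return (matching rows, number of partitions probed)."""
        # With a region the query is targeted; without one it is scatter-gather.
        targets = [region] if region is not None else list(self.rows)
        hits = []
        for r in targets:
            for pk in self.local_index[r].get(email, []):
                hits.append(self.rows[r][pk])
        return hits, len(targets)


table = LocallyIndexedTable(["eu", "us", "ap"])
table.insert(1, {"region": "us", "user_email": "a@example.com"})

scatter_hits, scatter_probes = table.find_by_email("a@example.com")
targeted_hits, targeted_probes = table.find_by_email("a@example.com", region="us")
```

Both calls find the same row, but the first probes every partition and the second probes one. A global index removes that fan-out at the cost of the extra write work described earlier.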
Conclusion
Global indexes provide powerful query capabilities across all partitions or shards in a database, making them essential for certain use cases. However, they introduce trade-offs in terms of performance, consistency, and complexity, especially in distributed systems. The choice to use a global index depends on the specific requirements and constraints of the application.


