Introduction
Welcome to this comprehensive exploration of indexes in relational databases. In this article, we will take an in-depth look at the world of indexes, covering various types, creation and modification techniques, indexing strategies, the impact of cardinality, downtime considerations, covering indexes, and index usage in partitioned tables. By the end of this journey, you will have a solid understanding of how indexes work and how to leverage their power to optimize your database systems.
Part 1: Introduction to Indexes. Indexes play a crucial role in optimizing query performance and accelerating data retrieval in relational databases.
In this section, we will delve into the significance of indexes, their relationship with data organization, and the fundamental principles that govern their functioning. We will explore how indexes facilitate faster data access, reduce disk I/O, and improve the overall efficiency of database operations.
B-Tree Index Underlying Data Structure: The B-Tree data structure serves as the foundation for most indexing mechanisms in relational databases. B-Trees efficiently organize and store large amounts of data, allowing for efficient data retrieval operations. Basically B tree is a self balancing m-ary tree. So, searching in a B-Tree is relatively cheaper than scanning through all the records sequentially. The image below shows a pictoral representation of a B-Tree
B+Tree Index A Variation for Enhanced Performance: Building upon the concepts of B-Trees, the B+Tree index offers further performance enhancements for relational databases. In B+Tree all the data are stored in the leaf nodes, so the data retrieval is much easiser. Wonder why most databases use B+ trees to store the indexes? Checkout this link to the youtube video https://youtu.be/09E-tVAUqQw. Also keep in mind there are a lot of other strategies to maintain the indexes which are out of scope of the article. If you are interested to learn more plz follow the this link.
Why an Index required?
As we know in Data structures and Algorithm terms the big O’h notation roughly determines the time taken by an algorithm or a piece of code to execute. Let’s say I have an array of elements(repeated). If I want to know the total number of occurrences of a particular element in an array, it’ll take me O(n) time i.e., liner time, because we are sequentially scanning through the array for simplicity sake I am assuming the array is not sorted. Let’s say if we have a hash-map to store all the unique keys in the array and the key corresponding to that will contain the count of the repeated elements. This way we can decrease the time required to get our results. As hash-map reduces the time complexity, here the Index reduces the time complexity for obtaining the data.
Scans, Navigating through Indexes: Index scans play a crucial role in efficiently retrieving data from indexes. We will explore optimization techniques for index scans, such as prefetching and caching, to further enhance query performance.
Index Scans and Index-Only Scans(Eliminating Table Lookups)
Index scan is a technique wherein the index maintains a pointer to the location where the particular row is present on the Disk so as to get a instant access to the data. If a data is cached in memory or in buffer pool then that is a whole another discussion (link to the discussion). In index scan only the index will be cached Index-only scans provide a significant performance boost by eliminating the need for table lookups during query execution. Index only scan is performed when the data you are looking for is entirely present as a part of index. Some database designers go with this approach to stripoff of millisecond latency of fetching data in some crucial small tables. An index-only scan is faster, but it’s not always available as an alternative to a regular index scan. It has two restrictions: the index type must support Index-Only Scans and (somewhat obviously) the query must only project columns included in the index. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Index only scans can be achieved by using a technique called covering indexes. More things to keep in mind is that a index is not only cerated on a single column but also on multiple columns. Read through the first and second link in in the references page to know more about them.
Index Creation: Have you ever wondered how indexes are built/created on a table which frequently changes? Creating an index involves analyzing the data and constructing the index structure to facilitate faster data retrieval. Creating indexes on large tables can be a resource-intensive operation that may require database downtime. I myself have tried this and database management system does take a lock on the table you are indexing for. So, how can we get over it? How can me guarantee 99.9999% of SLA for our database? Modern databases uses techniques called online indexing, incremental indexing and parallel indexing to achieve this. More on this in the next articles (DBs in a high performing and distributed environments.)
Modifying Indexes: Adding, Updating, and Deleting Entries. As data in the database changes, indexes need to be updated to maintain their accuracy and relevance. Let’s break down all these scenarios one by one. At first when there is a new data addition to the table, a worker gets active in background which when a transaction get’s committed, updates the index entries by added a new key to the index table(let’s assume a B+ tree), and adds the pointer to the location of the data present. While updating a row in a database doesn’t affect the index much but the pointer does get’s changes because the updation of data is noting but deactivating the previous record and adding a new record and marking it active in most of the databases. While deleting a record from a table does have a same effect as addition because the index structure get’s changed if you are using a B tree kind of a structure. Because deleting a node from a self balancing tree or even a tree could lead to multiple reassigns of pointers in many a places.
Impact of Cardinality on Indexes: Cardinality, representing the uniqueness of values in a column, has a significant impact on index performance. Index works best when the data/column we are indexing is discrete in nature that is, it should be as unique as possible so that the hash value which derives from the same will be different and will lead to a single pointer location on the disk, otherwise if there are lot of records pointing to a single index then after reaching the disk, you may need to involve one more index or need to scan the whole block to get the desired result. SwiggyBytes have mentioned the same in one of their articles explaining how one bad indexing costed them 1 hour of downtime. Link to the article.
Conclusion:
In this comprehensive exploration of indexes in relational databases, we have covered various aspects of indexing, from underlying data structures like B-Trees and B+Trees to index scans, index creation and modification techniques, and considerations for downtime and cardinality.
We have discussed the pros and cons of indexes, covering indexes. By understanding these concepts, you are equipped to make informed decisions about index design and implementation, optimizing the performance and efficiency of your database systems.
My next articles will cover some of the major design principles of databases, how do they perform in a distributed environment, more different kinds of databases etc… Keep crunching…………. Bye.