Deep dive into Storage Engine of DBMS Episode I(Paging, Buffer Pool Manager, Write Ahead Log)
Introduction:
Have you ever wondered, when you INSERT or UPDATE something in a database, where it gets stored and how it is retrieved so quickly and handed back to the application or user? In the realm of database systems, there exists a crucial but often overlooked component that powers the efficiency and performance of data management: the storage engine. Storage engines form the backbone of database systems, responsible for storing, retrieving, and manipulating vast amounts of information. Understanding their inner workings is key to unlocking the full potential of databases. In this article, we will embark on a journey to demystify storage engines, exploring their architecture, their functionality, and their impact on database performance.
What is a Storage Manager?
The storage manager is responsible for maintaining a database's files. Some storage managers do their own scheduling of reads and writes to improve the spatial and temporal locality of pages. The storage manager organizes the files as a collection of pages, tracks the data read from and written to those pages, and tracks the available space.
Part 1: Paging
A page is a fixed-size block of data that can contain anything from metadata and log records to indexes. Each page has its own unique identifier, which is used to locate that particular page in the cache.
In the context of databases, there are several kinds of cache: the underlying operating system has its own cache, and the database maintains its own as well (more about this in the Buffer Pool part of this article). There are essentially three different notions of a page in a database:
- Hardware Page (usually 4KB)
- OS Page (usually 4KB)
- Database Page (up to 16KB)
The hardware page is the smallest chunk of data that the storage device can guarantee to write out atomically. You might be wondering what atomically means, and why it matters. In a database management system, an atomic operation is an indivisible, irreducible series of actions in which either everything happens or nothing happens. Database pages serve as containers for storing data within the database: they hold a specific amount of information, such as rows or records from database tables. The size of a page is chosen based on factors like disk architecture and performance considerations. Pages play a crucial role in data retrieval and storage efficiency. When data is accessed from the database, the DBMS typically loads one or more pages from disk into memory; this process is known as disk I/O (input/output).
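As a rough illustration, here is a toy fixed-size page in Python. The layout (a small header holding a page id and a free-space offset, followed by appended records) and all names are hypothetical, not any real DBMS's on-disk format:

```python
import struct

PAGE_SIZE = 16 * 1024  # a common database page size (16KB)

class Page:
    """A fixed-size block of bytes identified by a page id (toy layout)."""
    HEADER = struct.Struct("<IH")  # page_id (4 bytes), free_offset (2 bytes)

    def __init__(self, page_id: int):
        self.page_id = page_id
        self.data = bytearray(PAGE_SIZE)   # the page is always exactly PAGE_SIZE
        self.free_offset = self.HEADER.size

    def insert(self, record: bytes) -> bool:
        """Append a record if it fits; return False when the page is full."""
        if self.free_offset + len(record) > PAGE_SIZE:
            return False
        self.data[self.free_offset:self.free_offset + len(record)] = record
        self.free_offset += len(record)
        return True

    def to_bytes(self) -> bytes:
        """Serialize header + payload; always a fixed-size block."""
        self.HEADER.pack_into(self.data, 0, self.page_id, self.free_offset)
        return bytes(self.data)
```

No matter how few records a page holds, it is read and written as one fixed-size unit; that is what makes page ids a simple way to address data on disk.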
Part 2: Buffer Pool Manager
The buffer pool is an area of main memory (RAM) in which the database caches table and index data as it is accessed. The buffer pool permits frequently used data to be served directly from memory, which speeds up repeated access to the same data.
The database system stores its data in persistent storage (disk), but repeatedly reading the same data from disk is expensive, so frequently accessed pages are cached in memory. When the storage manager requests a page that is already cached, instead of going to disk to fetch the data, the buffer pool returns the cached version of the page directly.
You might be wondering what happens when a page is cached in memory and two nearly concurrent transactions arrive, where the first transaction wants to update one of the records on the page and the second wants to read the same record. What happens then?
The answer depends on the database's implementation strategy. Broadly, there are two types of cache writes:
- Write-through cache: whenever an update request comes in, the update is first applied to the page in the cache, and the client then blocks synchronously while the same update is applied to disk. The advantage of this approach is that the data is durable and consistent across cache and disk; the major disadvantage is the increased number of I/O calls.
- Write-back cache: whenever an update request comes in, the update is applied only in the cache, and at some later point the cache writes the change back to disk. The advantage of this approach is fewer I/O calls, which significantly improves performance; the major disadvantage is that the data is temporarily inconsistent between cache and disk, so a crash can lose the un-flushed updates. More on this in the Write Ahead Log section of this article.
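The two strategies above can be sketched with a plain dict standing in for the disk; the class names and API here are hypothetical, purely for illustration:

```python
class WriteThroughCache:
    """Every update goes to the cache AND synchronously to 'disk' (a dict here)."""
    def __init__(self, disk: dict):
        self.disk = disk
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.disk[key] = value  # synchronous disk write: durable, but more I/O

class WriteBackCache:
    """Updates hit only the cache; dirty entries are flushed to disk later."""
    def __init__(self, disk: dict):
        self.disk = disk
        self.cache = {}
        self.dirty = set()

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)     # disk write deferred: fast, but a crash loses this

    def flush(self):
        for key in self.dirty:  # the deferred write-back step
            self.disk[key] = self.cache[key]
        self.dirty.clear()
```

With the write-back variant, reading `disk` between `write` and `flush` shows the inconsistency window the paragraph above describes.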
A page table keeps track of all the pages that are currently in memory. It also maintains a dirty flag for each page that has been updated in the cache, and it is responsible for keeping pages pinned in memory.

Why is pinning required? Under concurrent load, many queries come into the database, and they may access the same page or different ones. If all the frames in the buffer pool are full and one more page needs to be cached, and no pages are pinned, the buffer pool manager simply evicts some page from memory and puts the new page in its place. If the evicted page was being used by another thread, that eviction causes correctness and performance problems. To avoid this, whenever a thread reads a page it first pins it, which tells the manager that the page is currently being accessed by some query. If there is an empty frame in the buffer, a latch is taken before a new page is written into it, so that no other thread tries to write to the same frame.

In a real-world database there may exist more than one buffer pool. In that case, any incoming request is first hashed and then routed to a particular buffer pool, much like a reverse proxy such as nginx. The buffer pool manager can also prefetch pages from disk when a sequential scan is in progress on neighboring pages.
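Putting the page table, pin counts, and eviction together, here is a minimal buffer pool sketch in Python. It assumes simple LRU eviction (real systems use policies like CLOCK or LRU-K), and all names are hypothetical:

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: a fixed number of frames, pin counts, LRU eviction."""
    def __init__(self, capacity: int, disk: dict):
        self.capacity = capacity
        self.disk = disk             # page_id -> bytes, stands in for data files
        self.frames = OrderedDict()  # page_id -> page data, kept in LRU order
        self.pins = {}               # page_id -> pin count

    def fetch(self, page_id):
        if page_id in self.frames:               # cache hit: refresh LRU position
            self.frames.move_to_end(page_id)
        else:                                    # cache miss: may need eviction
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = self.disk[page_id]
        self.pins[page_id] = self.pins.get(page_id, 0) + 1  # pin while in use
        return self.frames[page_id]

    def unpin(self, page_id):
        self.pins[page_id] -= 1      # caller is done; page becomes evictable

    def _evict(self):
        for victim in self.frames:               # least recently used first
            if self.pins.get(victim, 0) == 0:    # never evict a pinned page
                del self.frames[victim]
                return
        raise RuntimeError("all frames are pinned")
```

A caller fetches a page, works with it, and then unpins it; the eviction loop shows why a pinned page can never be pushed out from under an active query.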
As we discussed earlier, the operating system also maintains its own cache. When you open a file, the OS tries to cache its pages in RAM. This can lead to duplicated caching: if the same page is cached by both the OS and the database, memory is simply wasted. For this reason, many databases bypass the OS cache by opening files with the O_DIRECT flag. (PostgreSQL, on the other hand, deliberately relies on the OS cache.)
Part 3: Write Ahead Log
Write-ahead logging (WAL) is a technique used in database management systems to ensure durability and recoverability of transactions. It involves recording changes to the database in a log file before they are applied to the actual data files.
Imagine you are a small kid and your mom or dad has asked you to do some tasks for the day. You first write the tasks down on a piece of paper and then eventually complete them, right? That is exactly what WAL does: in this analogy, the piece of paper is the WAL.
Here’s how write-ahead logging works:
- Before Modifications: The key principle of write-ahead logging is that the log records must be written to disk before the corresponding modifications are applied to the actual data files. This ensures that the log reflects the changes made by transactions and serves as a reliable record of the database’s state.
- Sequential Write: The log records are written sequentially to a dedicated log file, often stored on disk. Sequential writes are typically more efficient than random writes because they minimize disk seek times.
- Durability and Recoverability: The write-ahead log provides durability by ensuring that committed changes are persistently stored on disk, even in the event of a system crash or failure. It enables database recovery by allowing the system to reconstruct the database state from the log records during the recovery process.
- Checkpoints: To manage the size of the log file, periodic checkpoints may be performed. A checkpoint is a synchronization point where all modified data pages in memory are written to disk, and the corresponding log records are flushed from memory to the log file. This process allows for the truncation or recycling of old log records.
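The steps above can be sketched as a minimal WAL in Python. The record format and function names are hypothetical, and a real WAL also records before-images, LSNs, and checkpoint markers; this only shows the core rule of logging before modifying data:

```python
import os

class WriteAheadLog:
    """Minimal WAL: append a log record and fsync BEFORE applying the change."""
    def __init__(self, path: str):
        self.path = path
        self.log = open(path, "ab")   # sequential appends to a dedicated file

    def log_update(self, txn_id: str, key: str, new_value: str):
        record = f"{txn_id}\t{key}\t{new_value}\n".encode()
        self.log.write(record)        # 1. sequential append to the log
        self.log.flush()
        os.fsync(self.log.fileno())   # 2. force the record to stable storage

def apply_update(wal: WriteAheadLog, table: dict, txn_id, key, value):
    wal.log_update(txn_id, key, value)  # WAL rule: log first...
    table[key] = value                  # 3. ...only then modify the "data file"

def recover(path: str) -> dict:
    """Replay the log to rebuild the table state after a crash."""
    table = {}
    with open(path, "rb") as f:
        for line in f:
            txn_id, key, value = line.decode().rstrip("\n").split("\t")
            table[key] = value
    return table
```

If the process dies after `log_update` but before the in-memory table (or data page) is written out, `recover` can still reconstruct the committed state from the log alone, which is exactly the durability guarantee described above.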
You might ask what happens when a user COMMITs a transaction: will it be flushed to disk instantly?
The answer is no. The database does not guarantee that a committed transaction is flushed to disk instantly. Instead, the flushing of the WAL to disk can occur based on different strategies and configurations, which vary across database systems. Some common approaches are as follows.
Synchronous Commit:
With synchronous commit, the database system ensures that the WAL records are flushed to disk before confirming the transaction’s commit.
This guarantees that the changes made by the transaction are durably stored on disk, providing a higher level of reliability.
However, synchronous commit can introduce additional latency as the transaction must wait for the disk I/O operation to complete before being confirmed.
Asynchronous Commit:
In asynchronous commit, the database system may defer flushing the WAL to disk until after the transaction has been reported as committed.
The advantage of asynchronous commit is improved performance since the transaction can be acknowledged as committed without waiting for disk I/O.
However, there is a small window of time during which the changes made by the transaction are stored only in the WAL and not yet persisted on disk.
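The difference between the two commit modes can be sketched as follows. `CommitLog` and its methods are hypothetical, with `os.fsync` standing in for the "wait for disk" step:

```python
import os

class CommitLog:
    """Toy contrast of synchronous vs asynchronous commit acknowledgement."""
    def __init__(self, path: str):
        self.f = open(path, "ab")

    def commit_sync(self, record: bytes) -> str:
        self.f.write(record + b"\n")
        self.f.flush()
        os.fsync(self.f.fileno())    # wait for the disk: durable before we ack
        return "committed"           # client hears 'committed' only after fsync

    def commit_async(self, record: bytes) -> str:
        self.f.write(record + b"\n")  # buffered only; the fsync happens later
        return "committed"            # acked immediately: faster, small loss window

    def flush_wal(self):
        self.f.flush()
        os.fsync(self.f.fileno())     # background flush closes the loss window
```

The synchronous path pays an fsync on every commit; the asynchronous path batches fsyncs (here, in `flush_wal`) at the cost of the small window of potential loss described above.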
Conclusion:
In this part of our series on database internals, we explored the storage manager, the different types of pages, and the internal workings of the buffer pool manager and the write-ahead log. I have added the reference materials in the references section below; if you want to explore more of these topics, please go through the reference book [3]. In subsequent parts of this series, we will delve further into the intricacies of database internals, uncovering more fascinating aspects of indexing, query processing, and optimization, and we will include some cool commands and queries you can run to see these concepts being used in a database. Stay tuned for Episode 2!
References: