
File Systems

March 2021

The file system is a piece of OS-level software. Its goal is to store and manage data persistently, allowing for long-term storage and retrieval.

A file system partitions the disk into different types of blocks. The two main types are data blocks, which store the actual contents of files, and inode blocks, which store file metadata. The file system also maintains allocation structures for these two block types, called bitmaps, which track which data blocks and inodes are free. Furthermore, there is a superblock that contains file system metadata.

The most important structure in a file system is the inode. Each file in the system has an inode that tracks metadata about that file, including last accessed time, file size, file type, and most critically, the set of pointers to the actual contents of the file (some systems use multi-level indirect pointers to accommodate large files). Note that a directory is just a special type of file that has as its contents a list of (filename, inode number) pairs.
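
To make this concrete, here is a minimal sketch of these two structures in Python; the field names and sizes are illustrative, not taken from any real file system:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Inode:
        # Hypothetical, simplified inode layout; a real system packs this
        # into a fixed-size on-disk record.
        file_type: str                  # "file" or "dir"
        size: int                       # file size in bytes
        atime: int                      # last accessed time
        direct: List[int] = field(default_factory=list)  # block numbers of data blocks
        indirect: int = 0               # block number of an indirect pointer block (0 = none)

    @dataclass
    class DirEntry:
        # A directory's data blocks hold a list of these pairs.
        name: str
        inum: int                       # inode number of the named file or subdirectory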

Basic File System

Consider the file system described above. Suppose that the client wants to open and read the file /foo/bar. The access path for this operation is as follows:

  1. Get root inode number (well-known by the file system)
  2. Locate inode block containing the root inode
  3. Look inside the root inode to find the pointers to the contents of the root dir
  4. Look for a foo entry inside the root dir
  5. Get foo inode number
  6. Locate inode block containing the foo inode
  7. ...

The ellipsis (...) means that the process repeats itself --- pathname traversal is recursive in nature. The process continues until the inode of the target file has been found, at which point the inode is read into memory. This inode also gets updated (to reflect the new last accessed time).
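
A rough sketch of this traversal; read_inode and read_dir are hypothetical helpers standing in for the disk reads described above, and the directory entries are assumed to carry name and inum fields as in the earlier sketch:

    ROOT_INUM = 2   # well-known root inode number (2 in ext-style file systems)

    def lookup(path, read_inode, read_dir):
        # Resolve a path like "/foo/bar" one component at a time.
        # read_inode(inum) returns an inode; read_dir(inode) returns the
        # directory entries stored in that inode's data blocks. Each call
        # costs one or more disk reads.
        inode = read_inode(ROOT_INUM)                    # steps 1-3: fetch the root inode
        for name in [c for c in path.split("/") if c]:
            entries = read_dir(inode)                    # read the directory's contents
            match = next((e for e in entries if e.name == name), None)
            if match is None:
                raise FileNotFoundError(path)
            inode = read_inode(match.inum)               # descend one level and repeat
        return inode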

Notice the potentially large number of I/O operations involved in this sequence of activities. To combat this, file systems typically cache recently fetched blocks in memory.
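
For illustration only, such a cache can be as simple as an LRU map keyed by block number; this is a toy sketch, not how any particular OS implements its buffer cache, and read_block_from_disk is an assumed helper:

    from collections import OrderedDict

    class BlockCache:
        # Tiny LRU cache of recently read blocks, keyed by block number.
        def __init__(self, read_block_from_disk, capacity=1024):
            self.read_block_from_disk = read_block_from_disk
            self.capacity = capacity
            self.blocks = OrderedDict()

        def read(self, bno):
            if bno in self.blocks:
                self.blocks.move_to_end(bno)       # hit: no disk I/O needed
                return self.blocks[bno]
            data = self.read_block_from_disk(bno)  # miss: one disk read
            self.blocks[bno] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)    # evict the least recently used block
            return data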

This basic file system is simple, and it is operationally correct. However, it has very poor performance. It views the disk as one giant array of blocks. Data blocks and inode blocks are scattered everywhere, with no regard to locality. The next file system we introduce addresses this problem directly.

Fast File System

The Fast File System (FFS) exploits locality to improve performance by placing inodes and their data in close proximity. FFS adds one extra layer of indirection called block groups, which are simply groups of blocks. Each block group has its own data blocks, inode blocks, bitmaps, and superblock.

Blocks within the same group that are logically close to each other are also physically close to each other, which reduces the overhead incurred by the hardware when accessing blocks in the same group. In disk terms, this minimizes the seek time for multiple same-group requests.
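
Schematically, a block group is just a contiguous slice of the disk with its own bookkeeping; the group size and fields below are made up for illustration:

    from dataclasses import dataclass, field

    BLOCKS_PER_GROUP = 8192   # hypothetical group size

    @dataclass
    class BlockGroup:
        # Each group carries its own copy of the bookkeeping structures.
        superblock: bytes = b""
        inode_bitmap: bytes = b""
        data_bitmap: bytes = b""
        inode_blocks: list = field(default_factory=list)
        data_blocks: list = field(default_factory=list)

    def group_of(block_number):
        # Blocks in the same group occupy one contiguous region of the disk,
        # so accessing several of them involves little or no seeking.
        return block_number // BLOCKS_PER_GROUP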

FFS has its own policies for allocating files and directories. When allocating directories, FFS spreads them evenly among block groups by choosing a block group with a low number of directories and a high number of free inodes.

As for files, FFS always allocates them in the same block group as their parent directory. Furthermore, FFS ensures that both the inode blocks and the data blocks of each file are allocated to the same group. An exception is made for large files. The data blocks of a large file are spread evenly among block groups to prevent them from consuming all the data blocks within one block group.
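
A rough sketch of these placement policies; the GroupStats record and the round-robin spreading of large files are simplifications of what FFS actually does:

    from dataclasses import dataclass

    @dataclass
    class GroupStats:
        index: int
        num_dirs: int      # directories already placed in this group
        free_inodes: int   # free inodes remaining in this group

    def pick_group_for_directory(groups):
        # Simplified FFS-style heuristic: prefer a group with few directories
        # and many free inodes.
        return min(groups, key=lambda g: (g.num_dirs, -g.free_inodes))

    def pick_group_for_file(parent_group, chunk_index, groups, is_large):
        # Ordinary files go in the same group as their parent directory. For a
        # large file, successive chunks are spread round-robin across groups so
        # that one file cannot exhaust a single group's data blocks.
        if not is_large:
            return parent_group
        return groups[chunk_index % len(groups)]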

Crash-Consistency Problem

Recall that the goal of the file system is to store data persistently. This issue becomes tricky when we take into account the possibility of the computer shutting down unexpectedly. This can lead to problems like file system inconsistencies and corrupted data.

The go-to method for addressing this issue is called journaling. The core idea behind journaling is to write down all pending updates in a separate log, before actually performing them. Here is a summary of the steps in journaling:

  1. Data write: write data blocks into disk
  2. Journal write: write transaction begin block and metadata updates in journal
  3. Journal commit: write transaction commit block
  4. Checkpoint: perform on-disk metadata updates
  5. Free: mark the transaction as free in the journal so its space can be reused

If a crash occurs during or before (3.), then that transaction and its data are lost. If a crash occurs after (3.), then the file system can initiate a recovery process and replay all committed transactions.
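
To show the ordering, here is a schematic of the five steps; the disk and journal objects and their methods are hypothetical stand-ins for the on-disk structures, not any real file system's API:

    def commit_update(disk, journal, data_blocks, metadata_updates):
        # disk.write(block_no, data) writes a block in place; journal.begin()
        # opens a transaction with log/commit/free methods.

        # 1. Data write: put the data blocks in their final on-disk locations.
        for block_no, data in data_blocks:
            disk.write(block_no, data)

        # 2. Journal write: record the transaction-begin block and the pending
        #    metadata updates in the journal.
        tx = journal.begin()
        for block_no, data in metadata_updates:
            tx.log(block_no, data)

        # 3. Journal commit: the transaction only counts once this block is on disk.
        tx.commit()

        # 4. Checkpoint: apply the metadata updates to their home locations.
        for block_no, data in metadata_updates:
            disk.write(block_no, data)

        # 5. Free: mark the transaction's journal space as reusable.
        tx.free()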

Notice that journaling does not solve the problem of lost data, but it does prevent inconsistencies and garbage data from appearing in the file system. Also, to reduce the number of transactions written to the journal, multiple updates are usually batched and aggregated into a single global transaction.

Log-Structured File System

A log-structured file system (LFS) is a high-performance file system that takes advantage of the sequential bandwidth of disks. LFS buffers all updates in an in-memory segment and writes this segment out to disk only when the segment becomes full. Furthermore, segments are always written to an unused part of the disk. In other words, when a file is updated, LFS does not overwrite the file on disk. Instead, LFS writes a new version of that file to some free location on the disk --- the old version becomes obsolete.

How do you find the correct inode of a file, given that there are now multiple versions of that inode, all scattered throughout the disk? The solution is to use a data structure called an inode map (imap). The imap takes an inode number and returns the address of the most recent version of that inode.

The imap is updated in the same way as the inodes --- by writing the most recent version of the imap into the segment. Also, to improve performance, the imap is broken into chunks and updated on a per-chunk basis. The addresses of the most recent imap chunks are stored in a fixed place on disk called the checkpoint region. This checkpoint region is updated periodically.
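
Putting the pieces together, a lookup goes checkpoint region, then imap chunk, then inode; a toy sketch with hypothetical helpers and a made-up chunk size:

    INODES_PER_IMAP_CHUNK = 1024   # hypothetical chunk size

    def find_inode_address(inum, checkpoint_region, read_imap_chunk):
        # checkpoint_region is assumed to map a chunk index to the disk address
        # of the newest version of that imap chunk; read_imap_chunk(addr)
        # returns the chunk as a mapping from inode number to inode address.
        chunk_index = inum // INODES_PER_IMAP_CHUNK
        chunk_addr = checkpoint_region[chunk_index]    # fixed, well-known location
        imap_chunk = read_imap_chunk(chunk_addr)       # most recent version of this chunk
        return imap_chunk[inum]                        # address of the newest inode version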

One critical component of LFS is the garbage collector. The garbage collector finds all old versions of file structures (garbage) that are on disk and cleans them up so that the blocks become free again for subsequent writes. Garbage collection operates on a per-segment basis. When cleaning up a segment, the garbage collector determines which blocks in that segment are live and which are garbage. The live blocks are rewritten to disk (via a new segment), and the entire old segment is freed up.
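
A sketch of cleaning a single segment, with the liveness check and the segment object left abstract; is_live is an assumed helper that consults the imap to see whether a block is still the current version:

    def clean_segment(old_segment, is_live, append_to_current_segment):
        # Copy the still-live blocks into the segment currently being filled,
        # after which the whole old segment can be reused.
        live_blocks = [b for b in old_segment.blocks if is_live(b)]
        for block in live_blocks:
            append_to_current_segment(block)
        old_segment.mark_free()    # hypothetical method: the segment is free again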

All pending segment writes are logged in the checkpoint region (CR) of the file system. LFS maintains two versions of the CR and updates them alternately. If a system crash occurs during a CR write, the file system can initiate the recovery process using the other CR. If a system crash occurs during a segment write, LFS can recover by reading from the most recent CR. Note that both CRs are updated only periodically, meaning that the last few seconds of updates (those not yet recorded in a CR) may be lost.
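
One common way to detect a torn CR write is to stamp each CR with a timestamp at both its head and its tail and, on recovery, pick the newest copy whose two stamps match. A sketch under that assumption, with each CR modeled as a plain dict:

    def choose_checkpoint_region(cr_a, cr_b):
        # Pick the newest CR that was written completely. A crash in the middle
        # of a CR write leaves the head and tail timestamps unequal, so that
        # copy is ignored.
        candidates = [cr for cr in (cr_a, cr_b)
                      if cr is not None and cr["head_ts"] == cr["tail_ts"]]
        if not candidates:
            raise RuntimeError("no consistent checkpoint region found")
        return max(candidates, key=lambda cr: cr["head_ts"])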