On-Disk Data Structures


On-disk data structures store data persistently on hard drives or other storage media, so the data remains available for access and modification after a system restart or power loss. Because disks normally have longer access times and lower bandwidth than memory, these structures are designed to optimize how data is stored, retrieved, and manipulated on disk. This article covers the types of on-disk data structures, storage formats, data compression methods, indexing methods, sorting algorithms, performance considerations, and applications.

What are On-Disk Data Structures?

On-disk data structures describe how data is laid out on a physical storage medium, such as a hard disk drive (HDD) or a solid-state drive (SSD). They are designed to make the storage, retrieval, and management of data efficient.

When data is written to disk, it is generally organized into data structures that make storage and retrieval efficient. For instance, file systems organize data into files and directories, whereas databases arrange data into tables and indexes.

Types of On Disk Data Structures

On-disk data structures come in many varieties, most of them adapted from the general-purpose data structures of computer science.

Arrays

Arrays group together elements of the same type in consecutive locations, whether in memory or in a file. Arrays are easy to create and use, but because their size is fixed, growing or shrinking them is difficult.
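
When an array of fixed-size records is stored in a file, the byte offset of element i is simply i times the record size, so any element can be reached with a single seek. The following is a minimal sketch of that idea. The record layout (a 4-byte integer and an 8-byte float) and the function names are assumptions made for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import struct

# Hypothetical record layout: 4-byte signed int id, 8-byte float value,
# little-endian with no padding, so every record is exactly 12 bytes.
RECORD = struct.Struct("<id")

def write_records(f, records):
    """Write (id, value) records back to back, starting at offset 0."""
    f.seek(0)
    for rec in records:
        f.write(RECORD.pack(*rec))

def read_record(f, index):
    """Read record number `index` with one seek instead of a scan."""
    f.seek(index * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))

# io.BytesIO behaves like a file opened in binary mode.
buf = io.BytesIO()
write_records(buf, [(1, 0.5), (2, 1.5), (3, 2.5)])
third = read_record(buf, 2)
```

The same functions should work unchanged on a file opened with open(path, "r+b").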

Linked lists

Linked lists are collections of nodes, each of which holds data and a link to the next node. Nodes can be added or removed easily, so the size of a linked list can change dynamically.
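
On disk, the link to the next node is usually a file offset rather than a memory pointer. The sketch below illustrates this under assumed conventions: each node is a 4-byte value followed by an 8-byte offset, and an offset of -1 marks the end of the list. The names NODE, END, append_node, and walk are hypothetical.

```python
import io
import struct

NODE = struct.Struct("<iq")   # 4-byte value, 8-byte offset of the next node
END = -1                      # sentinel offset meaning "no next node"

def append_node(f, value, next_offset):
    """Write a node at the end of the file and return its offset."""
    offset = f.seek(0, io.SEEK_END)
    f.write(NODE.pack(value, next_offset))
    return offset

def walk(f, head):
    """Follow next-offsets from `head`, yielding each stored value."""
    offset = head
    while offset != END:
        f.seek(offset)
        value, offset = NODE.unpack(f.read(NODE.size))
        yield value

buf = io.BytesIO()            # stands in for a file opened in binary mode
head = END
for v in (30, 20, 10):        # each new node becomes the new head
    head = append_node(buf, v, head)
values = list(walk(buf, head))
```

Adding a node never moves existing data, but every step of a traversal may cost a separate seek.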

Trees

Trees are hierarchical data structures made up of nodes, each of which may have child nodes and a reference to its parent node. Trees are frequently used to represent hierarchical relationships in file systems, organizational charts, and biological taxonomies.

Hash tables

Hash tables use a hash function to map keys to indices in an array, where the corresponding values are stored. Hash tables allow fast data access, but performance may suffer as the table fills up or collisions occur.

Graphs

A graph is made up of nodes, which represent entities, and edges, which represent connections between nodes. Graphs are used to model complex systems such as social networks, transportation networks, and chemical processes.

Storage Formats

The choice of storage format affects the efficiency and performance of on-disk data structures. Popular storage formats include −

CSV (comma-separated values)

CSV is a plain-text format that stores tabular data as rows of values separated by commas. Many applications can read and write CSV, and it is commonly used for data exchange.

JSON (JavaScript Object Notation)

JSON is a lightweight data interchange format with a human-readable syntax that stores data as key-value pairs and ordered lists. Most programming languages can parse JSON readily, and it is common in web development.

XML (Extensible Markup Language)

XML is a markup language that stores data as nested elements and attributes. XML is often used for document processing and data exchange, but its syntax is verbose.
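
To make the trade-offs concrete, the sketch below serializes the same two records in all three formats using Python's standard library and records the size of each result. The sample rows and the function names are assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records.
rows = [{"id": 1, "name": "disk"}, {"id": 2, "name": "ssd"}]

def to_csv(rows):
    """One header line, then one comma-separated line per record."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "name"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

def to_json(rows):
    """A JSON array of objects; field names repeat in every record."""
    return json.dumps(rows)

def to_xml(rows):
    """One <row> element per record, with the fields as attributes."""
    root = ET.Element("rows")
    for row in rows:
        ET.SubElement(root, "row", id=str(row["id"]), name=row["name"])
    return ET.tostring(root, encoding="unicode")

# Character count of each serialized form.
sizes = {fmt.__name__: len(fmt(rows)) for fmt in (to_csv, to_json, to_xml)}
```

CSV is the most compact here because field names appear only once, in the header, while JSON and XML repeat them for every record but can represent nested data.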

Data Compression Techniques

Data compression reduces the size of data in order to save disk space and speed up data transfer. Common data compression methods include −

Run-length encoding (RLE)

RLE is a lossless compression method that replaces a run of repeated values with a single value and a count. RLE works well for data with long runs of repeated values, such as simple images with large areas of a single color.
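
A minimal sketch of run-length encoding over bytes might look like the following. The (count, value) pair representation and the function names are assumptions chosen for readability; a real format would also define how the pairs are packed into bytes.

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of equal bytes into a (count, value) pair."""
    return [(len(list(group)), value) for value, group in groupby(data)]

def rle_decode(pairs):
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)

encoded = rle_encode(b"aaabccdddd")
restored = rle_decode(encoded)
```

Data without long runs gains nothing from this scheme, since every byte then becomes its own pair.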

Huffman coding

Huffman coding is a lossless compression method that represents data using variable-length codes. It reduces the average code length by assigning shorter codes to more frequent values.
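
The code assignment can be sketched with a priority queue that repeatedly merges the two least frequent groups of symbols. This is an illustrative version only: the function name huffman_codes is hypothetical, and codes are returned as strings of "0" and "1" rather than packed bits.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in `text` to its bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                     # a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest groups
        w2, _, right = heapq.heappop(heap)
        # Merging puts the groups under a new parent: prefix one bit each.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# 'a' is the most frequent symbol, so it receives the shortest code.
codes = huffman_codes("aaaabbc")
encoded = "".join(codes[ch] for ch in "aaaabbc")
```

Because no code is a prefix of another, the encoded bit string can be decoded unambiguously without separators.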

Lempel-Ziv-Welch (LZW) compression

LZW is a lossless compression method that builds a dictionary of the patterns it has seen and replaces each repeated pattern with a reference to its dictionary entry. LZW is effective for compressing text and images.
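
The dictionary-building step can be sketched as follows. This version is a simplification: it emits a list of integer codes rather than a packed bit stream, and it lets the dictionary grow without limit, which practical implementations usually cap. The function names are hypothetical.

```python
def lzw_compress(data):
    """Return a list of integer codes for the bytes in `data`."""
    table = {bytes([i]): i for i in range(256)}   # all single bytes
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                   # keep extending the match
        else:
            out.append(table[current])            # emit the longest known match
            table[candidate] = len(table)         # learn the new pattern
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out

def lzw_decompress(codes):
    """Rebuild the original bytes, re-creating the same dictionary."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):                  # pattern defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
restored = lzw_decompress(codes)
```

The decoder never receives the dictionary; it reconstructs it from the codes, which is why only the code stream needs to be stored.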

Indexing Techniques

Indexing is the process of building data structures that provide quick access to particular data items by a search key. Popular indexing methods include −

B-tree

A B-tree is a balanced tree data structure that keeps keys in sorted order, so lookups, insertions, and deletions all take logarithmic time. B-trees are widely used in databases and file systems.
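
The following is an in-memory sketch of B-tree search and insertion, included only to show how nodes split to keep the tree balanced. It assumes the textbook formulation with a minimum degree t, where each node holds at most 2t - 1 keys. In an on-disk B-tree each node would typically occupy one disk page and the child references would be page numbers; deletion is omitted, and the class and method names are hypothetical.

```python
import bisect

class Node:
    def __init__(self, leaf=True):
        self.keys = []          # sorted keys stored in this node
        self.children = []      # len(keys) + 1 children unless leaf
        self.leaf = leaf

class BTree:
    def __init__(self, t=2):
        self.t = t              # minimum degree: at most 2t - 1 keys per node
        self.root = Node()

    def search(self, key, node=None):
        """Return True if `key` is stored in the tree."""
        if node is None:
            node = self.root
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if node.leaf:
            return False
        return self.search(key, node.children[i])

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:   # full root: grow upward
            new_root = Node(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        """Split the full child at index i, moving its median into parent."""
        t = self.t
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def _insert_nonfull(self, node, key):
        i = bisect.bisect_right(node.keys, key)
        if node.leaf:
            node.keys.insert(i, key)
            return
        if len(node.children[i].keys) == 2 * self.t - 1:
            self._split_child(node, i)
            if key > node.keys[i]:      # median moved up; pick the correct side
                i += 1
        self._insert_nonfull(node.children[i], key)

    def height(self):
        """Number of levels from the root down to the leaves."""
        levels, node = 1, self.root
        while not node.leaf:
            node = node.children[0]
            levels += 1
        return levels

tree = BTree(t=2)
for key in [10, 20, 5, 6, 12, 30, 7, 17]:
    tree.insert(key)
found = tree.search(12)
```

A larger t makes each node wider and the tree shallower, which matters on disk because every level visited may cost one page read.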

Hash indexing

Hash indexing uses a hash function to map search keys to positions in an array or table. It allows fast data access, but its performance may suffer as the table fills up or collisions occur.
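
One way to picture a hash index on disk is a file of fixed-size slots, where the hash of a key selects the slot to read. The sketch below makes several simplifying assumptions that are not taken from any particular system: integer keys and values, a fixed table of 8 slots, collisions resolved by probing the following slots (linear probing), and no deletion.

```python
import io
import struct
import zlib

SLOT = struct.Struct("<Bqq")   # used flag, 8-byte key, 8-byte value
NUM_SLOTS = 8                  # hypothetical fixed table size

def _home_slot(key):
    # crc32 serves here as a simple hash that is stable across runs.
    return zlib.crc32(struct.pack("<q", key)) % NUM_SLOTS

def create(f):
    """Initialise the file with NUM_SLOTS empty slots."""
    f.seek(0)
    f.write(SLOT.pack(0, 0, 0) * NUM_SLOTS)

def put(f, key, value):
    """Store or overwrite `key`, probing forward on collisions."""
    start = _home_slot(key)
    for i in range(NUM_SLOTS):
        pos = ((start + i) % NUM_SLOTS) * SLOT.size
        f.seek(pos)
        used, stored_key, _ = SLOT.unpack(f.read(SLOT.size))
        if not used or stored_key == key:
            f.seek(pos)
            f.write(SLOT.pack(1, key, value))
            return
    raise RuntimeError("table is full")

def get(f, key):
    """Return the value for `key`, or None if it is absent."""
    start = _home_slot(key)
    for i in range(NUM_SLOTS):
        f.seek(((start + i) % NUM_SLOTS) * SLOT.size)
        used, stored_key, value = SLOT.unpack(f.read(SLOT.size))
        if not used:
            return None        # an empty slot ends the probe sequence
        if stored_key == key:
            return value
    return None

index = io.BytesIO()           # stands in for a file opened in binary mode
create(index)
for k in range(100, 106):
    put(index, k, k * 2)
value = get(index, 103)
```

As more slots fill up, probe sequences get longer, which is the slowdown the paragraph above describes.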

Bitmap indexing

Bitmap indexing uses bit vectors to record which rows contain each value of an attribute. It is effective for queries that combine several attributes with Boolean operators.
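
The sketch below builds one bit vector per distinct attribute value and answers Boolean queries with bitwise operators. Python integers serve as the bit vectors, with bit i standing for row i. The sample rows and function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id      # set the bit for this row
    return index

def matching_rows(bitmap):
    """Return the row numbers whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical table with two low-cardinality attributes.
rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "M"},
    {"color": "green", "size": "M"},
]
color = build_bitmap_index(rows, "color")
size = build_bitmap_index(rows, "size")

red_and_medium = matching_rows(color["red"] & size["M"])
red_or_green = matching_rows(color["red"] | color["green"])
```

Each query touches only the bit vectors for the values involved, never the rows themselves.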

Sorting Algorithms

Sorting is the process of arranging data elements in a certain order, such as ascending or descending. Common sorting algorithms include −

Bubble sort

Bubble sort is a simple sorting algorithm that compares adjacent elements and swaps them if they are out of order. Because its running time grows quadratically with the input size, bubble sort performs poorly and is rarely used in practice.
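
A short sketch of bubble sort follows. It returns a sorted copy rather than sorting in place, and it stops early when a pass makes no swaps; both choices are illustrative.

```python
def bubble_sort(items):
    """Return a sorted copy of `items` using adjacent swaps."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:          # no swaps means the list is already sorted
            break
    return items

result = bubble_sort([5, 1, 4, 2, 8])
```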

Quick sort

Quick sort is a divide-and-conquer sorting algorithm. It partitions an array into two sub-arrays, one with elements smaller than a pivot and one with elements greater than the pivot, and then sorts each sub-array recursively. Quick sort performs well in the average case and is often used in practice.
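
The partitioning idea can be sketched in a few lines. This version builds new lists instead of partitioning in place, which is easier to read but uses more memory than the usual in-place variants, and it picks the middle element as the pivot as an arbitrary choice.

```python
def quick_sort(items):
    """Return a sorted copy of `items` by partitioning around a pivot."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)

result = quick_sort([7, 2, 9, 2, 5, 1])
```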

Merge sort

Merge sort is also a divide-and-conquer sorting algorithm. It splits an array into smaller sub-arrays, sorts them recursively, and then merges the sorted sub-arrays. Merge sort offers good worst-case performance and is often used in practice.
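
The merge step is also the basis of external sorting, where the data is too large to sort in memory at once. The sketch below shows one simple variant, not the recursive in-memory algorithm: it sorts small chunks in memory, writes each as a sorted run to a temporary file, and then merges the runs with heapq.merge. The run size of 3 and the one-integer-per-line file format are assumptions chosen to keep the example small.

```python
import heapq
import tempfile

def _write_run(sorted_values):
    """Write one sorted run to a temporary file, one integer per line."""
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(f"{v}\n" for v in sorted_values)
    run.seek(0)
    return run

def _read_run(run):
    """Stream the integers of a run back from disk, one line at a time."""
    for line in run:
        yield int(line)

def external_sort(values, run_size=3):
    """Yield the integers in `values` in ascending order via on-disk runs."""
    runs, chunk = [], []
    for v in values:
        chunk.append(v)
        if len(chunk) == run_size:           # chunk is "as much as fits in memory"
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))
    try:
        # heapq.merge keeps only one pending value per run in memory.
        yield from heapq.merge(*[_read_run(run) for run in runs])
    finally:
        for run in runs:
            run.close()

result = list(external_sort([5, 3, 9, 1, 8, 2, 7]))
```

Each run is read sequentially during the merge, which suits disks far better than random access.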

Performance Considerations

The performance of on-disk data structures is influenced by a number of factors, such as CPU speed, cache size, disk access time, and disk bandwidth. Strategies for improving performance include −

Minimize disk access

Caching and prefetching improve performance by reducing the number of disk seeks and reads.
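
As an illustration of caching, the sketch below keeps the most recently used pages of a file in memory and counts how many reads actually reach the underlying file. The 4096-byte page size, the capacity of two pages, and the least-recently-used eviction policy are all assumptions; prefetching is not shown.

```python
import io
from collections import OrderedDict

PAGE_SIZE = 4096   # hypothetical page size in bytes

class PageCache:
    """Serve page reads from memory when possible, evicting the least recently used page."""

    def __init__(self, f, capacity=2):
        self.f = f
        self.capacity = capacity
        self.pages = OrderedDict()     # page number -> bytes, oldest first
        self.disk_reads = 0

    def read_page(self, page_no):
        if page_no in self.pages:                  # cache hit: no disk access
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        self.f.seek(page_no * PAGE_SIZE)           # cache miss: go to disk
        data = self.f.read(PAGE_SIZE)
        self.disk_reads += 1
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)         # evict least recently used
        return data

# An in-memory buffer of four zero-filled pages stands in for a file.
cache = PageCache(io.BytesIO(bytes(PAGE_SIZE * 4)), capacity=2)
for page_no in (0, 1, 0, 2, 1):
    cache.read_page(page_no)
reads = cache.disk_reads
```

In this access pattern one of the five requests is served from memory; workloads that revisit the same pages more often benefit more.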

Employ data compression

Data compression reduces disk space usage, and because less data has to be transferred, it can also reduce access times and improve effective transfer speeds.
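
A minimal sketch of compressing a block before writing it, using Python's zlib module, is shown below. The function names, the sample data, and the compression level of 6 are illustrative assumptions; the achievable ratio depends heavily on the data.

```python
import zlib

def compress_block(data):
    """Compress a block of bytes before it is written to disk."""
    return zlib.compress(data, 6)      # level 6 trades speed against size

def decompress_block(stored):
    """Recover the original block after reading it back."""
    return zlib.decompress(stored)

block = b"sensor=42;" * 500            # highly repetitive sample data
stored = compress_block(block)
ratio = len(stored) / len(block)       # fraction of the original size
```

Compression costs CPU time on every read and write, so it pays off mainly when the disk, not the processor, is the bottleneck.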

Choose the right data structures

Selecting data structures that suit the application's access patterns and data types improves performance by reducing the number of disk accesses and cache misses.

Applications of On-Disk Data Structures

Many kinds of software make use of on-disk data structures, including the following −

Databases

Databases use on-disk data structures to store and retrieve data efficiently. B-trees, hash tables, and bitmap indexes are common on-disk data structures in databases.

File systems

File systems use on-disk data structures to organize and access files and directories efficiently. B-trees and linked lists are two commonly used on-disk data structures in file systems.

Search engines

Search engines use on-disk data structures to index and search vast amounts of data efficiently. Inverted indexes and B-trees are common on-disk data structures in search engines.

Conclusion

Storing and accessing data efficiently on disk requires on-disk data structures. A variety of on-disk data structures, storage formats, compression techniques, indexing strategies, and sorting algorithms are available, and the right choice depends on the needs and constraints of the application. Factors such as CPU speed, cache size, disk access time, and disk bandwidth are crucial to achieving good performance. File systems, databases, and search engines are a few of the applications that rely on on-disk data structures. By understanding the fundamentals and best practices of on-disk data structures, developers can create scalable and efficient data storage solutions.

Updated on: 20-Jul-2023
