
Understanding Algorithms: Common Data Structures

In the article Understanding Algorithms: Getting Started, we mentioned that one strategy that can help design better algorithms is using data structures.

We also mentioned that a data structure is a way of storing and organizing data, with the purpose of making data access and modification more convenient. Furthermore, there are many types of data structures, and selecting the appropriate data structure for a specific problem is an important part of algorithm design.

In this article, we'll further introduce common data structures.

Below is the table of contents for this article:

  1. Array
  2. Matrix
  3. Linked List
  4. Stack
  5. Queue
  6. Rooted Tree
  7. Hash Table

Array

An array is a data structure that stores data elements in a contiguous, uninterrupted space in memory.

The core characteristics of arrays can be understood from how they are stored in memory, which directly determines their access efficiency and limitations.

The existence of arrays as a data structure is mainly to leverage the contiguous storage characteristics of computer memory to achieve the most efficient random access to data collections. Its core advantages are as follows: because an element's address can be computed directly from the array's base address and the element's index, random access takes O(1) time, and the contiguous layout also gives arrays excellent cache locality.

However, arrays still have some disadvantages and limitations, including: the size is fixed at allocation time, so growing an array requires reallocating and copying it; and inserting or deleting an element in the middle requires shifting every subsequent element, an O(n) operation.
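These trade-offs can be sketched with Python's list standing in for a contiguous array (the values here are purely illustrative):

```python
# A minimal sketch of array behavior, using a Python list as a stand-in
# for a contiguous array.
arr = [10, 20, 30, 40, 50]

# Random access: the position of arr[i] is computed directly from the
# base address and the index, so reading any element takes O(1) time.
x = arr[3]          # 40, one step regardless of array length

# Insertion in the middle: every later element must shift one slot to
# the right, an O(n) operation that illustrates the array's main limitation.
arr.insert(1, 15)   # arr becomes [10, 15, 20, 30, 40, 50]
```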

Matrix

A matrix is a two-dimensional array, typically implemented through one or more one-dimensional arrays for storage in memory.

The characteristics of matrices are mainly reflected in their diverse memory storage methods. We have several strategies for mapping a two-dimensional logical structure to one-dimensional linear memory. These strategies directly affect data access efficiency and usage flexibility.

Matrices can effectively organize and represent datasets with two-dimensional—or higher-dimensional—relationships. Whenever data has a logical concept of "rows" and "columns," such as tables, chessboards, image pixels, etc., using matrices for representation is very natural and intuitive.
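One common mapping strategy, row-major order, can be sketched as follows (the helper name `index` is my own, not from any particular library):

```python
# A sketch of row-major order: one strategy for mapping a 2-D matrix
# onto 1-D linear memory, row after row.
def index(row, col, num_cols):
    """Map (row, col) of a matrix to a flat 1-D offset."""
    return row * num_cols + col

# A 3x4 matrix stored as one flat list of 12 elements.
flat = [0] * 12
flat[index(1, 2, 4)] = 7   # set the element at row 1, column 2
```

Column-major order is the mirror-image strategy (`col * num_rows + row`); which one a language uses affects which traversal order is cache-friendly.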

Linked List

A linked list is a data structure in which objects determine their linear arrangement through internal pointers, rather than being determined by contiguous positions in memory like arrays.

A linked list is a highly flexible dynamic set representation. Its core characteristics revolve around the concepts of "nodes" and "pointers." Below is a detailed introduction to linked list characteristics:

  1. Node Structure

    • Each element in a linked list is an independent object, usually called a "node."

    • Each node contains at least two properties: one is a key for storing data, and one or more pointers pointing to other nodes. Nodes can also contain other auxiliary data.

  2. Pointers Determine Order

    • Unlike arrays, the linear order of a linked list is determined by the pointers within nodes. In a doubly linked list, each node x has:

      • x.next: Points to its successor node.

      • x.prev: Points to its predecessor node.

    • The first element of the list is called the "head," whose prev pointer is NIL (null). The last element is called the "tail," whose next pointer is NIL. The beginning of the entire list is identified by a property L.head pointing to the head.

  3. Various Forms

    • Linked lists can be divided into many types according to different structures and uses:

      • Singly Linked vs. Doubly Linked: Singly linked list nodes only have a next pointer, while doubly linked lists have both next and prev pointers.

      • Sorted vs. Unsorted: In a sorted list, elements are arranged according to their key values, with the smallest value at the head and the largest at the tail. Unsorted lists have no such restriction.

      • Circular vs. Non-circular: In a circular list, the tail's next pointer points to the head, and the head's prev pointer points to the tail, forming a circle.

  4. Sentinel Application

    • To simplify handling boundary conditions—such as insertion/deletion at the head or tail of the list—we can introduce a dummy object called a "sentinel" L.nil.

    • This sentinel is located between the head and tail, making the list a circular structure. For example: the sentinel's next points to the actual head, and prev points to the actual tail. This way, all insertion and deletion operations can be viewed as occurring between two nodes, without needing to write extra code for special cases at the head/tail, thus simplifying implementation.
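The sentinel technique above can be sketched in a few lines; this is a minimal illustration with class and method names of my own choosing, not a production implementation:

```python
# A minimal doubly linked list with a sentinel node L.nil sitting
# between head and tail, making the list circular.
class Node:
    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None

class LinkedList:
    def __init__(self):
        # The sentinel's next is the head and its prev is the tail;
        # an empty list is just the sentinel pointing at itself.
        self.nil = Node(None)
        self.nil.next = self.nil
        self.nil.prev = self.nil

    def insert_front(self, key):
        x = Node(key)
        x.next = self.nil.next
        x.prev = self.nil
        self.nil.next.prev = x
        self.nil.next = x

    def delete(self, x):
        # No head/tail special cases: every node sits between two nodes.
        x.prev.next = x.next
        x.next.prev = x.prev

    def to_list(self):
        out, x = [], self.nil.next
        while x is not self.nil:
            out.append(x.key)
            x = x.next
        return out

L = LinkedList()
for key in [3, 2, 1]:
    L.insert_front(key)        # list is now 1, 2, 3
L.delete(L.nil.next)           # remove the head with no special-case code
```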

The main purpose of linked lists is to provide a more flexible representation for "dynamic sets." In many application scenarios, we need a data collection that can efficiently add and delete elements, while traditional arrays have performance bottlenecks in these operations. The linear order of a linked list is determined by pointers rather than fixed memory indices, which is the source of its flexibility. Based on the above explanation, we can summarize two major advantages of linked lists: their size can grow and shrink dynamically at runtime, and inserting or deleting a node takes only O(1) time once we hold a pointer to the relevant position, since only a few pointers need to be rewired.

However, linked lists also have their disadvantages: there is no O(1) random access, because reaching the i-th element requires traversing the list from the head in O(n) time; each node also spends extra memory on pointers, and the scattered nodes have poorer cache locality than an array's contiguous storage.

Common applications of linked lists include: implementing stacks and queues, representing adjacency lists in graphs, and maintaining frequently changing collections such as free lists and LRU caches.

Stack

A stack is a dynamic set where the element deleted is always the one most recently inserted, thus implementing the "Last-In, First-Out" (LIFO) principle. Its insertion operation is called PUSH, and its deletion operation, which takes no element argument, is called POP; both act only on the top of the stack, and each takes O(1) time.

The existence of the stack data structure is to solve a class of problems with "last-in, first-out" characteristics. Its design is simple and focused, bringing significant advantages but also some limitations. Below is a summary of the advantages and disadvantages of stacks: on the plus side, the interface is minimal and PUSH and POP both run in O(1) time; on the minus side, only the top element is accessible, so finding or removing an arbitrary element requires popping everything above it first.

Managing Subroutine Linkage is one of the most classic applications of stacks. When our code calls a function, the system PUSHes information like return address, parameters, and local variables onto a memory area called the "call stack." If this function calls another function, the new information is again PUSHed to the top of the stack. When a function finishes executing and returns, the corresponding information is POPped from the stack, and the program flow can return to the previous position. This last-in, first-out mechanism perfectly handles nested function calls.
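The call-stack behavior described above can be sketched with a Python list standing in for the stack (the function names pushed here are illustrative):

```python
# A minimal LIFO sketch: a Python list used as a stack, mimicking
# nested function calls f -> g -> h on a call stack.
stack = []
stack.append("f")   # PUSH: f is called
stack.append("g")   # PUSH: f calls g
stack.append("h")   # PUSH: g calls h
top = stack.pop()   # POP: h finishes first, last-in first-out
```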

Queue

A queue is a dynamic set that follows the "First-In, First-Out" (FIFO) principle. Like a waiting line in real life, the first element to enter the queue will be the first to be removed. Its insertion operation is called ENQUEUE and always occurs at the tail of the queue, while its deletion operation is called DEQUEUE and always occurs at the head; like a stack's PUSH and POP, both take O(1) time.

The existence of queues is to solve scenarios requiring sequential processing of tasks or data. Its core value lies in its built-in "fairness" and "ordering." When multiple items need to share limited resources (such as a printer or a CPU core), queues provide a simple and predictable management mechanism, ensuring that first-arriving requests are processed first. It abstracts the fair real-world behavior of "waiting in line" into a data structure in computer science. Below is a summary of queue advantages and disadvantages:

Advantages: ENQUEUE and DEQUEUE both run in O(1) time, and the FIFO discipline guarantees that items are processed in arrival order, which is both fair and predictable.

Disadvantages: only the head and tail are accessible, so there is no random access to elements in the middle; and in a fixed-size array implementation, ENQUEUE on a full queue overflows while DEQUEUE on an empty queue underflows.

The FIFO characteristic of queues gives them wide applications in the computer science field. Below are some common examples:
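A minimal FIFO sketch using Python's `collections.deque`, whose `append` and `popleft` give O(1) enqueue and dequeue (the job names are illustrative):

```python
from collections import deque

# ENQUEUE at the tail with append, DEQUEUE from the head with popleft.
q = deque()
q.append("job-1")    # first arrival
q.append("job-2")
q.append("job-3")
first = q.popleft()  # the first arrival is served first
```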

Rooted Tree

A rooted tree is a linked data structure that represents non-linear hierarchical relationships through node objects and pointers. Its core characteristic lies in how nodes are organized and the connection relationships between nodes. Below, let's first introduce the basic characteristics of rooted trees: one distinguished node is the root, whose parent pointer is NIL; every other node has exactly one parent and zero or more children; and nodes with no children are called leaves. Every node can be reached from the root along a unique path of child pointers.

Many relationships in the real world are essentially non-linear. Traditional linked lists are very suitable for representing linear relationship data like sequences and queues. However, when we need to simulate or process data with hierarchical, branching, or subordinate relationships, such as company organization charts, folder structures, family trees, etc., linear structures fall short. Rooted trees provide an intuitive and efficient way to model these complex non-linear hierarchical relationships.

Below, let's further introduce common types of rooted trees:

Binary Tree

A binary tree is a tree data structure where each node has at most two children (a left child and a right child). This is one of the most basic tree representations. Below are the characteristics of binary trees: each node stores a key together with pointers left, right, and p to its left child, right child, and parent, any of which may be NIL when absent. Note that left and right are distinguished: a node whose only child is a left child forms a different binary tree from one whose only child is a right child.
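A minimal node structure along these lines might look like the following (the attribute names follow the common key/left/right convention, not any particular library):

```python
# A minimal binary tree node: a key plus left and right child pointers,
# with None standing in for NIL.
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left    # left child, or None if absent
        self.right = right  # right child, or None if absent

# A small tree:    2
#                 / \
#                1   3
root = TreeNode(2, TreeNode(1), TreeNode(3))
```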

Heap

A heap is an array object that we can view as a nearly complete binary tree.

A heap is a very efficient data structure. Its core characteristics revolve around its special structure and element ordering rules. Below is a detailed introduction to heap characteristics: the underlying binary tree is completely filled on every level except possibly the lowest, which is filled from left to right, so the whole tree can be stored compactly in an array with no pointers; with 1-based indexing, the parent of element i is at index ⌊i/2⌋ and its children are at 2i and 2i+1. In a max-heap, every node's key is at least as large as its children's keys, so the maximum sits at the root; a min-heap reverses this inequality, placing the minimum at the root.

The emergence of the heap data structure is mainly to solve efficiency problems of specific algorithms and provide an efficient information management design technique. Common applications include heapsort and priority queues.
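The array view of a heap is exactly what Python's standard heapq module implements; a minimal min-heap sketch (note heapq is 0-indexed, so the children of index i sit at 2i+1 and 2i+2):

```python
import heapq

# heapify rearranges a plain list in place so that it satisfies the
# min-heap property: the smallest element is always at data[0].
data = [9, 4, 7, 1, 8]
heapq.heapify(data)
smallest = heapq.heappop(data)  # remove and return the minimum
heapq.heappush(data, 2)         # insertion sifts the new element into place
```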

Binary Search Tree

A binary search tree is a binary tree organized to satisfy a key property: for any node x in the tree, all nodes in its left subtree have keys less than or equal to x's key, while all nodes in its right subtree have keys greater than or equal to x's key.

The core characteristics of binary search trees all revolve around this "binary-search-tree property," which determines their structure and behavior. Below is a detailed introduction to binary search tree characteristics: because the property holds at every node, an in-order tree walk visits all keys in sorted order; and basic operations such as search, minimum, maximum, predecessor, successor, insertion, and deletion each run in O(h) time, where h is the height of the tree.
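The property can be sketched with a minimal insert and in-order walk (the helper names are my own):

```python
# A minimal binary search tree: insert preserves the binary-search-tree
# property, so an in-order walk yields the keys in sorted order.
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key <= root.key:
        root.left = insert(root.left, key)   # smaller-or-equal keys go left
    else:
        root.right = insert(root.right, key) # larger keys go right
    return root

def inorder(root):
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)

root = None
for key in [5, 2, 8, 1, 3]:
    root = insert(root, key)
```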

Red-Black Tree

A red-black tree is a self-balancing binary search tree. By assigning each node a "red" or "black" color attribute and following a specific set of rules, it ensures that the longest path from the root to a leaf is never more than twice the length of the shortest path, thereby keeping the tree's height at the logarithmic level O(lg n) and ensuring that operations like search, insertion, and deletion maintain high performance even in the worst case.

The core of red-black trees is that they're not just binary search trees—they must always satisfy the following five "red-black properties." It's the combined effect of these properties that enables red-black trees to self-maintain balance.

First, we need to understand the structure of red-black trees. In addition to having key (key value), left (left child), right (right child), and p (parent) like ordinary binary search trees, each node additionally stores a color property whose value is either "red" or "black."

Below are the five properties that red-black trees must observe:

  1. Every node is either red or black.

    This is the most basic color rule, ensuring each node has a definite color state.

  2. The root node must be black.

    This rule ensures that paths starting from the root always begin with a black node.

  3. Every leaf node (NIL) is black.

    The "leaf node" here is a very critical concept. In the theoretical model of red-black trees, it doesn't refer to key-value nodes without children, but to the empty (NIL) positions being pointed to. For convenience in code implementation, usually a single black node T.nil called a "sentinel" is used to represent all these NIL leaves, as well as the root node's parent.

  4. If a node is red, then both its children must be black.

    This property is one of the keys to maintaining balance. It directly prohibits two consecutive red nodes from appearing on any path. This also means that on any path downward from any node, the number of red nodes won't exceed the number of black nodes.

  5. For each node, every simple path from that node to all its descendant leaf nodes contains the same number of black nodes.

    This property is the most ingenious and powerful rule of red-black trees. It defines a concept called "black-height": the number of black nodes on any path from a node x downward (not including x itself) to its descendant leaf nodes, denoted bh(x). This rule ensures that for any node, all downward paths have equal "black length." Combined with the fourth property, this ensures no path is much longer than others, keeping the tree "approximately balanced."

Standard binary search trees perform well in ideal situations, but they have a fatal flaw: their performance depends entirely on the tree's height h, and all basic operations have a time complexity of O(h). If inserted data is random, the expected tree height is O(lg n), with excellent performance. But if the inserted data is already sorted (for example, inserting 1, 2, 3, 4, 5 in order), the binary search tree degenerates into a linked list, with tree height h becoming n. At this point, operation performance drops sharply to O(n), completely losing the advantage of the tree structure.
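This degeneration is easy to observe: the sketch below (with helper names of my own) inserts sorted keys into a plain BST and measures the resulting height:

```python
# Inserting already-sorted keys into a plain BST produces a
# linked-list-shaped tree whose height equals the number of keys.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

root = None
for key in [1, 2, 3, 4, 5]:   # sorted input: the worst case for a plain BST
    root = insert(root, key)
# height(root) is now 5 == n, so operations degrade to O(n);
# a self-balancing tree would keep the height at O(lg n) instead.
```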

The birth of red-black trees is to solve this "worst-case" performance collapse problem. Through the five properties mentioned above, it forcibly constrains the tree's structure, ensuring the tree's height always remains at the O(lg n) level. This way, even in the worst case, various dynamic set operations on red-black trees (such as search, insertion, deletion, finding maximum/minimum values, etc.) can guarantee a time complexity of O(lg n).

After introducing the above four common rooted tree types, we can use the following simple analogy to more clearly understand their relationships:

First, Binary Tree is the ancestor of this family. Heap and Binary Search Tree are two children of the ancestor. They inherited the ancestor's characteristics but developed completely different personalities. Finally, Red-Black Tree is the "specialist" form that Binary Search Tree evolved into to make itself more powerful and reliable.

Hash Table

A hash table is an efficient data structure that, through a special function called a hash function, directly computes and maps a key to a specific position in an array—called a slot—to achieve fast data access.

To deeply understand hash tables, we can break down their characteristics into the following core concepts: a hash function h maps each key to one of the table's m slots; because the universe of possible keys is usually far larger than m, two distinct keys can hash to the same slot, which is called a collision; and collisions must be resolved, most commonly by chaining (each slot holds a linked list of entries) or by open addressing (probing other slots until a free one is found).

The emergence of hash tables is mainly to solve the fundamental limitations of Direct-address Tables, which require an array with one slot for every possible key in the universe and therefore waste enormous space when only a small fraction of those keys is actually stored.

The advantages of hash tables include: under reasonable assumptions about the hash function, search, insertion, and deletion all take O(1) time on average, while using space proportional to the number of stored keys rather than to the size of the key universe.

However, hash tables still have their disadvantages: in the worst case, when many keys collide in the same slot, operations degrade to O(n); the stored keys have no useful ordering, so range queries and sorted traversal are awkward; and performance depends heavily on choosing a good hash function and keeping the load factor in check.
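A chaining-based hash table along these lines can be sketched as follows (the class name, slot count, and keys are all illustrative):

```python
# A minimal hash table with chaining: a hash function maps each key to
# a slot, and colliding entries share that slot's list.
class HashTable:
    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, key):
        # The hash function compresses a key into a slot index.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # update an existing key in place
                chain[i] = (key, value)
                return
        chain.append((key, value))  # new key: append to the slot's chain

    def get(self, key):
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None                 # absent key

table = HashTable()
table.put("apple", 3)
table.put("banana", 5)
table.put("apple", 7)   # overwrites the earlier value for "apple"
```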

References

Introduction to Algorithms, fourth edition

Appendix

This article is part of a series. Currently, five articles have been written:

  1. Understanding Algorithms: Getting Started
  2. Understanding Algorithms: Common Data Structures
  3. Understanding Algorithms: Comparison-Based Sorting
  4. Understanding Algorithms: Linear-Time Sorting
  5. Understanding Algorithms: The Selection Problem