19 Through Trees and Graphs See How to Find Paths and Order in Disorder


Hello, I am Ishikawa.

In algorithms, the two most common operations are sorting and searching. Earlier, we learned about different sorting algorithms on linear data structures built from arrays. Later, we learned about hash algorithms and how they combine with linked lists to support indexing and searching by trading space for time, in a special linear structure called a hash table.

In real life, we often encounter information that is not necessarily ordered. We discussed this when talking about functional programming, where one aspect of functional thinking is how to deal with the unknown and chaotic. Just like real-world mappings, our data structures and algorithms also need to handle such complex situations. Today, we will look at two types of non-linear data structures that support us in handling this unknown and chaotic data - trees and graphs.

We know that some large consulting firms like McKinsey often use the “Pyramid Principle” to develop solutions. Its core principle explains that the information we receive in everyday life is a messy network structure, and our job is to organize this chaotic information into a “pyramid” shaped tree-like structure and then present it in a linear “storytelling” manner. This is very similar to the various data structures we use in algorithms.

Image

Today, we will delve into these two data structures - trees and graphs. Let’s start with graphs, which are non-linear data structures. Many disorganized network organizations in our lives can be represented using graphs, such as social networks, internet communication, city roads, and website links. From a frontend development perspective, different dependent resources can also be considered disordered. For example, webpack, a commonly used module bundler in frontend development, helps us establish order in disorder by clarifying the dependencies among resources through its module loading feature. This functionality can be achieved through topological sorting. Let’s start by understanding the graph data structure itself and see how it accomplishes this step by step.

Establishing Resource Dependencies through Topological Sorting #

First, let’s take a look at the structure of the graph in the figure.

Understanding Graph Types and Structures #

As mentioned in the previous section, a graph consists of nodes connected by edges. Extending this concept, two nodes connected by an edge are called adjacent vertices. The number of edges connected to a node is called its degree. An edge can also carry a weight. A path is a sequence of adjacent nodes.

Image

A path that returns to its starting point is called a cycle; a graph containing such a path is a cyclic graph, while a graph with no path that returns to its starting point is an acyclic graph. If the edges between the nodes in a graph point in a specific direction, it is called a directed graph. Conversely, if the edges have no specific direction, it is an undirected graph. A graph that is both directed and acyclic is referred to as a directed acyclic graph (DAG), as we mentioned in the previous lecture.

One way to store a graph is an adjacency matrix. A graph with relatively few edges is called a sparse graph; one with many edges is called a dense graph. For a sparse graph, an adjacency matrix can be wasteful, since it maintains a full row and column for every node even when only a few edges exist. Adjacency matrices are also awkward when it comes to inserting and deleting nodes. Therefore, another way to store a graph is an adjacency list. This structure can be represented using an array, a linked list, a hash, or a dictionary, making it an effective complement to the adjacency matrix.
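As a quick sketch, here is how the same small, hypothetical undirected graph could be stored both ways in JavaScript (the node names and edges are invented for illustration):

```javascript
// A hypothetical 4-node undirected graph: A-B, A-C, B-D.
// Adjacency matrix: rows/columns follow the order [A, B, C, D],
// and a 1 marks an edge between the two nodes.
const matrix = [
  [0, 1, 1, 0], // A
  [1, 0, 0, 1], // B
  [1, 0, 0, 0], // C
  [0, 1, 0, 0], // D
];

// The same graph as an adjacency list: each node maps to its neighbors,
// so a sparse graph only pays for the edges that actually exist.
const list = {
  A: ['B', 'C'],
  B: ['A', 'D'],
  C: ['A'],
  D: ['B'],
};

// The degree of a node can be read from either representation.
const degreeFromMatrix = (m, i) => m[i].reduce((sum, x) => sum + x, 0);
const degreeFromList = (l, v) => l[v].length;
```

Both calls above report the same degree for node A, which is one way to sanity-check that the two representations describe the same graph.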

Image

There are two classic ways to traverse a graph: breadth-first search (BFS) and depth-first search (DFS). Breadth-first search is commonly used to find the shortest path, while depth-first search is the traversal method that can be utilized for topological sorting, as we mentioned earlier. Let’s start by looking at breadth-first search.
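As a minimal sketch of why BFS finds shortest paths in an unweighted graph, assuming a small hypothetical adjacency list:

```javascript
// BFS visits nodes level by level using a queue, so the first time it
// reaches a node, the number of steps taken is the shortest (unweighted)
// path length from the source.
const bfsDistances = (edges, src) => {
  const dist = { [src]: 0 };
  const queue = [src];
  while (queue.length > 0) {
    const u = queue.shift();
    for (const v of edges[u]) {
      if (!(v in dist)) {      // first visit is the shortest route
        dist[v] = dist[u] + 1;
        queue.push(v);
      }
    }
  }
  return dist;
};

// A hypothetical graph for illustration.
const edges = { A: ['B', 'C'], B: ['A', 'D'], C: ['A', 'D'], D: ['B', 'C'] };
// bfsDistances(edges, 'A') → { A: 0, B: 1, C: 1, D: 2 }
```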

Image

Finding the Shortest Path in a Graph #

The concept of the shortest path is very common in everyday life, such as when we use maps. Common algorithms for finding the shortest path include Dijkstra’s algorithm and Floyd-Warshall algorithm. Now let’s take a look at Dijkstra’s algorithm using the example in the figure. Suppose we want to calculate the shortest distances from A to B, C, D, E, and Z.

Image

Dijkstra’s algorithm initializes all distances to infinity (dist[i] = INF) and marks all nodes as unvisited (visited[i] = false). Then it sets the distance from the source node to itself to 0 (dist[src] = 0). On each round, it selects the unvisited node with the smallest known distance (minDist(dist, visited)) and marks it as visited (visited[u] = true). It then relaxes each of that node’s neighbors: if dist[u] + graph[u][v] is smaller than the current dist[v], the shorter distance is recorded. Once every node has been processed, the array of shortest distances is returned. Here is the simplified code:

const INF = Number.MAX_SAFE_INTEGER;

// pick the unvisited node with the smallest known distance
const minDist = (dist, visited) => {
  let min = INF, minIndex = -1;
  for (let v = 0; v < dist.length; v++) {
    if (!visited[v] && dist[v] <= min) {
      min = dist[v];
      minIndex = v;
    }
  }
  return minIndex;
};

const dijkstra = (graph, src) => {
  const length = graph.length;
  const dist = [], visited = [];
  for (let i = 0; i < length; i++) {
    dist[i] = INF;
    visited[i] = false;
  }
  dist[src] = 0;
  for (let i = 0; i < length - 1; i++) {
    const u = minDist(dist, visited);
    visited[u] = true;
    for (let v = 0; v < length; v++) {
      // relax the edge u -> v (a 0 in the matrix means no edge)
      if (!visited[v] && graph[u][v] !== 0 && dist[u] !== INF &&
          dist[u] + graph[u][v] < dist[v]) {
        dist[v] = dist[u] + graph[u][v];
      }
    }
  }
  return dist;
};

Finding Resource Dependencies in Disorder #

Returning to the initial question, when we use webpack or similar tools for resource packaging, it is important to understand the dependencies between modules. If we abstract this proposition, it is crucial to know how to establish order within chaos and analyze which node should be processed first in a directed graph. Now let’s take a look at how topological sorting manages resource dependencies. One of the methods that can be used here is a depth-first search (DFS) traversal.

Image

In simple terms, topological sorting starts from a node and performs a DFS traversal until every vertex reachable from it has been recursively exhausted. Each node visited during the recursion is added to a visited set, and as each recursive call finishes, the finished node is inserted at the beginning of an array using unshift, so the nodes appear in reverse order of completion.

dfs = function(v, visited, stack) {
  visited.add(v);
  // this.edges[v] holds the list of vertices adjacent to v
  for (var item of this.edges[v]) {
    if (visited.has(item) == false) {
      this.dfs(item, visited, stack);
    }
  }
  stack.unshift(v);
};

dfsTopoSort = function() {
  var visited = new Set(),
    stack = [];
  for (var item in this.edges) {
    if (visited.has(item) == false) {
      this.dfs(item, visited, stack);
    }
  }
  return stack;
};
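To see the idea end to end, here is a standalone sketch of the same DFS-based approach, applied to a hypothetical webpack-style dependency graph. The module names are invented, and an edge u → v here means "u must be processed before v":

```javascript
// DFS-based topological sort over a plain adjacency object.
const topoSort = (edges) => {
  const visited = new Set();
  const stack = [];
  const dfs = (v) => {
    visited.add(v);
    for (const next of edges[v] || []) {
      if (!visited.has(next)) dfs(next);
    }
    // Insert after all descendants finish, so v lands before them.
    stack.unshift(v);
  };
  for (const v of Object.keys(edges)) {
    if (!visited.has(v)) dfs(v);
  }
  return stack;
};

// Hypothetical modules: utils must come before app and admin,
// both of which must come before the final dist step.
const deps = { utils: ['app', 'admin'], app: ['dist'], admin: ['dist'], dist: [] };
const order = topoSort(deps);
// order places 'utils' before 'app' and 'admin', and those before 'dist'
```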

Building Web API Routing using Trie #

After talking about graphs, let’s now delve deeper into the structure of trees and the algorithms they support.

Understanding different types and structures of trees #

A tree is a very common data structure. It resembles the trees we see in daily life, with branches and leaves growing from a trunk; in a tree structure, these are called nodes. However, when we visualize a tree as a data structure, we usually draw it upside down. You may have seen company organizational charts, which can be understood as tree structures composed of nodes from top to bottom. At the top of the tree is the CEO of the company, and this node is called the root. HTML, which we use in frontend development, is also a typical tree structure, where each element has a parent node and child nodes. The Abstract Syntax Tree (AST) in compiler theory, which we mentioned earlier, is also a tree structure.

Within tree structures, there are binary trees and binary search trees (BST). Although the names are similar, they represent different concepts. A binary tree is a tree in which each node has at most 2 child nodes, while a binary search tree is a binary tree in which every node’s left subtree holds only values smaller than the node’s own and its right subtree only values larger. Binary trees can be further categorized into full binary trees and complete binary trees. For traversal, there are three orders: pre-order, in-order, and post-order. Let’s take a look at each of them.

Image

There are three depth-first traversals of a binary search tree. The first is in-order traversal, which visits the nodes in ascending order; the traversal order is left, root, right. The second is pre-order traversal, which visits the root node before its children; the traversal order is root, left, right. The third is post-order traversal, which visits a node only after its children; the traversal order is left, right, root. In addition to these three, there is also level order traversal, which corresponds to the breadth-first search (BFS) principle we mentioned when talking about graphs.
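The three orders can be sketched on a small, hypothetical binary search tree:

```javascript
const node = (value, left = null, right = null) => ({ value, left, right });

//       4
//     /   \
//    2     6
//   / \   / \
//  1   3 5   7
const root = node(4, node(2, node(1), node(3)), node(6, node(5), node(7)));

// left, root, right: yields the values in ascending order (the BST property)
const inOrder = (n, out = []) => {
  if (n) { inOrder(n.left, out); out.push(n.value); inOrder(n.right, out); }
  return out;
};
// root, left, right
const preOrder = (n, out = []) => {
  if (n) { out.push(n.value); preOrder(n.left, out); preOrder(n.right, out); }
  return out;
};
// left, right, root
const postOrder = (n, out = []) => {
  if (n) { postOrder(n.left, out); postOrder(n.right, out); out.push(n.value); }
  return out;
};

// inOrder(root)   → [1, 2, 3, 4, 5, 6, 7]
// preOrder(root)  → [4, 2, 1, 3, 6, 5, 7]
// postOrder(root) → [1, 3, 2, 5, 7, 6, 4]
```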

Image

Binary search trees have a problem. When a branch of the tree becomes too deep, there may be performance problems when adding, removing, or searching. Therefore, based on binary search trees, we have AVL trees and Red-Black trees, both of which are balanced binary trees. For AVL trees, the difference in depth between the left and right sides is at most one level.

Image

To maintain balance after insertion, AVL trees adjust themselves by performing a left rotation or a right rotation. In the above example, a left rotation is performed; in the example below, we can see a right rotation. Ideally, the operations of an AVL tree cost \(O(\log_2 n)\). However, when there are frequent insertions and deletions, the rotations required to restore balance reduce efficiency. By contrast, an ordinary unbalanced binary search tree can degrade to \(O(n)\) in the extreme case, which is precisely what the balancing prevents.
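A left rotation itself is only a few pointer moves. Here is a minimal sketch; the node shape is a hypothetical plain object, and real AVL code would also track heights and balance factors:

```javascript
const node = (value, left = null, right = null) => ({ value, left, right });

const rotateLeft = (x) => {
  const y = x.right;  // y moves up to become the new subtree root
  x.right = y.left;   // y's left subtree is handed over to x
  y.left = x;         // x becomes y's left child
  return y;
};

// Before:  1          After:    2
//           \                  / \
//            2                1   3
//             \
//              3
let root = node(1, null, node(2, null, node(3)));
root = rotateLeft(root);
// root.value → 2, root.left.value → 1, root.right.value → 3
```

A right rotation is the mirror image: swap every `left` and `right` in `rotateLeft`.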

Image

Although AVL trees have high query efficiency, the rotation operations performed to maintain balance after node insertion or deletion can become costly. This is where Red-Black trees come in to solve similar problems. How does a Red-Black tree achieve lower maintenance cost? Let’s take a look. The key characteristic of a Red-Black tree is that each node is either red or black. The root node and all leaf nodes (the NIL sentinel leaves) are black.

If a node is red, then its two child nodes are black. There cannot be two adjacent red nodes, which means a red node cannot have a red parent or child node. Every path from a given node to any of its descendant leaf nodes contains the same number of black nodes. As for insertion, it also follows two principles: the inserted node should be red, and the insertion position should be a leaf node.

Image

Based on the above principles, in the above diagram, when we want to add node h, if it is the root node, we can simply make it black. But since it is a leaf node, it should be red. After insertion, if the parent node f is black, we don’t need to do anything. However, if f is red, we need to change f to black, and also adjust the grandparent nodes upwards. The nodes starting from c should contain the same number of black nodes, so the color of e is adjusted to black accordingly. In the example below, in addition to color-changing, the Red-Black tree, like the AVL tree, also needs to perform rotation in some cases.

Image

The benefit of a Red-Black tree is that its complexity for operations such as querying, inserting, and deleting can be kept relatively stable, controlled at \(O(\log_2 n)\).

Image

Within trees, there is a special type of structure called a heap. Yes, when we talked about the concepts of stack and heap in the context of function call stacks before, we introduced the heap as a memory storage space. Today, from the perspective of a data structure, a heap is a complete binary tree. What’s the difference between a complete binary tree and a full binary tree? A full binary tree has every level completely filled with nodes. A complete binary tree relaxes this slightly: every level is completely filled except possibly the last, where the nodes are packed from left to right. In other words, leaf nodes can only appear in the bottommost and second-to-bottommost levels, and the nodes in the bottommost level are concentrated at the leftmost positions of that level.

Image

The characteristic of a heap is that the value of each of its nodes must be “greater than or equal to” (a max-heap) or “less than or equal to” (a min-heap) the values of its child nodes. Unlike other tree-like data structures, which store nodes and pointers, a heap can be stored in a plain array. It can be used for priority queues, and another common application is heap sort. Like quicksort and merge sort, its time complexity is \(O(n \log_2 n)\). However, it is worth noting that heap sort may require more compare and swap operations than quicksort, so its practical performance can be lower even though their time complexities are the same.
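Heap sort can be sketched directly on a plain array, using the parent-child index arithmetic that array storage makes possible (the children of index i live at 2i+1 and 2i+2):

```javascript
// Restore the max-heap property by sinking the element at index i.
const siftDown = (arr, i, size) => {
  while (true) {
    const left = 2 * i + 1, right = 2 * i + 2;
    let largest = i;
    if (left < size && arr[left] > arr[largest]) largest = left;
    if (right < size && arr[right] > arr[largest]) largest = right;
    if (largest === i) return;
    [arr[i], arr[largest]] = [arr[largest], arr[i]];
    i = largest;
  }
};

const heapSort = (arr) => {
  // Build the max-heap bottom-up, starting from the last non-leaf node.
  for (let i = Math.floor(arr.length / 2) - 1; i >= 0; i--) {
    siftDown(arr, i, arr.length);
  }
  // Repeatedly move the current maximum to the end and shrink the heap.
  for (let end = arr.length - 1; end > 0; end--) {
    [arr[0], arr[end]] = [arr[end], arr[0]];
    siftDown(arr, 0, end);
  }
  return arr;
};

// heapSort([5, 3, 8, 1, 9, 2]) → [1, 2, 3, 5, 8, 9]
```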

Image

What are some string matching algorithms? #

After discussing trees, let’s take a look at strings. You might wonder what the relationship is between strings and trees. In fact, here, we are going to talk about another special tree-like data structure called a trie. But before talking about the trie, let’s first take a look at the string matching algorithms Brute Force (BF), Boyer-Moore (BM), Rabin-Karp (RK), and Knuth-Morris-Pratt (KMP). First, let’s compare the brute force and Boyer-Moore algorithms. In the following illustration, we can see the comparison between a text string and a pattern.

Image

From the diagram above, we can see that if we use brute force, also known as the naive algorithm, in each iteration, the letters move one by one until a match is found. This process of sliding matching requires a total of 6 iterations. So, although the brute force algorithm is simple and straightforward, its efficiency is not high. On the other hand, the Boyer-Moore algorithm can utilize the bad character approach shown in the diagram to skip matching when a mismatch with the pattern is encountered. In addition to the bad character approach, there is also the good suffix approach, which achieves a similar effect by aligning the sliding position when a matching good suffix is found. Through these sliding matching techniques of skipping or aligning, the number of iterations can be reduced.
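The brute-force matcher can be sketched in a few lines:

```javascript
// Slide the pattern one position at a time and compare character by character.
const bruteForceSearch = (text, pattern) => {
  for (let i = 0; i <= text.length - pattern.length; i++) {
    let j = 0;
    while (j < pattern.length && text[i + j] === pattern[j]) j++;
    if (j === pattern.length) return i; // full match starting at position i
  }
  return -1;
};

// bruteForceSearch('abcabcabd', 'abd') → 6
```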

Image

KMP is similar to BM, but the order of matching is reversed. In BM, we see that the bad character comes before the good suffix. In KMP, the good prefix comes before the bad character. In this way, once a mismatch is found, the next starting position to search can be determined, avoiding revalidating previously matched characters.
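A minimal KMP sketch, with the failure table built from the pattern's prefixes, looks like this:

```javascript
// For each prefix of the pattern, record the length of its longest proper
// prefix that is also a suffix. On a mismatch, this tells us how far the
// pattern can shift without re-checking characters already known to match.
const buildTable = (pattern) => {
  const table = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = table[k - 1];
    if (pattern[i] === pattern[k]) k++;
    table[i] = k;
  }
  return table;
};

const kmpSearch = (text, pattern) => {
  const table = buildTable(pattern);
  let k = 0; // number of pattern characters currently matched
  for (let i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = table[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - pattern.length + 1;
  }
  return -1;
};

// kmpSearch('ababcababd', 'ababd') → 5
```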

After discussing BM and KMP, let’s take a look at the RK algorithm. This matching method compares the hash value of each substring of the text with the hash value of the pattern. If the hash values are equal, a direct comparison confirms the match, ruling out hash collisions.
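A simplified Rabin-Karp sketch follows. The hash here is just the sum of character codes, a deliberately weak choice for clarity, which is why every hash hit is verified with a direct comparison:

```javascript
const rkSearch = (text, pattern) => {
  const m = pattern.length;
  if (m > text.length) return -1;
  const sum = (s) => [...s].reduce((acc, ch) => acc + ch.charCodeAt(0), 0);
  const target = sum(pattern);
  let hash = sum(text.slice(0, m)); // hash of the first window
  for (let i = 0; ; i++) {
    // verify on hash hit to rule out collisions
    if (hash === target && text.slice(i, i + m) === pattern) return i;
    if (i + m >= text.length) return -1;
    // roll the hash: drop the leftmost character, add the next one
    hash += text.charCodeAt(i + m) - text.charCodeAt(i);
  }
};

// rkSearch('hello world', 'world') → 6
```

The rolling update is what makes RK attractive: moving the window costs O(1) instead of rehashing the whole substring.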

Image

So, how do these string matching algorithms compare in terms of complexity? If we define the length of the string as \(m\) and the length of the pattern as \(n\), the following are their respective complexities. We can see that brute force has no preprocessing cost but its own processing complexity is high. BM is advantageous when dealing with large amounts of data, while KMP is more advantageous when dealing with smaller amounts of data. Compared to individual strings, RK is more advantageous when dealing with a large number of strings that need to be matched.

Image

Using a Trie for Routing Queries #

Now, let’s discuss which algorithm is commonly used for routing. It uses a special type of tree called a trie (prefix tree).

Image

In this process, if the length of the search string is \(w\), the time complexity for insertion, search, and deletion is all \(O(w)\). If \(m\) is the number of input words and \(n\) is the length of the longest word, the space complexity is \(O(m \times n)\). Due to the high space requirement, a trie is most suitable for searching among many strings that share common prefixes, but not for searching a single pattern within a string. This is exactly why it is particularly suitable for queries in routing.

Image

So, how is a trie implemented? It uses a nested object structure: each level has a child object acting as a child node. When performing operations such as search, insertion, and deletion, it is important to check whether a node already exists before operating on it. If \(n\) is the length of the queried string, the time complexity for search, insertion, and deletion is all \(O(n)\), because each character needs to be checked. If \(m\) is the number of inserted strings and \(n\) is the length of the longest string, the space complexity is \(O(m \times n)\).
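A minimal trie sketch along these lines, using nested plain objects (the isEnd flag and the route strings are illustrative assumptions, not a real router API):

```javascript
// Each node holds a children map keyed by character, plus a flag that
// marks where a complete word (here, a route path) terminates.
const createTrie = () => ({ children: {}, isEnd: false });

const insert = (root, word) => {
  let node = root;
  for (const ch of word) {
    // create the child node only if it does not already exist
    if (!node.children[ch]) node.children[ch] = createTrie();
    node = node.children[ch];
  }
  node.isEnd = true;
};

const search = (root, word) => {
  let node = root;
  for (const ch of word) {
    if (!node.children[ch]) return false;
    node = node.children[ch];
  }
  return node.isEnd;
};

// Hypothetical route registration and lookup.
const routes = createTrie();
insert(routes, '/api/users');
insert(routes, '/api/orders');
// search(routes, '/api/users') → true
// search(routes, '/api/user')  → false (a prefix only, not a full route)
```

Because `/api/users` and `/api/orders` share the `/api/` prefix, they share the first five trie nodes, which is the space saving that makes tries a natural fit for routing tables.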

Therefore, from here, we can see that compared to other common string search methods, trie is more suitable for handling multiple strings with the same prefix. However, it is not suitable for searching a single pattern within a string because it would consume a large amount of memory space.

Conclusion #

That concludes this issue. Through the use of graphs and trees, we can see that their main characteristic is the ability to map out relatively unordered and unknown structures in the real world. From a concrete perspective, we can see that they can be used to solve problems in web development such as module bundling and routing. From an abstract standpoint, we can also see that algorithms often solve problems by establishing order in chaos and finding a way out of the unknown. This is also very helpful in organizing information in the face of disorder or making life choices. Additionally, when analyzing complexity, we must expand our thinking beyond just time and space complexity, and consider factors such as preprocessing and complexity in different application scenarios.

Thought question #

Today’s thought question is: when we discussed the topological sort implementation, we used an array and unshift. Do you think this approach is the most efficient? From the perspective of space and time complexity or other aspects, is there any room for optimization?

I look forward to seeing your thoughts and sharing in the comment section. Let’s discuss together. Also, feel free to share today’s content with more friends. See you next time!