12 Object-Oriented - How to Implement a Search Engine #

Hello, I’m Jingxiao. In this lesson, we will implement a simple search engine in Python.

Building on what we learned in the previous lesson, the main objective of today’s class is to take you through the iterative development process of agile development, and consolidate the concept of object-oriented programming.

We will start with the simplest and most direct search and gradually optimize it. Although I won’t delve into advanced algorithms, I will inevitably introduce some basic concepts used in modern search engines such as corpus and inverted index.

If you have some background knowledge in this area, you will naturally be able to understand it easily. Even if you have never been exposed to search engines before, there is no need to worry too much. I will strive to keep things concise and clear, making learning easier. At the same time, I hope you focus more on object-oriented modeling principles.

The “High-End” Search Engine #

The term “engine” sounds just as cool and impressive as it is. Search engines were among the most important gateways to the Internet in the early 21st century, and with their help, China and the United States gave birth to giant companies such as Baidu and Google.

Search engines have greatly facilitated life on the Internet and have become an essential tool for surfing the web. Internet advertising, built on top of search, has become the core business model of the Silicon Valley and Chinese giants. Search itself also keeps improving, and Facebook and WeChat have long wanted to build search platforms into their own social products.

I don’t need to say much about the value of search engines. Today we will mainly look at the core components of search engines.

According to a friend of mine at Google, new employees there take a training course called “The Life of a Query,” which explains what happens when a user types a string of text into a browser and presses Enter. Today, I will follow this line of thinking to give you a brief introduction.

We know that a search engine is composed of four parts: the crawler, indexer, retriever, and user interface.

The crawler, commonly known as a “spider,” crawls websites across the Internet and delivers their content to the indexer. After receiving a web page and its content, the indexer processes the content into an index, stores it in an internal database, and waits for retrieval.

The user interface is self-explanatory and refers to the web page and app front-end interface, such as the search pages of Baidu and Google. Users send queries to the search engine through the user interface. After the queries are parsed, they are delivered to the retriever. The retriever efficiently retrieves the results and returns them to the users.

Crawling is not the focus of our study today, so I won’t go into it in depth. Let’s simply assume the sample documents already exist on the local disk.

To make it easier, we will only provide five documents for retrieval. The content is included in the code below:

# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Now let’s define the base class SearchEngineBase. Here is the specific code. You don’t have to rush to implement it. Just follow the rhythm and learn slowly. Even the most difficult things can be digested.

class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):
        # Read a document from disk and hand it to the indexer,
        # using the file path as the document ID
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):
        # Indexer interface: subclasses must override this
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        # Retriever interface: subclasses must override this
        raise Exception('search not implemented.')

def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

SearchEngineBase can be inherited, and each subclass represents a different algorithm engine. Every engine should implement the process_corpus() and search() functions, corresponding to the indexer and retriever we just discussed. The main() function loads the corpus and handles user queries, giving us a simply wrapped user interface.

Here are the specific details of the code.

  • The add_corpus() function is responsible for reading the file content. The file path is used as an ID and is sent to process_corpus() along with the content.
  • process_corpus() needs to process the content and save it along with the file path as an index.
  • search() takes a query, processes it, searches through the index, and returns the results.

Now that you understand these concepts, let’s implement a basic working search engine. The code is as follows:

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## Output ##########

simple
found 0 result(s):
little
found 2 result(s):
1.txt
2.txt

You may be surprised that so little code is enough to implement a search engine.

Yes, that’s right. Let’s break down the code:

SimpleEngine is a subclass of SearchEngineBase that implements the process_corpus() and search() interfaces. It also inherits the add_corpus() function (you could override it if you wanted to), so we can call it directly in the main() function.

In our new constructor, self.__id_to_texts = {} initializes our private variable, which is a dictionary to store file names as keys and file contents as values.

process_corpus() simply inserts the file content into the dictionary. Note that the ID needs to be unique, otherwise, the new content with the same ID will overwrite the old content.

search() iterates through the dictionary and finds the string to search for. If found, it adds the ID to the results list and returns it.

As you can see, it’s very simple, right? This process always follows the object-oriented programming principles. Here are some questions for you to think about, as a small review:

  • Now you should have a clearer understanding of the constructor calling order and methods in parent and subclass, right?
  • How are functions overridden when inheriting?
  • How does the base class serve as an interface? (Try deleting an overridden function in the subclass, or changing its parameters, and see what error is thrown; there is a quick experiment in the sketch after this list.)
  • How are methods and variables connected?
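
For the third question, here is a quick experiment (BrokenEngine is a hypothetical class, purely for illustration): a subclass that does not override the interface methods fails at call time with the exception defined in the base class.

# A hypothetical subclass that "forgets" to implement the interface
class BrokenEngine(SearchEngineBase):
    pass

engine = BrokenEngine()
engine.add_corpus('1.txt')
# Exception: process_corpus not implemented.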

Okay, let’s get back to the topic of search engines.

You can probably see that this implementation is simple but obviously inefficient: since the indexing step does no real work, the “index” stores every document in full and takes up a lot of space; and every query has to scan all the indexed documents, which takes a long time. If we take the total information content of the corpus as n, both the time complexity and the space complexity here are O(n).

And there’s another problem: a query can only be a single word or several consecutive words. If you want to search for multiple words scattered across different positions in a document, our simple engine can’t handle it.

How can we optimize it then?

One immediate idea is to tokenize the corpus, breaking each document into individual words, so that we only need to store the set of words in each document. According to Zipf’s law (https://en.wikipedia.org/wiki/Zipf%27s_law), in a natural-language corpus, a word’s frequency is inversely proportional to its rank in the frequency table, following a power-law distribution; a small number of words accounts for most of the text. Therefore, tokenizing the corpus can greatly improve our storage and search efficiency.
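
As a quick, informal illustration (assuming 1.txt from above sits in the working directory), you can compare the number of raw tokens with the number of distinct words; on a large corpus the gap becomes dramatic:

import re

text = open('1.txt', 'r').read()
tokens = re.sub(r'[^\w ]', ' ', text).lower().split()
# The set of distinct words is smaller than the token list, since
# frequent words ("i", "have", "a", ...) repeat many times
print('{} tokens, {} distinct words'.format(len(tokens), len(set(tokens))))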

So how can we implement it specifically?

Bag of Words and Inverted Index #

First, let’s implement a search model called Bag of Words. Take a look at the following code:

import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results
    
    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        # Remove punctuation and line breaks using regular expressions
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Generate a list of all words
        word_list = text.split(' ')
        # Remove empty words
        word_list = filter(None, word_list)
        # Return a set of words
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)


########## Output ##########


i have a dream
found 3 result(s):
1.txt
2.txt
3.txt
freedom children
found 1 result(s):
5.txt

You might notice that the code has become slightly more complex.

Let’s first understand a concept called Bag of Words Model, also known as BOW Model. It is one of the most common and simplest models in the field of NLP.

In this model, we treat a text as a collection of words without considering grammar, syntax, paragraphs, or the order in which the words appear. As a result, we replace id_to_texts with id_to_words, where we only need to store these words instead of the entire article, without considering their order.

In the process_corpus() function, we call the static method parse_text_to_words to break the article into a bag of words, convert it into a set, and then store it in a dictionary.

The search() function is a bit more complex. Here, we assume that the desired result is to find an article where all the search keywords appear. So, we need to break down the query into a set of words and compare each word in the set to every article in our index. This process is handled by the static method query_match.

You can review the static methods we learned in the previous lesson. As you can see, these two methods carry no state: they don’t touch the object’s private variables (there is no self argument), and given the same input they always produce the same output. Therefore, we make them static methods, so that other classes can conveniently use them as well.
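
Because they are static, other code can call them directly on the class, without creating an instance. For example:

print(BOWEngine.parse_text_to_words('I have a dream!'))
# A set such as {'i', 'have', 'a', 'dream'} (set ordering is arbitrary)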

However, even with these optimizations, we still need to iterate through all the IDs for each query. Although this saves a lot of time compared to the simple model, searching through billions of web pages is still too expensive. At this point, how can we further optimize?

You may think that the number of words in each query is not very large, usually just a few or at most a dozen. Can we start from here?

Also, the bag of words model doesn’t consider the order of words, but some people want the words to appear in a particular order or want the search words to be closer in the text. In this case, the bag of words model is no longer effective.

Can we do better for these two points? Obviously, we can. Take a look at the following code:

import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))

        # query_words_index keeps a moving pointer into the inverted
        # list of each query word
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If the inverted index of any query word is empty, we immediately return
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:

            # First, get the element each pointer currently points to
            current_ids = []

            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]

                # If we have traversed to the end of an inverted list, end the search
                if current_index >= len(current_inverted_list):
                    return result

                current_ids.append(current_inverted_list[current_index])

            # Then, if all elements in current_ids are the same, the corresponding
            # document contains every query word
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # If not, we advance the pointer at the smallest element by one
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # Use regular expressions to remove punctuation and newlines
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Generate a list of all words
        word_list = text.split(' ')
        # Remove empty words
        word_list = filter(None, word_list)
        # Return a set of words
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)

########## Output ##########

little
found 2 result(s):
1.txt
2.txt
little vicious
found 1 result(s):
2.txt

First of all, I want to emphasize that you don’t need to fully understand the algorithm this time. The implementation here goes beyond the scope of this chapter. But I hope you won’t give up because of that. This example will show you how object-oriented programming isolates algorithm complexity and keeps the interface and other code unchanged.

Let’s take a look at this code. You can see that the new model continues to use the previous interface and only modifies the __init__(), process_corpus(), and search() functions.

This is actually a way of collaboration in large companies, where each layer handles its own logic after reasonable layering design. When upgrading the core of our search engine iteratively, the main function and user interface remain unchanged. Of course, if the company hires a new front-end engineer and needs to modify the user interface part, the new person doesn’t need to worry too much about the back-end, they just need to focus on data interaction.

Continuing with the code, you may have noticed the beginning of the Inverted Index. The Inverted Index model is a well-known search engine method, and I will give a brief introduction.

The inverted index, as the name suggests, means that this time we keep a dictionary mapping each word to the list of IDs of the documents that contain it. Then everything becomes clear: in the search() function, we only need to pull out the inverted lists of the query words and find the elements common to all of them. Those common elements, i.e. the document IDs, are exactly the query results. This way, we avoid the embarrassment of traversing the entire index.
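
To make this concrete, here is roughly what the inverted index looks like for our five sample documents (an illustration of the data shape, hand-written rather than printed from the code, with most entries elided):

# word -> sorted list of IDs of the documents containing that word
inverted_index = {
    'dream':   ['1.txt', '2.txt', '3.txt'],
    'freedom': ['4.txt', '5.txt'],
    'little':  ['1.txt', '2.txt'],
    # ... one entry per distinct word in the corpus
}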

process_corpus() builds the inverted index. Note that the code here is kept very concise. In production systems, a unique ID generator is needed to assign each document a distinct ID, and the inverted lists should also be kept sorted by this unique ID.
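
As a sketch of what that could look like (the names here are hypothetical, not part of the lesson’s code), monotonically increasing integer IDs keep every posting list sorted simply by appending:

import itertools

_doc_id_counter = itertools.count()
_id_to_path = {}

def assign_doc_id(file_path):
    # Each document gets a unique, monotonically increasing integer ID,
    # so appending to posting lists keeps them sorted automatically
    doc_id = next(_doc_id_counter)
    _id_to_path[doc_id] = file_path
    return doc_id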

As for the search() function, it is enough to understand what it does: it fetches the inverted lists of all the query_words; if any list is missing, some query word appears in no document at all, so it returns an empty list right away. Once it has the lists, it runs a “merge K sorted arrays” style algorithm to find the IDs common to all of them and returns those.

Note that the algorithm used here is not optimal; the optimal solution uses a min-heap to track the list pointers. Merging K sorted lists is a well-known LeetCode hard problem. If you are interested, see: https://blog.csdn.net/qqxx6661/article/details/77814794
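
If you want a taste of the heap-based approach without the full LeetCode solution, here is a minimal sketch. It assumes each posting list is sorted and duplicate-free (true for our index, since each document contributes a word at most once): heapq.merge performs the k-way merge, and an ID common to all lists must appear exactly len(lists) times in the merged stream.

import heapq
from itertools import groupby

def intersect_sorted_lists(lists):
    # k-way merge via a heap, then keep IDs that occur in every list
    merged = heapq.merge(*lists)
    return [doc_id for doc_id, group in groupby(merged)
            if sum(1 for _ in group) == len(lists)]

print(intersect_sorted_lists([['1.txt', '2.txt'], ['2.txt', '5.txt']]))
# ['2.txt']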

The traversal problem is solved. Now, let’s address the second question: What if we want to implement searching for words that appear in order, or if we want the searched words to be closer together in the document?

We need to store, in the inverted index, the positions at which each word occurs in each document. That way, during the merge operation, we can take word positions into account.
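
Here is a minimal sketch of such a positional index (my own illustration, not the lesson’s implementation): map each word to {document ID: [positions]}, then a phrase matches when the positions line up consecutively.

def build_positional_index(id_to_ordered_words):
    # word -> {doc_id: [positions where the word occurs]}
    index = {}
    for doc_id, words in id_to_ordered_words.items():
        for pos, word in enumerate(words):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_in_doc(index, doc_id, phrase_words):
    # The phrase matches if, starting from some occurrence of the first
    # word, every following word occurs at the next position
    first_positions = index.get(phrase_words[0], {}).get(doc_id, [])
    return any(
        all(pos + offset in index.get(word, {}).get(doc_id, [])
            for offset, word in enumerate(phrase_words[1:], start=1))
        for pos in first_positions)

index = build_positional_index({'1.txt': ['i', 'have', 'a', 'dream']})
print(phrase_in_doc(index, '1.txt', ['have', 'a', 'dream']))  # True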

I will end the introduction to the inverted index here. If you are interested, you can search for more information. Remember, our focus is on the abstraction of object-oriented programming. Don’t forget to appreciate this idea.

LRU Cache and Multiple Inheritance #

At this point, your search engine is finally online and serving more and more traffic (QPS). While feeling happy and proud, you realize that your servers are starting to struggle with the load. After some investigation, you discover that more than 90% of the traffic comes from repeated queries. So you come up with a great solution: adding a cache to your search engine.

So, in this last section, let’s talk about caching and multiple inheritance.

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)
    
    def has(self, key):
        return key in self.cache
    
    def get(self, key):
        return self.cache[key]
    
    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)
    
    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)
        
        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

The code is simple. LRUCache defines a cache class whose methods you can use through inheritance. An LRU (least recently used) cache is a classic cache design that exploits locality of reference: it keeps recently used objects and gradually evicts objects that haven’t been used for a long time. In this example, to keep things simple, we use the pylru library for the LRU implementation.
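
If you are curious what an LRU cache does under the hood, here is a minimal sketch of the same idea built on collections.OrderedDict (a simplification for illustration, not pylru’s actual implementation; requires Python 3):

from collections import OrderedDict

class TinyLRUCache(object):
    def __init__(self, size=32):
        self.size = size
        self._data = OrderedDict()

    def has(self, key):
        return key in self._data

    def get(self, key):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.size:
            self._data.popitem(last=False)  # evict the least recently used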

Using this cache is straightforward. You can use the has() function to check if a key is in the cache, and if it is, use the get() function to retrieve the value directly. If the key is not in the cache, you can send the query to the background for computation, and then store the result in the cache.

As we can see, the class BOWInvertedIndexEngineWithCache uses multiple inheritance from two classes. First, pay attention to the constructor (did you think about this question from the last lesson?). There are two ways to initialize parent classes under multiple inheritance; let’s look at each one.

The first method initializes the first parent class directly with the following line of code:

super(BOWInvertedIndexEngineWithCache, self).__init__()

However, when using this method, the top-level parent class in the inheritance chain must inherit from object.

The second method is used for multiple inheritance when multiple constructors need to be called. In this case, we need to use the traditional method LRUCache.__init__(self).
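
Here is a minimal demonstration of the two styles with throwaway classes (Base, Mixin, and Child are hypothetical names). super() follows the method resolution order (MRO) and, since Base.__init__ does not itself call super(), only the first parent is initialized that way; the second parent is initialized explicitly, exactly as in our cache engine:

class Base(object):
    def __init__(self):
        print('Base.__init__')

class Mixin(object):
    def __init__(self):
        print('Mixin.__init__')

class Child(Base, Mixin):
    def __init__(self):
        super(Child, self).__init__()  # follows the MRO: runs Base.__init__
        Mixin.__init__(self)           # traditional explicit call

Child()
# Base.__init__
# Mixin.__init__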

Furthermore, note that the search() function is overridden by the subclass BOWInvertedIndexEngineWithCache, but inside it we still need to call the search() function of BOWInvertedIndexEngine. How do we do that? Take a look at the following line of code:

super(BOWInvertedIndexEngineWithCache, self).search(query)

With this, we explicitly call the parent class’s version of the overridden function.

With this implementation, we have achieved caching in a concise manner without modifying the existing code of BOWInvertedIndexEngine. I encourage you to read this section multiple times and understand it deeply. Through this example, try to grasp the advantages of inheritance.

Summary #

Today’s lesson is a practical application of object-oriented programming. Compared to the previous theoretical lessons, the content may feel less friendly. However, if you can settle down, study it carefully, and understand the key points of the whole process, it will definitely deepen your understanding of object-oriented programming. For example, you can test yourself on today’s lesson with the following two questions:

  • Can you extract all the attributes and functions of the classes in today’s lesson and draw the inheritance relationship on paper?
  • What is the iterative development process like?

In fact, for me, using a search engine as an example to explain object-oriented programming took quite some effort. Although it involves some specialized knowledge and algorithms of search engines, due to limited space, it can only be considered a starting point. If you have gained something from it, I will be satisfied.

Thought question #

Finally, I’ll leave you with a thought question. Can private variables be inherited? If not, how would you go about inheriting them? Feel free to leave a comment to share and discuss with me. You are also welcome to share this article with your colleagues and friends to exchange ideas and make progress together.