Extra Meal Midterm Test Reference Implementation Explanation

Extra Lesson: Midterm Test - Reference Implementation Explained #

Hello, I’m Tyr Chen.

In the last lecture, I gave you a simple midterm exam exercise—I wonder how you did. Today, we’ll briefly discuss the implementation to give you a reference.

Supporting grep is not a complicated matter, and I believe that after using clap, glob, rayon, and regex, you could all write similar code (pseudocode):

/// Yet another simplified grep built with Rust.
#[derive(Clap, Debug)]
#[clap(version = "1.0", author = "Tyr Chen <[email protected]>")]
#[clap(setting = AppSettings::ColoredHelp)]
pub struct GrepConfig {
    /// regex pattern to match against file contents
    pattern: String,
    /// Glob of file pattern
    glob: String,
}

impl GrepConfig {
    pub fn matches(&self) -> Result<()> {
        let regex = Regex::new(&self.pattern)?;
        let files: Vec<_> = glob::glob(&self.glob)?.collect();
        files.into_par_iter().for_each(|v| {
            if let Ok(filename) = v {
                if let Ok(file) = File::open(&filename) {
                    let reader = BufReader::new(file);
                    for (lineno, line) in reader.lines().enumerate() {
                        if let Ok(line) = line {
                            if let Some(_) = pattern.find(&line) {
                                println!("{}: {}", lineno + 1, &line);
                            }
                        }
                    }
                }
            }
        });
        Ok(())
    }
}

This code feels very much like writing in Python, barring the time it takes to read a few dependencies, it’s almost trivial.

However, this code lacks testability, which will cause trouble for future maintenance and extension. Let’s look at how to optimize this snippet to be more testable.

How to Write a Good Implementation #

First, we need to strip out the main logic.

What is the main logic? Naturally, it is the grep for a single file, which is the part marked in the code. We can abstract it into a function:

fn process(reader: BufReader<File>)

Of course, from an interface perspective, this process function is too rigidly defined. What if the data is not from a File and one day the requirements change to also support data from stdio? Then the interface would need to change.

Therefore, we can use generics:

fn process<R: Read>(reader: BufReader<R>)

The generic parameter R only needs to satisfy the std::io::Read trait.

Although we have extracted this interface, it is still not testable because it directly uses println! to print out the found data. We can certainly return the lines to be printed in a Vec for testing.

However, this is testing for the sake of testing, a better way is to abstract the output from Stdout to any Write. Now the process interface would look like this:

fn process<R: Read, W: Write>(reader: BufReader<R>, writer: &mut Writer)

With this, we can use &[u8] implementing the Read trait as input, and a Vec that implements the Write trait as output, to perform tests. And in the rgrep implementation, we use File as input and Stdout as output. This satisfies the requirements, makes the core logic testable, and keeps the interface flexible enough to adapt to any input that implements Read and any output that implements Write.

Okay, with this in mind, let’s see how I wrote this rgrep, for your reference.

First cargo new rgrep to create a new project. Add the following dependencies to Cargo.toml:

[dependencies]
anyhow = "1"
clap = "3.0.0-beta.4" # We need to use version 3.0.0-beta.4 or later
colored = "2"
glob = "0.3"
itertools = "0.10"
rayon = "1"
regex = "1"
thiserror = "1"

For handling command lines with clap, we need version 3.0. Don’t mind the VS Code plugin telling you the latest version is 2.33; that’s because beta versions are not considered official.

Then create src/lib.rs and src/error.rs. In error.rs, add some error definitions:

use thiserror::Error;

#[derive(Error, Debug)]
pub enum GrepError {
    #[error("Glob pattern error")]
    GlobPatternError(#[from] glob::PatternError),
    #[error("Regex pattern error")]
    RegexPatternError(#[from] regex::Error),
    #[error("I/O error")]
    IoError(#[from] std::io::Error),
}

These are errors that need to be converted. thiserror can help us complete the conversion of error types with macros.

In src/lib.rs, add the following code:

use clap::{AppSettings, Clap};
use colored::*;
use itertools::Itertools;
use rayon::iter::{IntoParallelIterator, ParallelIterator};
use regex::Regex;
use std::{
    fs::File,
    io::{self, BufRead, BufReader, Read, Stdout, Write},
    ops::Range,
    path::Path,
};

mod error;
pub use error::GrepError;

/// Define types to simplify the writing of complex types when used
pub type StrategyFn<W, R> = fn(&Path, BufReader<R>, &Regex, &mut W) -> Result<(), GrepError>;

/// A simplified version of grep, supporting regular expressions and file wildcards
#[derive(Clap, Debug)]
#[clap(version = "1.0", author = "Tyr Chen <[email protected]>")]
#[clap(setting = AppSettings::ColoredHelp)]
pub struct GrepConfig {
    /// regex pattern for searching
    pattern: String,
    /// file wildcard
    glob: String,
}

impl GrepConfig {
    /// Search for matches using the default strategy
    pub fn match_with_default_strategy(&self) -> Result<(), GrepError> {
        self.match_with(default_strategy)
    }

    /// Search for matches using a specific strategy function
    pub fn match_with(&self, strategy: StrategyFn<Stdout, File>) -> Result<(), GrepError> {
        let regex = Regex::new(&self.pattern)?;
        // Generate a list of all files matching the wildcard
        let files: Vec<_> = glob::glob(&self.glob)?.collect();
        // Process all files in parallel
        files.into_par_iter().for_each(|v| {
            if let Ok(filename) = v {
                if let Ok(file) = File::open(&filename) {
                    let reader = BufReader::new(file);
                    let mut stdout = io::stdout();

                    if let Err(e) = strategy(filename.as_path(), reader, &regex, &mut stdout) {
                        println!("Internal error: {:?}", e);
                    }
                }
            }
        });
        Ok(())
    }
}

/// Default strategy, search sequentially from start to end and output to writer
pub fn default_strategy<W: Write, R: Read>(
    path: &Path,
    reader: BufReader<R>,
    pattern: &Regex,
    writer: &mut W,
) -> Result<(), GrepError> {
    let matches: String = reader
        .lines()
        .enumerate()
        .map(|(lineno, line)| {
            line.ok()
                .map(|line| {
                    pattern
                        .find(&line)
                        .map(|m| format_line(&line, lineno + 1, m.range()))
                })
                .flatten()
        })
        .filter_map(|v| v.ok_or(()).ok())
        .join("\n");

    if !matches.is_empty() {
        writer.write(path.display().to_string().green().as_bytes())?;
        writer.write(b"\n")?;
        writer.write(matches.as_bytes())?;
        writer.write(b"\n")?;
    }

    Ok(())
}

/// Format the output of matched lines, including line number, column number, and the first matched item with highlight
pub fn format_line(line: &str, lineno: usize, range: Range<usize>) -> String {
    let Range { start, end } = range;
    let prefix = &line[..start];
    format!(
        "{0: >6}:{1: <3} {2}{3}{4}",
        lineno.to_string().blue(),
        // Find the starting position of the match. Note that for non-ascii characters such as Chinese characters, we cannot use prefix.len()
        // This is an O(n) operation, which will slow down efficiency, here it is just to demonstrate the effect
        (prefix.chars().count() + 1).to_string().cyan(),
        prefix,
        &line[start..end].red(),
        &line[end..]
    )
}

Different from the initial thought, the process function is now called default_strategy(). Additionally, we provided two methods for GrepConfig, one is match_with_default_strategy(), and the other is match_with(), where the caller can pass in a function or closure to process the given BufReader. This is a commonly used method of decoupling.

In src/lib.rs, continue writing unit tests:

#[cfg(test)]
mod tests {

    use super::*;

    #[test]
    fn format_line_should_work() {
        let result = format_line("Hello, Tyr~", 1000, 7..10);
        let expected = format!(
            "{0: >6}:{1: <3} Hello, {2}~",
            "1000".blue(),
            "7".cyan(),
            "Tyr".red()
        );
        assert_eq!(result, expected);
    }

    #[test]
    fn default_strategy_should_work() {
        let path = Path::new("src/main.rs");
        let input = b"hello world!\nhey Tyr!";
        let reader = BufReader::new(&input[..]);
        let pattern = Regex::new(r"he\\w+").unwrap();
        let mut writer = Vec::new();
        default_strategy(path, reader, &pattern, &mut writer).unwrap();
        let result = String::from_utf8(writer).unwrap();
        let expected = [
            String::from("src/main.rs"),
            format_line("hello world!", 1, 0..5),
            format_line("hey Tyr!\n", 2, 0..3),
        ];

        assert_eq!(result, expected.join("\n"));
    }
}

Pay special attention to how the test uses the default_strategy() function, and how the match_with() method uses it. Running cargo test should pass both tests.

Finally, add command line processing logic in src/main.rs:

use anyhow::Result;
use clap::Clap;
use rgrep::*;

fn main() -> Result<()> {
    let config: GrepConfig = GrepConfig::parse();
    config.match_with_default_strategy()?;

    Ok(())
}

Run in the command line: cargo run --quiet -- "Re[^\\s]+" "src/*.rs" and you will get an output similar to the following. Note that the order of file output may not be exactly the same, since rayon operates on multiple threads in parallel.

Summary #

rgrep is a simple command-line tool. With just a few hundred lines of code, we’ve created a performance-pretty-good simplified version of grep. When not making complex interface designs, we can avoid lifetimes, generics, and even not being too concerned about ownership to write code very similar to scripting languages.

In this sense, Rust is also suitable for one-off, disposable code, or in other words, for writing a quick prototype. When we need better code quality, higher abstraction, and more flexible design, Rust provides enough tools to evolve the prototype into more mature code.

I believe you can feel the joy of developing software with Rust while doing rgrep.

We won’t assign thinking exercises today. You can reflect on the implementations of the KV server and rgrep tool. Congratulations on completing the foundational section of Rust learning; we’re more than halfway through, and I’ll see you in the advancement section in the next lesson.

Feel free to share this with friends and invite them to discuss together.

Further Reading #

There is a freshly baked video on YouTube: Visualizing memory layout of Rust’s data types, which summarizes the main data structures we mentioned in the first twenty lectures of the foundational section in 40 minutes. I personally really like this video, because it coincides with the mindset of “clarifying how data is stored on the heap and stack” that I have always advocated. I recommend it to you here as well. If you want a quick review, to check for any gaps in knowledge, then I highly suggest spending an hour to watch this video carefully.