06 Getting Hands Dirty: How Can SQL Query Tools Serve Multiple Purposes? #

Hello, I am Chen Tian.

Through examples like HTTPie and Thumbor, I believe you now have a more intuitive feel for Rust's capabilities and coding style. We have mentioned before that Rust has a very wide range of applications, but those two examples alone don't make that obvious.

Some students want to see real-world code with plenty of lifetime annotations; some are curious about Rust's macros; some are interested in interoperability between Rust and other languages; and some want to know what it feels like to use Rust for client-side development. So today we are going to cover all of these topics with one hardcore example.

Without further ado, let’s get started.

SQL #

We often deal with various data sources at work: databases, Parquet, CSV, JSON, and so on. Interacting with them mostly comes down to fetching, filtering, projecting, and sorting data.

For big data work, tools like Spark SQL can run SQL queries over heterogeneous data, but our everyday use of SQL is nowhere near as powerful. Any DBMS supports querying its own tables with SQL, but if you want to query a CSV or JSON file with SQL, a lot of additional processing is usually required.

So, wouldn't it be useful to have a simple tool that, without pulling in Spark, supports SQL queries against any data source?

For instance, wouldn't it be amazing if your shell could support this use case?

[figure: querying the output of the ps command with SQL in a shell]

Or consider a client fetching a subset of data from a server API. If that subset could be queried directly with SQL on the front end, things would be very flexible, and users would get immediate responses.

There is a well-known Greenspun’s tenth rule in the software field:

“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”

Let’s create a Programmer’s 42nd Law in the spirit of this:

“Any API interface that becomes sufficiently complex will contain an ad hoc, informally-specified, bug-ridden, slow implementation of half of an SQL capability.”

So today, how about we design a library that can query any data source using SQL and get results? Of course, as an MVP (Minimum Viable Product), let’s just support CSV SQL querying for now. Not only that, but we also hope this library can be used with Python3 and Node.js.

Guess how many lines of code this library will take. Today's challenge is fairly tough, so surely at least 500 lines, right? Let's set 500 lines of code as the benchmark for this challenge.

Design Analysis #

First, we need an SQL parser. Writing a parser in Rust is not difficult; you can build one with a parser combinator library like nom or a PEG parser like pest. However, the Rust community already has SQL parsing covered, for example with sqlparser-rs.

Next is how to load CSV or other data sources into a DataFrame.

Anyone who has done data processing or used pandas should be familiar with the DataFrame: a two-dimensional, table-like data structure in which each column may have a different type, and on which you can perform filtering, projection, sorting, and other operations.
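
To make the concept concrete, here is a minimal sketch (the columns are made up for illustration) of building a small DataFrame with polars and running a filter and a projection on it:

use polars::prelude::*;

// Build a tiny DataFrame by hand: two columns with different types.
fn demo() -> Result<DataFrame> {
    let names = Series::new("name", &["alice", "bob", "carol"]);
    let ages = Series::new("age", &[30i64, 25, 41]);
    let df = DataFrame::new(vec![names, ages])?;

    // Filter: keep rows where age > 28, then project the "name" column.
    let filtered = df.filter(&df["age"].gt(28))?;
    filtered.select("name")
}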

In Rust, we can use polars to load data from CSV to DataFrame and perform various subsequent operations.

Having settled on these two libraries, the remaining work revolves around mapping the abstract syntax tree (AST) produced by sqlparser onto the DataFrame operations in polars.

An abstract syntax tree is a tool for describing complex grammar rules. From SQL or a DSL all the way to a full programming language, the structure of a language can be described with an AST, as shown in the figure below (source: Wikipedia):

[figure: an example abstract syntax tree (source: Wikipedia)]

How do we map SQL syntax onto DataFrame operations? For example, if we want to select three columns from the data, "select a, b, c" should map to projecting the columns a, b, and c of a DataFrame.

Polars internally has its own AST that can aggregate the various operations. For example, for "where a > 10 and b < 5", the polars expression is col("a").gt(lit(10)).and(col("b").lt(lit(5))): col refers to a column, gt/lt mean greater than/less than, and lit stands for a literal value.
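
As a quick illustration, here is a minimal sketch of how such an expression plugs into polars' lazy API, assuming a DataFrame with numeric columns a and b:

use polars::prelude::*;

// Apply the expression AST through the lazy API; nothing executes
// until collect() is called.
fn apply_where(df: DataFrame) -> Result<DataFrame> {
    df.lazy()
        .filter(col("a").gt(lit(10)).and(col("b").lt(lit(5))))
        .collect()
}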

With this understanding, the core problem of “querying CSV and other sources with SQL” becomes how to convert one AST (SQL AST) into another AST (DataFrame AST).

Wait a minute, isn't this exactly what macro programming does (in Rust, procedural macros)? After analyzing the two data structures further, we can see the following relationship:

[figure: the correspondence between the sqlparser AST and the polars expression AST]

You see, our main task is to convert between two data structures. So after writing today's code, you will surely feel much more confident about macros.

Macro programming is no big deal. Aside from quote/unquote, the main job is converting one syntax tree into another; and if you look closely at that conversion, it is nothing more than turning one data structure into another. So to summarize: the main flow of macro programming is implementing a few From and TryFrom conversions. Isn't that simple?
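
To make the idea concrete, here is a toy sketch (the node types are made up for illustration; they are not from sqlparser or polars) of "AST conversion is just From":

// Two made-up AST node types, standing in for a source AST and a target AST.
struct SqlColumn {
    name: String,
}

struct DfColumn {
    name: String,
}

// Converting between ASTs is literally one From impl per node type.
impl From<SqlColumn> for DfColumn {
    fn from(col: SqlColumn) -> Self {
        Self { name: col.name }
    }
}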

Of course, the conversion process is tedious. If the language itself lacks good pattern-matching support, macro programming quickly turns into self-inflicted torture.

Fortunately, Rust has great pattern-matching support. Although it is not as powerful as Erlang/Elixir's, it still outclasses most programming languages, and you will experience this firsthand when writing macros.

Creating an SQL Dialect #

Okay, now that we have analyzed what we need to do, let’s proceed with writing code in an orderly manner.

Create a library with cargo new queryer --lib. Open the generated directory in VSCode, create a new folder named examples at the same level as src, and add the following code to Cargo.toml:

[[example]]
name = "dialect"

[dependencies]
anyhow = "1" # Error handling, we should use thiserror for libraries, but for simplicity, let's not complicate things
async-trait = "0.1" # Allows traits to have async fn
sqlparser = "0.10" # SQL parser
polars = { version = "0.15", features = ["json", "lazy"] } # DataFrame library
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] } # Our old friend, the HTTP client
tokio = { version = "1", features = ["fs"]} # Our old friend, the async library, we need asynchronous file handling here
tracing = "0.1" # Log processing

[dev-dependencies]
tracing-subscriber = "0.2" # Log processing
tokio = { version = "1", features = ["full"]} # We need more tokio features in the example

The dependencies are set. Let's write an example that tries out sqlparser; thanks to the [[example]] section above, Cargo will look for a file named dialect.rs in the examples directory.

So, create a file named examples/dialect.rs and write some testing code for sqlparser:

use sqlparser::{dialect::GenericDialect, parser::Parser};

fn main() {
    tracing_subscriber::fmt::init();

    let sql = "SELECT a a1, b, 123, myfunc(b), * \
    FROM data_source \
    WHERE a > b AND b < 100 AND c BETWEEN 10 AND 20 \
    ORDER BY a DESC, b \
    LIMIT 50 OFFSET 10";

    let ast = Parser::parse_sql(&GenericDialect::default(), sql);
    println!("{:#?}", ast);
}

This code feeds an SQL statement to Parser::parse_sql to see what it outputs. When writing library code, if you encounter a third-party library you don't yet understand, writing a small example like this is a good way to try it out.

Run the command cargo run --example dialect to view the result:

Ok([Query(
    Query {
        with: None,
        body: Select(
            Select {
                distinct: false,
                top: None,
                projection: [ ... ],
                from: [ TableWithJoins { ... } ],
                selection: Some(BinaryOp { ... }),
                ...
            }
        ),
        order_by: [ OrderByExpr { ... } ],
        limit: Some(Value( ... )),
        offset: Some(Offset { ... })
    }
])

I simplified this structure a bit; what you see in the command line would be much more complex.

While writing the FROM clause of this example, did you suddenly wish that the SQL FROM clause could take a URL or a filename? That way, we could read data from that URL or file, just like the "select * from ps" example at the beginning, where the output of the ps command serves as the data source.

Ordinary SQL doesn't support this syntax, but sqlparser allows you to create your own SQL dialect, so let's give it a try.

Create a file named src/dialect.rs and add the following code:

use sqlparser::dialect::Dialect;

#[derive(Debug, Default)]
pub struct TyrDialect;

// Create a custom SQL dialect: TyrDialect accepts simple URLs as identifiers.
impl Dialect for TyrDialect {
    fn is_identifier_start(&self, ch: char) -> bool {
        ('a'..='z').contains(&ch) || ('A'..='Z').contains(&ch) || ch == '_'
    }

    // Identifier can include ':', '/', '?', '&', '='
    fn is_identifier_part(&self, ch: char) -> bool {
        ('a'..='z').contains(&ch)
            || ('A'..='Z').contains(&ch)
            || ('0'..='9').contains(&ch)
            || [':', '/', '?', '&', '=', '-', '_', '.'].contains(&ch)
    }
}

/// Helper function for testing.
pub fn example_sql() -> String {
    let url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv";

    let sql = format!(
        "SELECT location name, total_cases, new_cases, total_deaths, new_deaths \
        FROM {} where new_deaths >= 500 ORDER BY new_cases DESC LIMIT 6 OFFSET 5",
        url
    );

    sql
}

#[cfg(test)]
mod tests {
    use super::*;
    use sqlparser::parser::Parser;

    #[test]
    fn it_works() {
        assert!(Parser::parse_sql(&TyrDialect::default(), &example_sql()).is_ok());
    }
}

This code implements the sqlparser::Dialect trait, overriding the methods that decide which characters the SQL parser treats as part of an identifier. You also need to add mod dialect; to src/lib.rs to bring this file into the crate. Finally, we wrote a test, which you can run with cargo test.
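
For completeness, src/lib.rs only needs the module declaration at this point (a minimal sketch):

// src/lib.rs
mod dialect;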

Tests pass! Now we can parse SQL like this:

SELECT * from https://abc.xyz/covid-cases.csv where new_deaths >= 500

Cool! As you can see, with about ten lines of code that simply treat the extra characters found in URLs as valid identifier characters, we have implemented a custom SQL dialect parser that supports URLs.

Why is this so effortless? Because with traits, you can easily achieve Inversion of Control, a very common technique in Rust development.
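
Here is a schematic sketch of that inversion of control (a toy function, not sqlparser's actual tokenizer): the library-side code drives the scan, but defers the per-character decision to whatever Dialect the caller supplies.

use sqlparser::dialect::Dialect;

// Toy scanner: consume the longest identifier prefix of `input`,
// asking the caller-supplied dialect which characters are allowed.
fn scan_identifier(dialect: &dyn Dialect, input: &str) -> String {
    input
        .chars()
        .take_while(|ch| dialect.is_identifier_part(*ch))
        .collect()
}

Called with &TyrDialect::default(), a scanner like this would happily consume an entire URL as a single identifier.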

Implementing AST Conversion #

Now that we have completed SQL parsing, next is to perform AST conversion using polars.

Since we aren't too familiar with the polars library yet, let's first test how to use it. Create examples/covid.rs (remember to register it in Cargo.toml with another [[example]] section, name = "covid", just like dialect), and manually implement loading and querying a DataFrame:

use anyhow::Result;
use polars::prelude::*;
use std::io::Cursor;

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv";
    let data = reqwest::get(url).await?.text().await?;

    // Read the CSV directly with polars, inferring the schema from sample rows
    let df = CsvReader::new(Cursor::new(data))
        .infer_schema(Some(16))
        .finish()?;

    // Keep the rows where new_deaths > 500, then project five columns
    let filtered = df.filter(&df["new_deaths"].gt(500))?;
    println!(
        "{:?}",
        filtered.select((
            "location",
            "total_cases",
            "new_cases",
            "total_deaths",
            "new_deaths"
        ))
    );

    Ok(())
}

Running this example prints a nicely formatted table: the countries and regions in the owid-covid-latest.csv file on GitHub whose new_deaths exceed 500.

We ultimately want to achieve the same effect by parsing a similar SQL query. How do we do that?

As discussed at the beginning of today's lecture, the main task is to convert the AST parsed by sqlparser into the AST defined by polars. Looking back at the output of the SQL AST:

Ok([Query(
    Query {
        with: None,
        body: Select(
            Select {
                distinct: false,
                top: None,
                projection: [ ... ],
                from: [ TableWithJoins { ... } ],
                selection: Some(BinaryOp { ... }),
                ...
            }
        ),
        order_by: [ OrderByExpr { ... } ],
        limit: Some(Value( ... )),
        offset: Some(Offset { ... })
    }
])

Here, Query is one variant of the Statement enum; besides queries, SQL statements also cover inserting data, deleting data, creating tables, and so on. We don't care about those today, only Query.

So let's create a file named src/convert.rs, first define a data structure Sql to describe the correspondence between the two, and then implement TryFrom for it:

use anyhow::anyhow;
use polars::prelude::Expr;
use sqlparser::ast::Statement;

/// Parsed SQL
pub struct Sql<'a> {
    pub(crate) selection: Vec<Expr>,
    pub(crate) condition: Option<Expr>,
    pub(crate) source: &'a str,
    pub(crate) order_by: Vec<(String, bool)>,
    pub(crate) offset: Option<i64>,
    pub(crate) limit: Option<usize>,
}

impl<'a> TryFrom<&'a Statement> for Sql<'a> {
    type Error = anyhow::Error;
    fn try_from(sql: &'a Statement) -> Result<Self, Self::Error> {
        match sql {
            // Currently, we only care about query (select ... from ... where ...)
            Statement::Query(q) => {
                ...
            }
            // Everything else (insert, create table, ...) is rejected
            _ => Err(anyhow!("We only support Query at the moment")),
        }
    }
}

With the skeleton in place, let's continue writing the conversion. Look at the structure of Query: its body is a SetExpr, which in our case should be the Select variant, and a Select contains projection, from, and selection. In Rust, we can use a single assignment with pattern matching and data destructuring to extract them all at once:

let Select {
    from: table_with_joins,
    selection: where_clause,
    projection,

    group_by: _,
    ..
} = match &q.body {
    SetExpr::Select(statement) => statement.as_ref(),
    _ => return Err(anyhow!("We only support Select Query at the moment")),
};

In one statement we match, take a reference, and destructure several fields inside that reference into variables, all in one go. Such a comfortable experience! With language features like this boosting productivity, how can you not love it?

Let's look at another example: handling Offset. We convert sqlparser's Offset into an i64, this time by implementing the From trait. Here again, the destructuring of the data structure happens inside one branch of the match.

use sqlparser::ast::{Expr as SqlExpr, Offset as SqlOffset, Value as SqlValue};

// Due to Rust's orphan rules, if we want to implement an existing trait for
// an existing type, we need to wrap it a bit

pub struct Offset<'a>(pub(crate) &'a SqlOffset);

/// Convert SqlParser's offset expr to i64
impl<'a> From<Offset<'a>> for i64 {
    fn from(offset: Offset) -> Self {
        match offset.0 {
            SqlOffset {
                value: SqlExpr::Value(SqlValue::Number(v, _b)),
                ..
            } => v.parse().unwrap_or(0),
            _ => 0,
        }
    }
}
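
With the wrapper in place, a call site only needs an .into(). Here is a hypothetical snippet, where parsed_offset stands for whatever &SqlOffset was pulled out of the Query:

// Hypothetical call site: parsed_offset is a &SqlOffset taken from the Query.
let offset: i64 = Offset(parsed_offset).into();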

Yes, data destructuring can also be done inside match branches, just like the if let/while let we covered in the third lecture. The more you use this all-around pattern-matching support, the more grateful you will be to Rust's designers, especially when developing procedural macros.

From this piece of code, you can also see that the data structure Offset carries the lifetime annotation <'a>, because it holds a reference to a SqlOffset. We will discuss lifetimes soon; for now, you don't need to understand why this is done.

The rest of src/convert.rs is mostly more of the same: pattern matching and conversion between the various sub-structures. The code is fairly repetitive and similar to what you've seen above, so I won't paste it here; you can find the complete code in this course's GitHub repo at 06_queryer/queryer/src/convert.rs.

When you write procedural macros in Rust in the future, you will basically be doing this kind of work, except that at the end you need to emit the converted AST as code using quote. In this example we don't need to do that; polars' lazy interface can consume the AST directly.
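
For a taste of that final step, here is a toy sketch of the quote part (assuming the quote and proc-macro2 crates, which this project does not use): after converting your AST, a procedural macro emits the result back out as Rust tokens.

use proc_macro2::{Ident, TokenStream};
use quote::quote;

// Toy codegen: given a field name, emit a getter method as Rust tokens.
fn emit_getter(field: Ident) -> TokenStream {
    quote! {
        pub fn #field(&self) -> &str {
            &self.#field
        }
    }
}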

A side note: