32 Practical Project: Using PyO3 to Develop a Python 3 Module

Practical Project: Developing a Python 3 Module with PyO3 #

Hello, I’m Chen Tian.

In the last lecture, we introduced the basic usage of FFI. Today, let’s strike while the iron is hot and carry out a practical project to experience how to introduce excellent libraries from the Rust ecosystem to the Python/Node.js community.

Since the community already has tools like PyO3 and Neon, we don’t need to handle the details of making Rust code compatible with the C ABI ourselves; these tools take care of it. So today we will mainly write the FFI shim layer code. (Figure: the shim layer.)

Additionally, the basic operations of PyO3 and Neon are the same. If you know how to use one, it will be easy to understand the other. This lecture will take PyO3 as an example.

So, what kind of library should we provide to Python?

After some thought, I feel that embeddable search engines are still a weak spot in the Python community. whoosh, the one I know, has not been updated for many years, and pylucene requires running a JVM inside Python, which never feels comfortable. Although flexsearch in Node.js looks good (I haven’t used it), overall both communities could use a more powerful search engine.

In Rust, there is the embedded search engine tantivy. Let’s use it to provide search engine functionality.

However, tantivy’s interface is relatively complex, and today’s topic is not how to use a search engine’s interface. So I created xunmi, a crate built on tantivy that provides a very simple set of interfaces. Our goal today is to provide Python bindings for these interfaces and make them feel natural to use from Python.

Below is an example of calling xunmi with Rust:

use std::{str::FromStr, thread, time::Duration};
use xunmi::*;

fn main() {
    // Load a predefined schema from a YAML format configuration file
    let config = IndexConfig::from_str(include_str!("../fixtures/config.yml")).unwrap();

    // Open or create an index
    let indexer = Indexer::open_or_create(config).unwrap();

    // Data to index, which can be in xml/yaml/json format
    let content = include_str!("../fixtures/wiki_00.xml");

    // The wikipedia dump we use is in XML format, so InputType::Xml
    // Here, the wikipedia data structure id is a string, but the index schema is u64
    // Wikipedia does not have a content field; the node content ($value) is equivalent to content
    // So we need to define some data format conversions
    let config = InputConfig::new(
        InputType::Xml,
        vec![("$value".into(), "content".into())],
        vec![("id".into(), (ValueType::String, ValueType::Number))],
    );

    // Get the index updater to update the index
    let mut updater = indexer.get_updater();
    // You can use multiple updaters to update the same index in different contexts
    let mut updater1 = indexer.get_updater();

    // You can use add/update to refresh the index; add directly adds, update will delete the existing doc and then add a new one
    updater.update(content, &config).unwrap();
    // You can add multiple sets of data and commit them together
    updater.commit().unwrap();

    // Update index in another context
    thread::spawn(move || {
        let config = InputConfig::new(InputType::Yaml, vec![], vec![]);
        let text = include_str!("../fixtures/test.yml");

        updater1.update(text, &config).unwrap();
        updater1.commit().unwrap();
    });

    // By default, the indexer will automatically reload after each commit, but this will take hundreds of milliseconds of delay
    // In this example, we wait for a while before querying
    while indexer.num_docs() == 0 {
        thread::sleep(Duration::from_millis(100));
    }

    println!("total: {}", indexer.num_docs());

    // You can provide a query to get search results
    let result = indexer.search("历史", &["title", "content"], 5, 0).unwrap();
    for (score, doc) in result.iter() {
        // content is indexed but not stored in the schema, so it does not appear in the output
        println!("score: {}, doc: {:?}", score, doc);
    }
}

Here is what the index configuration file looks like:

---
path: /tmp/searcher_index # Index path
schema: # Schema for indexing, for texts, use CANG_JIE for Chinese word segmentation
  - name: id
    type: u64
    options:
      indexed: true
      fast: single
      stored: true
  - name: url
    type: text
    options:
      indexing: ~
      stored: true
  - name: title
    type: text
    options:
      indexing:
        record: position
        tokenizer: CANG_JIE
      stored: true
  - name: content
    type: text
    options:
      indexing:
        record: position
        tokenizer: CANG_JIE
      stored: false # We only index but do not store content
text_lang:
  chinese: true # If it is true, automatically convert traditional to simplified Chinese
writer_memory: 100000000

The goal is to use PyO3 to make this Rust code usable from Python in the same style. (Figure: Python usage.)

Alright, let’s not waste words; let’s start today’s project challenge.

First, create a new project with cargo new xunmi-py --lib and add in Cargo.toml:

[package]
name = "xunmi-py"
version = "0.1.0"
edition = "2021"

[lib]
name = "xunmi"
crate-type = ["cdylib"]

[dependencies]
pyo3 = {version = "0.14", features = ["extension-module"]}
serde_json = "1"
xunmi = "0.2"

[build-dependencies]
pyo3-build-config = "0.14"

This defines the name and type of the lib. We name the lib xunmi, which is the name used to import it in Python, and set crate-type to cdylib. We also need the pyo3-build-config crate for some simple processing at build time (needed on macOS).

Preliminaries #

Next, before writing code, there is some preparatory work, mainly the build script and Makefile, to allow us to easily generate the Python library.

Create build.rs and add:

fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    pyo3_build_config::add_extension_module_link_args();
}

It adds some linker options at build time. If you don’t want to handle this in build.rs, you can instead create .cargo/config and add:

[target.x86_64-apple-darwin]
rustflags = [
  "-C", "link-arg=-undefined",
  "-C", "link-arg=dynamic_lookup",
]

The effects of the two are equivalent.

Then we create a directory named xunmi, create xunmi/__init__.py in it, and add:

from .xunmi import *

Lastly, create a Makefile and add:

# If your BUILD_DIR is different, you can make BUILD_DIR=<your-dir>
BUILD_DIR := target/release

SRCS := $(wildcard src/*.rs) Cargo.toml
NAME = xunmi
TARGET = lib$(NAME)
BUILD_FILE = $(BUILD_DIR)/$(TARGET).dylib
BUILD_FILE1 = $(BUILD_DIR)/$(TARGET).so
TARGET_FILE = $(NAME)/$(NAME).so

all: $(TARGET_FILE)

test: $(TARGET_FILE)
	python3 -m pytest

$(TARGET_FILE): $(BUILD_FILE1)
	@cp $(BUILD_FILE1) $(TARGET_FILE)

$(BUILD_FILE1): $(SRCS)
	@cargo build --release
	@mv $(BUILD_FILE) $(BUILD_FILE1)|| true

.PHONY: test all

This Makefile automates a few tasks, mainly building the library and copying the compiled .dylib or .so into the xunmi directory for Python to use.
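The reason the Makefile renames the .dylib to .so is that CPython on Unix-like systems, macOS included, only imports extension modules whose file names end in one of its accepted suffixes. The sketch below, which uses only the standard library, lists those suffixes for the running interpreter. The helper names extension_suffixes and cargo_output_name are made up for this illustration, and the second helper only covers macOS and Linux.

```python
import importlib.machinery
import sys


def extension_suffixes():
    """Return the file suffixes this interpreter accepts for extension modules."""
    return list(importlib.machinery.EXTENSION_SUFFIXES)


def cargo_output_name(lib_name, platform=sys.platform):
    """File name cargo gives a cdylib called `lib_name` (macOS and Linux only)."""
    ext = "dylib" if platform == "darwin" else "so"
    return f"lib{lib_name}.{ext}"


if __name__ == "__main__":
    # What the interpreter is willing to import as an extension module.
    print("accepted suffixes:", extension_suffixes())
    # What `cargo build --release` leaves in target/release on this platform.
    print("cargo builds:", cargo_output_name("xunmi"))
    # What the Makefile produces from it.
    print("Makefile target: xunmi/xunmi.so")
```

On macOS and Linux the printed list normally contains a plain ".so" entry, which is why copying the library to xunmi/xunmi.so is enough for import to find it.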

Writing Code #

Next is how to write FFI shim code. PyO3 provides us with a series of macros that can conveniently map Rust’s data structures, functions, methods, and error types to Python classes, functions, methods, and exceptions. Let’s take a look one by one.

Registering Rust structs as Python classes #

Previously, in [Lecture 6], we briefly introduced how to register functions in a pymodule:

use pyo3::{exceptions, prelude::*};

#[pyfunction]
pub fn example_sql() -> PyResult<String> {
    Ok(queryer::example_sql())
}

#[pyfunction]
pub fn query(sql: &str, output: Option<&str>) -> PyResult<String> {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let data = rt.block_on(async { queryer::query(sql).await.unwrap() });
    match output {
        Some("csv") | None => Ok(data.to_csv().unwrap()),
        Some(v) => Err(exceptions::PyTypeError::new_err(format!(
            "Output type {} not supported",
            v
        ))),
    }
}

#[pymodule]
fn queryer_py(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(query, m)?)?;
    m.add_function(wrap_pyfunction!(example_sql, m)?)?;
    Ok(())
}

Using the #[pymodule] macro, we provide the Python module’s entry function, which is responsible for registering the module’s classes and functions. Functions are registered with m.add_function, after which you can call them from Python like this:

import queryer_py
queryer_py.query("select * from file:///test.csv")

But back then, the interface we wanted to expose was very simple: the user passes in a SQL string and a string naming the output type, and gets back a string holding the query result in that output format. So we provided just two functions in the Python module: example_sql and query.

However, what we need to do today is far more complex than our use of PyO3 in Lecture 6. For example, we need to pass data structures between the two languages and let Python classes call Rust methods, so we need to register some classes and their methods.

Look at the snapshot of the code used above (copied here):

from xunmi import *

indexer = Indexer("./fixtures/config.yml")
updater = indexer.get_updater()
f = open("./fixtures/wiki_00.xml")
data = f.read()
f.close()
input_config = InputConfig("xml", [("$value", "content")], [("id", ("string", "number"))])
updater.update(data, input_config)
updater.commit()

result = indexer.search("历史", ["title", "content"], 5, 0)

You will notice that we need to register Indexer, IndexUpdater, and InputConfig as three classes; they all have their own member functions, and Indexer and InputConfig also must have constructors.

But because xunmi is a crate imported from outside xunmi-py, we can’t directly modify xunmi’s data structures to register them as classes. What should we do? We need to wrap them:

use pyo3::{exceptions, prelude::*};
use xunmi::{self as x};

#[pyclass]
pub struct Indexer(x::Indexer);

#[pyclass]
pub struct InputConfig(x::InputConfig);

#[pyclass]
pub struct IndexUpdater(x::IndexUpdater);

There’s a small trick here: we alias the xunmi crate as x, so its own structures are referred to as x::..., which avoids name conflicts with our wrapper types.

With these three definitions, we can introduce them into the module through m.add_class:

#[pymodule]
fn xunmi(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Indexer>()?;
    m.add_class::<InputConfig>()?;
    m.add_class::<IndexUpdater>()?;
    Ok(())
}

Note that the entry function’s name must match the lib name; if you do not define a lib name, the crate name is used by default. Because our crate is named “xunmi-py”, we declared the lib name separately in Cargo.toml earlier:

[lib]
name = "xunmi"
crate-type = ["cdylib"]

Exposing struct methods as class methods #

After registering the Python classes, we continue to write the implementation, which basically is shim code, just exposing the methods of the corresponding data structures in xunmi to Python. Let’s look at a simple one, the implementation of IndexUpdater:

#[pymethods]
impl IndexUpdater {
    // Add documents directly
    pub fn add(&mut self, input: &str, config: &InputConfig) -> PyResult<()> {
        self.0.add(input, &config.0).map_err(to_pyerr)
    }

    // Delete the existing doc, then add the new one
    pub fn update(&mut self, input: &str, config: &InputConfig) -> PyResult<()> {
        self.0.update(input, &config.0).map_err(to_pyerr)
    }

    // Commit everything added or updated so far
    pub fn commit(&mut self) -> PyResult<()> {
        self.0.commit().map_err(to_pyerr)
    }

    pub fn clear(&self) -> PyResult<()> {
        self.0.clear().map_err(to_pyerr)
    }
}

First, wrap impl IndexUpdater {} with #[pymethods] so that all the pub methods inside can be used from Python. We expose the add/update/commit/clear methods. Method signatures can be written as usual: PyO3 maps Rust’s basic types to their Python counterparts, and InputConfig can be used as a parameter type because we have already registered it as a Python class.

So, through these methods, a Python user can easily generate strings on the Python side, create the InputConfig class, and then pass it to the update() function to be processed by the Rust side. For example, like this:

f = open("./fixtures/wiki_00.xml")
data = f.read()
f.close()
input_config = InputConfig("xml", [("$value", "content")], [("id", ("string", "number"))])
updater.update(data, input_config)
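To make the two list arguments of InputConfig more concrete, here is a pure-Python sketch of the intent described in the comments of the Rust example: the first list renames fields ("$value" becomes "content"), and the second converts a field’s type ("id" from string to number). The function apply_input_config and the sample record are invented for illustration; this models the idea only and is not how xunmi implements it.

```python
def apply_input_config(record, mapping, conversion):
    """Rename fields per `mapping`, then convert types per `conversion`.

    Only the ("string", "number") conversion is modelled here.
    """
    renames = dict(mapping)
    # Rename keys that appear in the mapping; keep all others unchanged.
    out = {renames.get(key, key): value for key, value in record.items()}
    for field, (src, dst) in conversion:
        if field in out and (src, dst) == ("string", "number"):
            out[field] = int(out[field])
    return out


# A made-up record in the spirit of one parsed wikipedia XML node.
record = {"id": "12", "url": "https://example.org/12", "$value": "some text"}
converted = apply_input_config(
    record,
    mapping=[("$value", "content")],
    conversion=[("id", ("string", "number"))],
)
print(converted)
```

After the conversion, the record has an integer id and a content field, which is the shape the index schema shown earlier expects.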

Error Handling #

Remember the three key points emphasized in the last lecture? When writing FFI, pay attention to Rust’s error handling. Here, every function that returns a Result must return a PyResult, so you need to convert your original error type into a Python error.

We can handle it with map_err, where to_pyerr is implemented as follows:

pub(crate) fn to_pyerr<E: ToString>(err: E) -> PyErr {
    exceptions::PyValueError::new_err(err.to_string())
}

By using PyO3’s PyValueError, an error produced on the Rust side is converted by PyO3 into a ValueError exception on the Python side. For example, when we pass a nonexistent config file while creating an indexer:

In [3]: indexer = Indexer("./fixtures/config.ymla")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-bde6b0e501ea> in <module>
----> 1 indexer = Indexer("./fixtures/config.ymla")

ValueError: No such file or directory (os error 2)

Even if you used panic! on the Rust side, PyO3 handles it well:

In [3]: indexer = Indexer("./fixtures/config.ymla")
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-11-082d933e67e2> in <module>
----> 1 indexer = Indexer("./fixtures/config.ymla")
      2 updater = indexer.get_updater()

PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

It also throws an exception on the Python side.
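The two failures look alike in a traceback, but Python code has to catch them differently. As far as I know, PyO3’s PanicException derives from BaseException rather than Exception, so an ordinary except Exception clause will not catch a Rust panic. The sketch below models this with stand-ins so that it runs without the compiled module: open_index, describe_open, KNOWN_CONFIGS, and the local PanicException class are all hypothetical.

```python
KNOWN_CONFIGS = {"./fixtures/config.yml"}  # stand-in for files that exist


class PanicException(BaseException):
    """Stand-in modelling PyO3's PanicException (a BaseException subclass)."""


def open_index(path, rust_side_unwraps=False):
    """Hypothetical stand-in for Indexer(path), failing like the tracebacks above."""
    if path in KNOWN_CONFIGS:
        return "indexer"
    if rust_side_unwraps:
        raise PanicException("called `Result::unwrap()` on an `Err` value")
    raise ValueError("No such file or directory (os error 2)")


def describe_open(path, rust_side_unwraps=False):
    """Report how opening `path` went, separating errors from panics."""
    try:
        open_index(path, rust_side_unwraps)
    except ValueError as err:
        return f"bad input: {err}"
    except Exception as err:
        # A Rust panic never lands here: PanicException is not an Exception.
        return f"other error: {err}"
    except BaseException as err:
        # Demonstration only; real code should rarely catch BaseException.
        return f"rust panic: {err}"
    return "ok"


print(describe_open("./fixtures/config.yml"))
print(describe_open("./fixtures/config.ymla"))
print(describe_open("./fixtures/config.ymla", rust_side_unwraps=True))
```

This is one more reason to prefer map_err(to_pyerr) over unwrap() in shim code: a ValueError is something callers can reasonably handle, while a panic is meant to propagate.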

Constructors #

Okay, let’s see how to implement the Indexer:

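use std::{fs, str::FromStr};

#[pymethods]
impl Indexer {
    // Open or create an index
    #[new]
    pub fn open_or_create(filename: &str) -> PyResult<Indexer> {
        // Report a missing or unreadable file as a Python exception instead of panicking
        let content = fs::read_to_string(filename).map_err(to_pyerr)?;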
        let config = x::IndexConfig::from_str(&content).map_err(to_pyerr)?;
        let indexer = x::Indexer::open_or_create(config).map_err(to_pyerr)?;
        Ok(Indexer(indexer))
    }

    // Get updater
    pub fn get_updater(&self) -> IndexUpdater {
        IndexUpdater(self.0.get_updater())
    }

    // Search
    pub fn search(
        &self,
        query: String,
        fields: Vec<String>,
        limit: usize,
        offset: Option<usize>,
    ) -> PyResult<Vec<(f32, String)>> {
        let default_fields: Vec<_> = fields.iter().map(|s| s.as_str()).collect();
        let data: Vec<_> = self
            .0
            .search(&query, &default_fields, limit, offset.unwrap_or(0))
            .map_err(to_pyerr)?
            .into_iter()
            .map(|(score, doc)| (score, serde_json::to_string(&doc).unwrap()))
            .collect();

        Ok(data)
    }

    // Reload index
    pub fn reload(&self) -> PyResult<()> {
        self.0.reload().map_err(to_pyerr)
    }
}
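One consequence of the search signature above is worth spelling out: each hit comes back to Python as a (score, JSON string) pair, because the Rust side serializes every document with serde_json::to_string. A caller will therefore usually decode the strings. The sketch below shows one way to do that; decode_hits is a hypothetical helper, and the sample data is made up, since the real fields depend on the index schema.

```python
import json


def decode_hits(hits):
    """Turn [(score, json_string), ...] into [(score, dict), ...]."""
    return [(score, json.loads(doc)) for score, doc in hits]


# Made-up data in the shape that search returns; field names are illustrative.
sample_hits = [
    (1.5, '{"id": 1, "title": "first"}'),
    (0.7, '{"id": 2, "title": "second"}'),
]

for score, doc in decode_hits(sample_hits):
    print(score, doc["title"])
```

Returning JSON strings keeps the shim simple; the cost is that each caller pays for one json.loads per hit.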

As you can see, #[new] marks a method as the constructor, so when you call this on the Python side:

indexer = Indexer("./fixtures/config.yml")

In fact, it is calling the open_or_create method on the Rust side. Marking a method used to