14 Storage Optimization Part3 Use and Optimization of the Sqlite Database

14 Storage Optimization Part3 Use and Optimization of the SQLite Database #

Let’s review the usage scenarios of the storage methods we discussed earlier. For small amounts of key-value data, we can directly use SharedPreferences. For slightly more complex data types, we can serialize them into JSON or Protocol Buffers for storage, and accessing or modifying the data is also straightforward in development.

However, can these methods cover all storage scenarios? Their performance is acceptable for data in the range of a few hundred to a few thousand entries, but what if we are dealing with tens of thousands of WeChat chat records? And how do we quickly perform CRUD operations on data for a few specific contacts?

For storage scenarios involving large amounts of data, we need to consider stability, performance, and scalability. This is where today’s “star” - the database - comes into play. Storage optimization cannot avoid the topic of databases, and databases are indeed a very extensive subject. I also know that many students have gone through the process of learning databases from scratch and then giving up. Considering that most of us are engaged in mobile development work, today I will talk about the use and optimization of the SQLite database for mobile platforms.

Things about SQLite #

Although there are many databases on the market, there are not many suitable for mobile use due to their size and storage space limitations. Of course, the most widely used is our protagonist today, SQLite, but there are also some other good choices, such as Realm by start-up teams and LevelDB by Google.

Among the many mobile teams in China, WeChat’s research on SQLite can be considered the most in-depth. This is actually driven by business demands, as user chat records are only saved locally. Once data corruption or loss occurs, it is irretrievable for the user. On the other hand, WeChat has a large number of heavy users, who have thousands of contacts and thousands of group chats. We once conducted a statistics and found that the database size of millions of users exceeds 1GB. For this group of users, ensuring that they can use WeChat normally is a huge challenge.

Therefore, WeChat launched a special project to optimize for heavy users. At first, we focused on optimizing the use of SQLite, such as table structure, indexes, etc. But it was quickly discovered that due to differences in system versions, there were also some compatibility issues with SQLite, and considering the demand for encryption, we decided to introduce our own version of SQLite.

“With the source code in hand, I have the world.” This opened up a “point of no return” for researching databases. At that time, we devoted several people to in-depth study of the SQLite source code, from SQLite’s PRAGMA compile options, cursor implementation optimization, to SQLite source code optimization, and finally created a complete monitoring system from the laboratory to the online environment.

In 2017, we open-sourced the SQLite database we used internally, called WCDB. Let me say a few words here. To see if an open-source project is reliable, it depends on how important the project is to the product itself. WeChat open-source insists on using the same version internally and externally. Although I have left the WeChat team now, I still welcome students who need it to use WCDB.

Before we start learning, I want to remind you that it is also difficult to explain all the details of SQLite optimization with just one or two articles. Today’s content is a selection of some important knowledge points, and I have prepared a lot of reference materials for you. When you encounter unfamiliar or incomprehensible areas, you need to study repeatedly with the help of reference materials.

1. ORM (Object Relational Mapping)

To be honest, many senior development engineers at BAT may not fully understand SQLite’s internal mechanism and cannot write efficient SQL statements. In order to improve development efficiency, most applications will introduce ORM frameworks. ORM (Object Relational Mapping) is a way to associate database tables with objects using object-oriented concepts, allowing us to not have to worry about the low-level implementation of the database.

The most common ORM frameworks in Android are the open-source greenDAO and Google’s official Room. What problems does using an ORM framework bring?

Using an ORM framework is really simple, but simplicity comes at the cost of sacrificing some execution efficiency, and the specific overhead is closely related to how well the ORM framework is written. But an even bigger problem is that it solidifies the way of thinking for many developers, and in the end they may not even know how to write a simple SQL statement.

Should our application introduce an ORM framework then? Perhaps programmers naturally pursue laziness, and in order to improve development efficiency, it makes sense to introduce an ORM framework. However, this cannot be a reason for us to not learn the basics of databases. Only by understanding some underlying mechanisms can we be more capable of solving difficult problems.

Considering better interaction with Android Jetpack components, WCDB chose Room as the ORM framework.

2. Process and Thread Concurrency

If we use SQLite in our project, then the following SQLiteDatabaseLockedException is a problem that often occurs:

android.database.sqlite.SQLiteDatabaseLockedException: database is locked
  at android.database.sqlite.SQLiteDatabase.dbopen
  at android.database.sqlite.SQLiteDatabase.openDatabase
  at android.database.sqlite.SQLiteDatabase.openDatabase

SQLiteDatabaseLockedException is ultimately caused by concurrency issues, and SQLite has concurrency in two dimensions: multi-process concurrency and multi-thread concurrency. Let’s talk about the key points of each.

Multi-process concurrency

SQLite is by default capable of supporting multi-process concurrent operations, using file locks to control multi-process concurrency. The lock granularity of SQLite is not very fine-grained, as it is based on the entire DB file. There are 5 states internally, and you can refer to the following articles for more details:

In short, multiple processes can acquire SHARED locks to read data concurrently, but only one process can acquire an EXCLUSIVE lock to write to the database. For iOS, there may not be a scenario of multiple process accessing the database, so the default value of locking_mode can be changed to EXCLUSIVE.

PRAGMA locking_mode = EXCLUSIVE

In EXCLUSIVE mode, the database connection will not release the SQLite file lock until it is disconnected, thus avoiding unnecessary conflicts and improving the speed of database access.

Multi-threaded Concurrency

Compared to multiple processes, multi-threaded database access may be more common. SQLite supports multi-threaded concurrency mode, which requires the following configuration to be enabled. However, the system SQLite will enable multi-threading by default (Multi-thread Mode).

PRAGMA SQLITE_THREADSAFE = 2

Similar to the lock mechanism for multiple processes, the lock granularity in SQLite is at the database file level and does not support table-level or even row-level locks. It is also worth mentioning that only one thread can operate on the same handle at a time. In this case, we need to open a connection pool.

When using WCDB, the connection pool size can be specified during initialization. In WeChat, we set the size to 4.

public static SQLiteDatabase openDatabase (String path, 
                        SQLiteDatabase.CursorFactory factory, 
                        int flags, 
                        DatabaseErrorHandler errorHandler, 
                        int poolSize)

Similar to multiple processes, multiple threads can read database data concurrently, but writing to the database is still mutually exclusive. SQLite provides a Busy Retry solution, which triggers a Busy Handler when blocking occurs. The thread can sleep for a period of time and then retry the operation. You can refer to the article 《WeChat iOS SQLite Source Code Optimization Practice》 for more information.

To further improve concurrency performance, we can also enable the Write-Ahead Logging (WAL) mode. In WAL mode, modified data is written to a separate WAL file and it also introduces a WAL log file lock. With this mode, reading and writing can be executed entirely in parallel without blocking each other.

PRAGMA schema.journal_mode = WAL

However, it is important to note that writing is still not allowed to occur concurrently. If multiple write operations occur concurrently, there is still a possibility of encountering SQLiteDatabaseLockedException. In this case, the application can catch this exception and wait for a period of time before retrying the operation.

} catch (SQLiteDatabaseLockedException e) {
        if (sqliteLockedExceptionTimes < (tryTimes - 1)) {
            try {
Thread.sleep(100);
} catch (InterruptedException e1) {
}
}
sqliteLockedExceptionTimes++;
}

**Overall, by using connection pooling and WAL mode, we can significantly increase the concurrency of SQLite's read and write operations, greatly reducing waiting time caused by concurrency. I recommend everyone to try enabling these features in their applications.**

**3. Query Optimization**

When it comes to database query optimization, the first thing that comes to mind is indexing. So let me start by discussing SQLite's indexing optimization.

**Index Optimization**

Proper use of indexing can greatly reduce query time in most scenarios, and WeChat's database optimization also starts with indexing. The example below demonstrates a very simple use case of an index, where we first find the rowid of the data in the index table, and then query the original data table directly using the rowid.

![Index Example](../images/4db47f0b379f4bd98d0c895a00c730b0.jpg)

There are many articles online that discuss the principles of SQLite indexing. Here are some recommended resources:

* [Principles of SQLite Index](https://www.cnblogs.com/huahuahu/p/sqlite-suo-yin-de-yuan-li-ji-ying-yong.html) (Chinese)

* Official Documentation: [Query Planning](https://www.sqlite.org/queryplanner.html#searching)

* [Data Structures and Algorithm Principles behind MySQL Indexing](http://blog.codinglabs.org/articles/theory-of-mysql-index.html)

The key here is to properly establish indexes. Many times, we may think that we have created indexes, but they may not be effective in reality. For example, when using operators such as BETWEEN, LIKE, OR, or when using expressions or case when statements. More detailed rules can be found in the official documentation: [The SQLite Query Optimizer Overview](http://www.sqlite.org/optoverview.html). The example below shows how to optimize a query to make use of an index:

```sql
BETWEEN: the myfield index is not effective
SELECT * FROM mytable WHERE myfield BETWEEN 10 and 20;
Converted to: the myfield index is effective
SELECT * FROM mytable WHERE myfield >= 10 AND myfield <= 20;

Creating an index has its costs since maintaining the index table involves updates. For example, it is not necessary to create an index for a very small table. If a table frequently undergoes insert or update operations, it is also necessary to refrain from creating indexes. In general, there are a few principles to keep in mind:

  • Establish correct indexes. Here, it is not only necessary to ensure that indexes are effective in queries, but also to choose the most efficient indexes. If a table has too many indexes, SQLite may not choose the best one for execution during queries.

  • Selecting between single-column indexes, multi-column indexes, and composite indexes. Indexes should be considered together with different queries and sorting statements in the data table. If the query result set is large, it is better to use a composite index to directly return the query result in the index table.

  • Selecting index fields. Index efficiency for integer types is significantly higher than for string indexes. Additionally, SQLite automatically creates indexes for primary keys, so try to avoid using complex fields as primary keys.

In summary, index optimization is the simplest and most effective method of SQLite optimization. However, it is not as simple as creating an index - sometimes we need to further adjust the query statement or even the table structure to achieve the best result.

Page Size and Cache Size

In an I/O file system, I mentioned that a database is like a small file system, and internally, it also has the concepts of pages and cache.

For SQLite’s DB file, a page is the smallest storage unit. Each table’s data is stored in the entire DB file as pages. Different pages of the same table are organized as a B-tree index. Each table is represented as a B-tree.

SQLite B-tree

Similar to the page cache in a file system, SQLite caches the read pages to speed up subsequent reads. The default page size is 1024 bytes, and the default cache size is 1000 pages. More compile-time parameters can be found in the official documentation: PRAGMA Statements.

PRAGMA page_size = 1024;
PRAGMA cache_size = 1000;

Each page can only hold the data of one table or a group of indexes, meaning that multiple tables or indexes cannot coexist in the same page. The root page of a B-tree is the first page of a table in the entire DB file. Using the example above, if we want to query the data with rowid N+2, we first need to find the location of the table’s root page from sqlite_master, and then read the root page and page 4. Therefore, a total of three I/O operations are required.

SQLite Page Reading

From the table above, we can see that increasing the page size cannot continuously improve performance and may even have side effects after a certain point. We can use PRAGMA to change the default page size, or set it during the creation of the DB file. However, keep in mind that if old data exists, vacuum needs to be called to recalculate the allocation size for table data nodes.

In WeChat’s internal testing, using a 4KB page size improved performance by 5% to 10%. However, considering the migration cost of historical data, 1024 bytes was ultimately chosen. Therefore, I recommend selecting 4KB as the default page size when creating a new database to achieve better performance.

Other Optimizations

There are many other optimizations that can be done with SQLite. Here are a few key points:

  • Be cautious with “SELECT *”. Only select the columns that are necessary.

  • Use transactions correctly.

  • Pre-compile and bind parameters to cache compiled SQL statements.

  • For BLOB or large TEXT columns that may exceed a page size, leading to oversized pages, it is recommended to split these columns into separate tables or place them at the end of table fields.

  • Regularly organize or clean up useless or removable data. For example, the WeChat Moments database deletes data that is relatively old, and if a user accesses this data, it can be re-fetched from the network.

In our daily development, we should have a good understanding of these concepts and then review the SQLite optimization methods we have learned. By introducing ORM, we can greatly improve development efficiency. By using WAL mode and connection pooling, we can improve SQLite’s concurrency performance. By establishing indexes correctly, we can improve SQLite query speed. By adjusting the default page size and cache size, we can improve overall SQLite performance.

Other Features of SQLite #

In addition to the optimization experience of SQLite, I have accumulated a lot of experience in my work at WeChat, and I have chosen a few important ones to share with you.

1. Corruption and Recovery

The corruption rate of SQLite in WeChat is around 1/20000 to 1/10000, which may seem low, but considering the scale of WeChat, this problem cannot be ignored. Especially if important chat records of some influential people are lost, our team will bear tremendous pressure.

Innovation is driven by anxiety, and technology is often forced out. Research on SQLite corruption and recovery can be said to be an area where WeChat has invested a lot. For information about SQLite database corruption and repair, as well as WeChat’s optimization achievements in this area, you can refer to the following resources:

2. Encryption and Security

Database security mainly involves two aspects: injection prevention and encryption. Injection prevention can be done through static security scans, while encryption is typically supported by SQLCipher.

Encryption and decryption in SQLite are performed on a page-by-page basis and are encrypted by default using the AES algorithm. The time required for encryption/decryption depends on the length of the chosen key. Below are the results from the WCDB Android Benchmark, please refer to the link for detailed information. In terms of conclusion, the impact on Create can be up to 10 times.

For usage of WCDB encryption and decryption, you can refer to WeChat Mobile Database Component WCDB (Part IV) — Android Features.

3. Full-text Search

WeChat’s full-text search is also a technology-driven project. At the beginning, the performance was not ideal and often faced criticism. After several versions of optimization iterations, the current effect is very good.

For information about full-text search, you can refer to these resources:

Regarding these features of SQLite, we need to consider them comprehensively based on our project situation. If the data stored in a certain database is not important, then we may not care about a one in ten thousand corruption rate. Similarly, whether to use database encryption should also depend on whether the stored data is sensitive or not.

Monitoring SQLite #

First of all, I want to say that it is important to use indexes correctly and use transactions properly. For large projects, there may be dozens or hundreds of developers involved, and their levels of expertise may vary. It is difficult to ensure that everyone can use SQLite correctly and efficiently. Therefore, it is necessary to establish a comprehensive monitoring system.

1. Local Testing

As a reliable developer, we should test every SQL statement locally before proceeding. We can use EXPLAIN QUERY PLAN to test the query plan of an SQL statement, such as whether it performs a full table scan or uses indexes, and which specific indexes are used.

sqlite> EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE a=1 AND b>2;
QUERY PLAN
|--SEARCH TABLE t1 USING INDEX i2 (a=? AND b>?)

For more information on using SQLite command line and EXPLAIN QUERY PLAN, please refer to Command Line Shell For SQLite and EXPLAIN QUERY PLAN.

2. Time Monitoring

Local testing relies too much on the developers’ awareness, so in many cases, we still need to establish monitoring for production data. Since WeChat integrates its own SQLite source code, it is very convenient to add the monitoring modules we want.

WCDB provides the monitoring module SQLiteTrace, which has the following three interfaces:

We can use these interfaces to monitor busy events, slow queries, and execution times. For SQL statements with long execution times, further inspection is needed to determine whether the statements are poorly written or require index creation.

3. Intelligent Monitoring

Monitoring query results is only the second stage of our monitoring evolution. At this stage, human intervention is still required for analysis, and experienced personnel should be responsible.

We hope that the analysis of SQL statements can be done intelligently, without any barriers. WeChat’s open-source Matrix provides a tool called Matrix SQLiteLint - Quality Inspection for SQLite Usage, which intelligently analyzes SQLite statements based on the syntax tree and combines our daily experience with database usage. It abstracts six major issues, such as inappropriate index usage and the use of select*.

Some may wonder why the WeChat team came up with this approach. In fact, this approach is quite common in MySQL. Meituan has also open-sourced their internal SQL optimization tool, SQLAdvisor. You can refer to these resources for more information:

Summary #

Database storage is a fundamental skill for developers, and having a clear understanding of SQLite’s underlying mechanisms can provide significant guidance in our work.

By mastering the concurrency mechanism of the SQLite database, we can make better decisions about whether to split data tables or databases. The advantage of creating a new database is that it can isolate other databases from concurrency or corruption situations, but the disadvantage is that it takes time to initialize the database and occupies more memory. Generally, separate businesses will use independent databases, such as dedicated download databases, moments databases, and chat databases. However, we should not have too many databases. We can have a common database to store relatively small data.

After understanding the principles and probabilities of SQLite database corruption, we can decide whether to introduce recovery mechanisms based on the importance of the data. I also mentioned how to implement database encryption and its impact on performance. We can decide whether to introduce encryption based on the sensitivity of the data.

Finally, I want to emphasize again that SQLite optimization is really a huge topic. You need to continue studying with reference materials after class to thoroughly understand today’s content.

Homework #

Do you use a database in your application, and if so, which database do you use? Do you use an ORM? Do you have any questions or experiences in the process of using a database? Feel free to leave a message to discuss with me and other classmates.

If your application also uses SQLite as storage, today’s homework is to try integrating WCDB and compare the performance with the system’s default SQLite. Try integrating Matrix SQLiteLint to see if there are any improper uses of SQLite.

In addition to the reference materials mentioned in today’s article, I have also prepared the following materials for students who want to advance further. Feel free to continue studying if you are interested.

Feel free to click “Share with Friends” and share today’s content with your friends, inviting them to learn together. Finally, don’t forget to submit today’s homework in the comments section. I have also prepared a generous “Study Cheer Package” for students who complete the homework seriously. Looking forward to progressing together with you.