
38 MySQL Logs and Data Storage Systems #

Hello, I’m Jingxiao. In today’s class, we will talk about logging and data storage systems.

In internet companies, the logging system is a vital piece of underlying infrastructure. For every important interaction, key information is recorded and archived for offline analysis or real-time online analysis. This data can even be called the lifeblood of the Silicon Valley internet giants.

With it, you can build machine learning models to predict user behavior, construct accurate user profiles, and then use recommendation systems and classifiers to retain users and push targeted ads for profit.

In quantitative trading, logs also play a very important role. As mentioned earlier, our important data includes market data, strategy signals, execution status, and position information, among many others.

For simple, small-scale data, such as order book information, we can store the data in txt or csv files, which is simple and efficient. The disadvantage is that as the data volume grows, a single file becomes very large and hard to search. At that point, a very intuitive approach emerges: we can store each day’s data in a separate file, which alleviates the problem for a while, as the sketch below shows.
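
Here is a minimal sketch of that file-per-day idea; the file name pattern and the two fields are illustrative choices, not a fixed format:

import csv
from datetime import date

# Append one tick to today's file; a new file starts each day simply
# because the date in the file name changes. Fields are illustrative.
def append_tick(timestamp, price):
    filename = f"orderbook_{date.today().isoformat()}.csv"
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerow([timestamp, price])

append_tick("2019-07-14 14:12:17", 11234.56)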

However, as the data volume increases, or when your algorithm gradually enters the high-frequency trading field, simply storing data in files is no longer sufficient to meet new requirements and is unable to cope with the needs of distributed quantitative trading systems. Therefore, an obvious idea is to store the logs in a database system.

In this class, we will use MySQL, a traditional relational database, as an example to explain the application of databases in logs.

Understanding MySQL Quickly #

To begin with, let me give a brief introduction to the MySQL database, since some of you may not yet have a basic understanding of databases.

MySQL is a typical Relational Database Management System (RDBMS). In the context of databases, a relational database refers to a database that is built on the basis of the relational model, which utilizes mathematical concepts and methods such as set algebra to process data in the database. Almost any learning material will tell you that it has the following characteristics:

  1. Data is presented in the form of tables.
  2. Each row represents a record with various field names.
  3. Each column corresponds to the data domain of a field name.
  4. Multiple rows and columns make up a table.
  5. Multiple tables make up a database.

However, leaving aside these abstract characteristics, the concepts you need to grasp first are as follows:

  • A database is a collection of associated tables, and a data table is a matrix of data. In a database, a data table looks like a simple spreadsheet.
  • In a data table, each column contains data of the same type, and each row is a set of related data.
  • The primary key is a column in the data table. Each element in this column is unique, and a data table can only contain one primary key. Foreign keys, on the other hand, are used to associate two tables.

In addition to these, you also need to understand indexes. An index is a structure that sorts the values in one or more columns of a database table. By using indexes, we can quickly access specific information in a database table. Generally, you can set indexes for many columns, which greatly speeds up the process of retrieving a specified column. However, the trade-off is that inserting data will become slower.
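
For example, once we have the price table used later in this lesson, adding an index on a column takes a single SQL statement (the index name and column choice here are illustrative):

CREATE INDEX idx_btcusd ON price (BTCUSD);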

As for operating MySQL, the Structured Query Language (SQL) is commonly used. SQL is a typical domain-specific language (DSL), and I won’t go into much detail here; if you’re interested, you can learn more from the “SQL Must-Know” column on the Geek Time platform.

Next, let’s take a brief look at how to use Python to operate your MySQL database.

There are many ways to connect Python to a database; here I will briefly introduce two of them. Let’s take Ubuntu as an example and assume that you have already installed MySQL Server on your system. (You can refer to this article for installing MySQL: https://www.jianshu.com/p/3111290b87f4, or search for instructions yourself.)

mysqlclient #

In fact, the best-known Python driver for connecting to MySQL is MySQL-python, also known as MySQLdb, and many frameworks were developed on top of this library. Unfortunately, it only supports Python 2.x and has many installation prerequisites. Since it is implemented in C, it is very unpleasant to install on Windows and often fails. Therefore, we basically no longer recommend it. Its successor is a fork: mysqlclient.

mysqlclient is fully compatible with MySQLdb while also supporting Python 3.x, and it is the MySQL driver that Django’s ORM depends on. If you want to operate the database with raw SQL, this is the framework I recommend first.

The installation is very simple:

sudo apt-get install python3-dev  # on some systems you may also need default-libmysqlclient-dev
pip install mysqlclient

Let’s take a look at a sample code:

import MySQLdb


def test_pymysql():
    # Connect to the local MySQL server; fill in your own credentials.
    conn = MySQLdb.connect(
        host='localhost',
        port=3306,
        user='your_username',
        passwd='your_password',
        db='mysql'
    )

    cur = conn.cursor()
    # Create the price table, with the timestamp as primary key.
    cur.execute('''
        CREATE TABLE price (
            timestamp TIMESTAMP NOT NULL,
            BTCUSD FLOAT(8,2),
            PRIMARY KEY (timestamp)
        );
    ''')
    # Insert one order book record.
    cur.execute('''
        INSERT INTO price VALUES (
            "2019-07-14 14:12:17",
            11234.56
        );
    ''')

    # Commit the changes and close the connection.
    conn.commit()
    conn.close()


test_pymysql()

The logic of the code is clear and straightforward. First, it establishes a connection to the database via the connect() call; then it creates a cursor with conn.cursor(). You might wonder: why use a cursor at all?

One main reason is that it turns set-oriented results into record-by-record processing. When you retrieve data from the database with SQL, the result is usually held in a memory area and often consists of a collection of many records. The cursor mechanism lets you step through these records in MySQL one by one, so you can display and process each record however you prefer.
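
As a minimal sketch of that record-by-record style, assuming the conn object and the price table from the example above:

# Step through a result set one record at a time with fetchone(),
# assuming the `conn` object and `price` table from above.
cur = conn.cursor()
cur.execute('SELECT * FROM price;')
row = cur.fetchone()
while row is not None:
    print(row)  # handle a single record here
    row = cur.fetchone()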

Continuing with the code, we create a price table and insert one order book record into it. Here, for simplicity and to highlight the key points, I kept only the timestamp and the price.

Finally, we use conn.commit() to commit the changes and then close() the connection.

Peewee #

However, people gradually found writing raw SQL commands cumbersome: you have to construct specific insert and query statements for each piece of business logic, which abandons object-oriented thinking entirely. As a result, many wrapper packages and ORM frameworks were born.

The ORM (Object Relational Mapping) mentioned here is a mapping relationship between Python objects and database tables. With an ORM, we no longer need to write SQL statements and can directly use Python data structures.

The advantages of ORM frameworks are that they improve coding speed and support multiple database systems such as SQLite, MySQL, PostgreSQL, and others. The cost paid is perhaps some performance loss.

Peewee, which I am going to talk about next, is one such Python-based ORM framework. It has a gentle learning curve and is one of the most popular ORM frameworks in Python.

Its installation is also very simple:

pip install peewee


Let's take a look at a sample code:

import peewee
from peewee import *

# Connect to the database; fill in your own credentials.
db = MySQLDatabase('mysql', user='your_username', passwd='your_password')


class Price(peewee.Model):
    # Map each column of the MySQL table to a typed field.
    timestamp = peewee.DateTimeField(primary_key=True)
    BTCUSD = peewee.FloatField()

    class Meta:
        database = db


def test_peewee():
    # Create the table and save one record.
    Price.create_table()
    price = Price(timestamp='2019-06-07 13:17:18', BTCUSD='12345.67')
    price.save()


test_peewee()


If you have experience with Django, you will find that this syntax is almost identical to Django’s ORM. We map a MySQL table to a Python class, and once we define the format of each column, we can operate on the table in a Pythonic way.

Clearly, Peewee’s biggest advantage is that it turns raw SQL into strongly typed Python code. This not only greatly improves readability but also effectively reduces the probability of bugs.
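
To make this concrete, here is a sketch of a typed query with Peewee, assuming the Price model defined above; the one-hour window is an illustrative choice that mirrors the raw-SQL query we will write later:

import datetime

# A typed Peewee query, assuming the Price model defined above.
def recent_prices():
    one_hour_ago = datetime.datetime.now() - datetime.timedelta(hours=1)
    query = (Price
             .select()
             .where(Price.timestamp > one_hour_ago)
             .order_by(Price.timestamp.desc()))
    return [row.BTCUSD for row in query]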

In reality, however, the data you deal with as a data scientist or quant is far more complex than this. The internet industry is full of dirty data, and the signal-to-noise ratio in the financial industry is extremely low, so data processing is only a basic skill.

If you are interested in data analysis and have ambitions in this field, you should first solidify your foundation in mathematics and statistics during your student years, and then quickly grasp data processing methods through internships and work experience. Of course, if you have missed your student years, starting now is also a good choice. After all, gradually forming your own core competitiveness is the right path for each of us.

Quantitative Data Analysis System #

With the quantitative data stored in the database, we can now start some quantitative analysis. This is also a vast academic field called time series analysis. However, for the topic of today’s lesson, we will only provide a very simple example, that is, finding the highest and lowest prices of BTC/USD in the past hour.

Let’s take a look at the following code:

import MySQLdb
import numpy as np


def test_pymysql():
    conn = MySQLdb.connect(
        host='localhost',
        port=3306,
        user='your_username',
        passwd='your_password',
        db='mysql'
    )

    cur = conn.cursor()
    cur.execute('''
            SELECT
              BTCUSD
            FROM
              price
            WHERE
              timestamp > now() - interval 60 minute
    ''')

    # Collect the returned prices into a numpy array and report extremes.
    BTCUSD = np.array(cur.fetchall())
    print(BTCUSD.max(), BTCUSD.min())

    conn.close()


test_pymysql()

The code looks simple, right? Obviously, through the SQL statement, we can fetch a time series segment of the past hour, get the BTC/USD price vector we want, and then process it using numpy. However, one thing to note here is that we don’t need to call conn.commit(), because our operation is read-only and has no impact on the database.

Distributed Log System #

Now that we understand the content above, let’s take a look at distributed log systems.

For quantitative trading, the main modules we need are the data system, strategy system, trade execution system, offline model training, online risk control system, and real-time monitoring system. I have drawn a diagram to show the corresponding relationships between these modules, which you can refer to for understanding.

Each subsystem here runs independently, and there are also many modules that need to be iteratively updated. Therefore, simply saving logs locally is obviously not a wise move. So we can dedicate a server to run the MySQL server and open specified ports to interact with other systems.

In addition, the collection system in the diagram is similar to the message queue system we discussed in the previous section. Proxy tools run in the various upstream systems, collect logs from the different modules, and send them to the collection system, which organizes the information and stores it in the log system. Of course, besides simple message queues, there are many tools we can use, such as Aliyun’s Logtail and Apache Flume.

As the system grows, the log system needs to pay more and more attention to storage efficiency and analysis efficiency. With increasing usage, the data keeps accumulating, so we can consider compressing and archiving part of it. Older, coarser-grained data is queried less and less often, which makes it the prime target for compression and archiving.
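
As a minimal sketch of that archiving idea, assuming an open conn as before and the price table from earlier; the price_archive table and the 90-day cutoff are illustrative choices:

cur = conn.cursor()
# Copy stale rows into the archive table...
cur.execute('''
    INSERT INTO price_archive
    SELECT * FROM price
    WHERE timestamp < now() - interval 90 day;
''')
# ...then remove them from the hot table.
cur.execute('''
    DELETE FROM price
    WHERE timestamp < now() - interval 90 day;
''')
conn.commit()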

Log Analysis #

Finally, let me talk about log analysis. As mentioned earlier, analysis is generally divided into two types: offline analysis and online analysis.

In offline analysis, generating reports is quite common.

For example, summarizing the profit and loss (PnL), maximum drawdown, Sharpe ratio, and other statistics for a given day, month, or quarter. Such time-window-based statistics are conveniently supported by relational databases.
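
For instance, a daily high/low report over the price table from earlier boils down to one grouped query (a sketch; real PnL or Sharpe reports would need richer tables, and conn is an open connection as before):

cur = conn.cursor()
# Aggregate the highest and lowest price per calendar day.
cur.execute('''
    SELECT
      DATE(timestamp) AS day,
      MAX(BTCUSD) AS high,
      MIN(BTCUSD) AS low
    FROM price
    GROUP BY DATE(timestamp)
    ORDER BY day;
''')
for day, high, low in cur.fetchall():
    print(day, high, low)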

Another common offline use is in backtesting systems. In each new strategy development cycle, we need to backtest against historical data to obtain figures such as the historical return on investment. Backtesting is crucial for evaluating a new strategy; however, it often consumes a large amount of resources. Choosing a good database and storage layout, and optimizing data access and computation, therefore become extremely important.

Online analysis is more commonly used in risk control and alert systems, where the demands on data freshness are higher. One approach is to read the newest data directly from the message queue, on the premise that the time window is small enough that the risk control system does not need to maintain a large amount of local data.
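
A minimal sketch of that approach: keep only a small rolling window of the newest prices in memory and check it on every message. The on_tick() hook, the window size, and the 5% threshold are all hypothetical; the function would be called by whatever message-queue consumer you use:

import collections

# Keep roughly the last 600 ticks in memory; both the window size
# and the alert threshold below are illustrative, not recommendations.
window = collections.deque(maxlen=600)

def on_tick(price, alert_threshold=0.05):
    # Called by your message-queue consumer for every new price.
    window.append(price)
    lo, hi = min(window), max(window)
    if lo > 0 and (hi - lo) / lo > alert_threshold:
        print('ALERT: price moved more than 5% within the window')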

As for real-time alerts, data is still the most critical factor.

  • For example, the data system stops abnormally, or a monitored table is no longer being updated;
  • Or the trading system’s connection fails, or some order stays in a particular status beyond a threshold;
  • Or there are significant, unexpected changes in position information.

These situations require an alarm, which is what Silicon Valley companies call “oncall”. When an accident occurs, the responsible person will quickly receive a call, text message, and email. They will then use monitoring platforms to confirm whether it is a real accident or a false alarm from the monitoring system.

Of course, many of these needs can now be met with hosted tools in the cloud; AWS, for instance, is a world-leading cloud computing platform. If your servers are located in the United States, you can consider the various cloud services AWS provides. For a small quant trading team, the advantage is that it can avoid building a complex log system itself and instead focus its efforts on strategy development and iteration, which greatly improves efficiency.

Summary #

In this lesson, I introduced the storage side of quantitative trading systems from an engineering perspective. We started with the basic usage of MySQL and then moved on to the framework of a quantitative trading system. Databases and data are core components across most of the internet industry, and they are vital raw material for quantitative practitioners. Building a balanced, reliable data system requires continuous effort from a quant team and iterative improvement based on its specific needs.

Thought Questions #

Lastly, I’ll leave you with a thought question. Quantitative trading may not involve a huge amount of data, but that data can be queried very frequently, as in backtesting systems. Can you think of any optimization methods to reduce the cost of these calls? Feel free to leave a comment and discuss with me, and also feel free to share this article.