26 Data Storage NoSQL vs RDBMS How to Complement Shortcomings with Advantages #

Today, let’s talk about common mistakes in data storage.

In recent years, various non-relational databases, also known as NoSQL, have developed rapidly and are very common in projects. However, there are often extreme cases of using NoSQL, such as directly replacing all relational databases (RDBMS) with NoSQL or using NoSQL incorrectly in unsuitable scenarios.

In fact, each NoSQL database has different characteristics and focuses on solving certain specific problems. Therefore, when using NoSQL, we should let it handle the scenarios it is good at; otherwise, we will not only fail to benefit from its features and advantages, but may also run into performance problems.

NoSQL databases can generally be divided into cache databases, time-series databases, full-text search databases, document databases, graph databases, and so on. Today, I will take the cache database Redis, the time-series database InfluxDB, and the full-text search database Elasticsearch as examples. Through some test cases, I will discuss the characteristics of these common NoSQL databases, along with their strengths and weaknesses. Finally, I will also talk about how NoSQL and RDBMS can complement each other to form a composite database system capable of handling high concurrency.

Complementing shortcomings with advantages: Redis vs MySQL #

Redis is a simple and efficient in-memory caching database, making it highly performant for single-key read and write operations.

Let’s perform a simple test by inserting 100,000 rows of data into Redis and MySQL respectively. In MySQL, the name field serves as an index, similar to the Key in Redis, and the data field contains 100 bytes of data, equivalent to the Value in Redis:

@SpringBootApplication
@Slf4j
public class CommonMistakesApplication {

    // populate 100,000 data into Redis and MySQL
    public static final int ROWS = 100000;
    public static final String PAYLOAD = IntStream.rangeClosed(1, 100).mapToObj(__ -> "a").collect(Collectors.joining(""));

    @Autowired
    private StringRedisTemplate stringRedisTemplate;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Autowired
    private StandardEnvironment standardEnvironment;

    public static void main(String[] args) {
        SpringApplication.run(CommonMistakesApplication.class, args);
    }

    @PostConstruct
    public void init() {
        // start the program with '-Dspring.profiles.active=init' to perform initialization
        if (Arrays.stream(standardEnvironment.getActiveProfiles()).anyMatch(s -> s.equalsIgnoreCase("init"))) {
            initRedis();
            initMySQL();
        }
    }

    // populate data into MySQL
    private void initMySQL() {
        // drop table
        jdbcTemplate.execute("DROP TABLE IF EXISTS `r`;");

        // create table with `name` field as the index
        jdbcTemplate.execute("CREATE TABLE `r` (\n" +
                "  `id` bigint(20) NOT NULL AUTO_INCREMENT,\n" +
                "  `data` varchar(2000) NOT NULL,\n" +
                "  `name` varchar(20) NOT NULL,\n" +
                "  PRIMARY KEY (`id`),\n" +
                "  KEY `name` (`name`) USING BTREE\n" +
                ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;");

        // batch insert data
        String sql = "INSERT INTO `r` (`data`,`name`) VALUES (?,?)";
        jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {

            @Override
            public void setValues(PreparedStatement preparedStatement, int i) throws SQLException {
                preparedStatement.setString(1, PAYLOAD);
                preparedStatement.setString(2, "item" + i);
            }

            @Override
            public int getBatchSize() {
                return ROWS;
            }
        });

        log.info("init mysql finished with count {}", jdbcTemplate.queryForObject("SELECT COUNT(*) FROM `r`", Long.class));
    }

    // populate data into Redis
    private void initRedis() {
        IntStream.rangeClosed(1, ROWS).forEach(i -> stringRedisTemplate.opsForValue().set("item" + i, PAYLOAD));
        log.info("init redis finished with count {}", stringRedisTemplate.keys("item*"));
    }
}

After launching the program, the following logs are output, indicating that data population has completed:

[14:22:47.195] [main] [INFO ] [o.g.t.c.n.r.CommonMistakesApplication:80  ] - init redis finished with count 100000
[14:22:50.030] [main] [INFO ] [o.g.t.c.n.r.CommonMistakesApplication:74  ] - init mysql finished with count 100000

Next, let’s compare the performance of randomly retrieving a single piece of data from MySQL and Redis. To keep the comparison fair, MySQL is also queried by Key to fetch a Value: the name field acts as the Key, the data field as the Value, and the name field has an index:

@Autowired
private JdbcTemplate jdbcTemplate;

@Autowired
private StringRedisTemplate stringRedisTemplate;

@GetMapping("redis")
public void redis() {
    // Use a random Key to retrieve the Value, which should be equal to PAYLOAD
    Assert.assertTrue(stringRedisTemplate.opsForValue().get("item" + (ThreadLocalRandom.current().nextInt(CommonMistakesApplication.ROWS) + 1)).equals(CommonMistakesApplication.PAYLOAD));
}

@GetMapping("mysql")
public void mysql() {
    // Use a random name to query the data; the name field has an index, and the result should be equal to PAYLOAD
    Assert.assertTrue(jdbcTemplate.queryForObject("SELECT data FROM `r` WHERE name=?", new Object[]{("item" + (ThreadLocalRandom.current().nextInt(CommonMistakesApplication.ROWS) + 1))}, String.class)
            .equals(CommonMistakesApplication.PAYLOAD));
}

On my computer, I ran a load test with wrk using 10 threads and 50 concurrent connections. The 90th-percentile latency for MySQL is 61 ms with a QPS of 1,460, while for Redis it is around 5 ms with a QPS of 14,008, almost ten times that of MySQL:

[Figure: wrk load-test results for single-key lookups against MySQL and Redis]

However, Redis falls short when it comes to searching Keys. In MySQL, a LIKE prefix match can use the B+ tree index on the name column to speed up the search. In Redis, by contrast, searching Keys with the KEYS command is equivalent to a full table scan in MySQL.

Here’s a code snippet comparing their performances:

@GetMapping("redis2")
public void redis2() {
    // Assert that the number of keys that match "item71*" equals 1111
    Assert.assertTrue(stringRedisTemplate.keys("item71*").size() == 1111);
}

@GetMapping("mysql2")
public void mysql2() {
    // Assert that the number of names matching "item71%" equals 1111
    Assert.assertTrue(jdbcTemplate.queryForList("SELECT name FROM `r` WHERE name LIKE 'item71%'", String.class).size() == 1111);
}

As observed, MySQL now achieves 157 times the QPS of Redis, and its latency is only about one-tenth of Redis’s:

[Figure: wrk load-test results for prefix Key search with Redis KEYS vs MySQL LIKE]

There are two reasons for Redis’s slowness:

  • The KEYS command in Redis has an O(n) time complexity. This operation becomes considerably slow if the number of keys in the database is large.
  • Redis is a single-threaded application. Slow commands, when invoked concurrently, will be executed in a serial manner and, therefore, take a long time to process.

Therefore, in most business scenarios Redis is used as a cache in front of a relational database rather than as a standalone, independent database.

It is worth mentioning that Redis provides a variety of data structures (e.g., Sets, SortedSets, Hashes, Lists) and offers rich APIs around them. By utilizing this feature, we can directly perform some server-side calculations in Redis, thereby avoiding the overhead of “read cache -> calculate data -> save cache,” and further improving performance.
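
For example, a leaderboard can be maintained entirely inside Redis with a SortedSet. Here is a minimal sketch using the same StringRedisTemplate as above; the "leaderboard" key and the endpoint are hypothetical, not part of the original test project:

@GetMapping("top10")
public Set<String> top10() {
    // ZINCRBY updates a member's score in place on the server,
    // avoiding the "read cache -> calculate data -> save cache" round trip
    stringRedisTemplate.opsForZSet().incrementScore("leaderboard",
            "user" + ThreadLocalRandom.current().nextInt(100), 1);
    // ZREVRANGE returns the 10 highest-scoring members, computed by Redis itself
    return stringRedisTemplate.opsForZSet().reverseRange("leaderboard", 0, 9);
}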

Taking advantage of each other’s strengths: InfluxDB vs MySQL #

InfluxDB is an excellent time-series database. In the “Ready for production” lecture, we used InfluxDB for metrics tracking. The advantage of a time-series database lies in its ability to handle aggregation of metric data and its high read and write efficiency.

Similarly, we conducted some tests to compare the performance of InfluxDB and MySQL.

In the following code, we populate 10 million rows of data into both MySQL and InfluxDB. Each row contains only three columns: an ID, a timestamp, and a random value below 10,000. For MySQL, we index the timestamp column:

@SpringBootApplication
@Slf4j
public class CommonMistakesApplication {

    public static final int ROWS = 10000000;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Autowired
    private StandardEnvironment standardEnvironment;

    @PostConstruct
    public void init() {
        if (Arrays.stream(standardEnvironment.getActiveProfiles()).anyMatch(s -> s.equalsIgnoreCase("init"))) {
            initInfluxDB();
            initMySQL();
        }
    }

    private void initMySQL() {
        long begin = System.currentTimeMillis();

        jdbcTemplate.execute("DROP TABLE IF EXISTS `m`;");

        jdbcTemplate.execute("CREATE TABLE `m` (\n" +
                "  `id` bigint(20) NOT NULL AUTO_INCREMENT,\n" +
                "  `value` bigint NOT NULL,\n" +
                "  `time` timestamp NOT NULL,\n" +
                "  PRIMARY KEY (`id`),\n" +
                "  KEY `time` (`time`) USING BTREE\n" +
                ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;");

        String sql = "INSERT INTO `m` (`value`,`time`) VALUES (?,?)";

        jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {

            @Override
            public void setValues(PreparedStatement preparedStatement, int i) throws SQLException {
                preparedStatement.setLong(1, ThreadLocalRandom.current().nextInt(10000));
                preparedStatement.setTimestamp(2, Timestamp.valueOf(LocalDateTime.now().minusSeconds(5 * i)));
            }

            @Override
            public int getBatchSize() {
                return ROWS;
            }

        });

        log.info("init mysql finished with count {} took {}ms", jdbcTemplate.queryForObject("SELECT COUNT(*) FROM `m`", Long.class), System.currentTimeMillis()-begin);
    }

    private void initInfluxDB() {
        long begin = System.currentTimeMillis();

        OkHttpClient.Builder okHttpClientBuilder = new OkHttpClient().newBuilder()
                .connectTimeout(1, TimeUnit.SECONDS)
                .readTimeout(10, TimeUnit.SECONDS)
                .writeTimeout(10, TimeUnit.SECONDS);

        try (InfluxDB influxDB = InfluxDBFactory.connect("http://127.0.0.1:8086", "root", "root", okHttpClientBuilder)) {

            String db = "performance";
            // Drop and recreate the test database
            influxDB.query(new Query("DROP DATABASE " + db));
            influxDB.query(new Query("CREATE DATABASE " + db));
            // Set the database
            influxDB.setDatabase(db);
            // Batch insert, flush every 10,000 records or every 1 second
            influxDB.enableBatch(BatchOptions.DEFAULTS.actions(10000).flushDuration(1000));
            IntStream.rangeClosed(1, ROWS).mapToObj(i -> Point
                    .measurement("m")
                    .addField("value", ThreadLocalRandom.current().nextInt(10000))
                    .time(LocalDateTime.now().minusSeconds(5 * i).toInstant(ZoneOffset.UTC).toEpochMilli(), TimeUnit.MILLISECONDS)
                    .build())
                    .forEach(influxDB::write);
            influxDB.flush();
            log.info("init influxdb finished with count {} took {}ms", influxDB.query(new Query("SELECT COUNT(*) FROM m")).getResults().get(0).getSeries().get(0).getValues().get(0).get(1), System.currentTimeMillis() - begin);
        }
    }
}

Keep in mind that this code is for testing purposes only; the connection settings and configuration would need to be adjusted for a real production environment.

After starting, the program output the following logs:

[16:08:25.062] [main] [INFO ] [o.g.t.c.n.i.CommonMistakesApplication:104 ] - init influxdb finished with count 1.0E7 took 54280ms
[16:11:50.462] [main] [INFO ] [o.g.t.c.n.i.CommonMistakesApplication:80  ] - init mysql finished with count 10000000 took 205394ms

InfluxDB took only 54 seconds to batch insert 10 million rows, roughly 180,000 rows per second, which is quite fast. MySQL’s batch insert speed is also respectable, at about 48,000 rows per second.

Next, let’s test a query over these 10 million rows: retrieve the data from the last 60 days, aggregate it by hour, and compute the maximum, minimum, and average of the value column, the kind of result that would then be plotted as a line chart:

@Autowired
private JdbcTemplate jdbcTemplate;

@GetMapping("mysql")
public void mysql() {
    long begin = System.currentTimeMillis();
    // Use SQL to query from MySQL, group by hour
    Object result = jdbcTemplate.queryForList("SELECT date_format(time,'%Y%m%d%H'),max(value),min(value),avg(value) FROM m WHERE time>now()- INTERVAL 60 DAY GROUP BY date_format(time,'%Y%m%d%H')");
    log.info("took {} ms result {}", System.currentTimeMillis() - begin, result);
}

@GetMapping("influxdb")
public void influxdb() {
    long begin = System.currentTimeMillis();
    try (InfluxDB influxDB = InfluxDBFactory.connect("http://127.0.0.1:8086", "root", "root")) {
        // Switch database
        influxDB.setDatabase("performance");
        // InfluxQL, the query language of InfluxDB, is similar to SQL
        Object result = influxDB.query(new Query("SELECT MEAN(value),MIN(value),MAX(value) FROM m WHERE time > now() - 60d GROUP BY TIME(1h)"));
        log.info("took {} ms result {}", System.currentTimeMillis() - begin, result);
    }
}

Since the amount of data is very large, a single query is already slow, so we won’t perform stress testing this time. After calling the two interfaces separately, we can see that the MySQL query takes about 29 seconds, while the InfluxDB query takes only 980ms:

[16:19:26.562] [http-nio-45678-exec-1] [INFO ] [o.g.t.c.n.i.PerformanceController:31 ] - took 28919 ms result [{date_format(time,'%Y%m%d%H')=2019121308, max(value)=9993, min(value)=4, avg(value)=5129.5639}, {date_format(time,'%Y%m%d%H')=2019121309, max(value)=9990, min(value)=12, avg(value)=4856.0556}, {date_format(time,'%Y%m%d%H')=2019121310, max(value)=9998, min(value)=8, avg(value)=4948.9347}, {date_format(time,'%Y%m%d%H')...

[16:20:08.170] [http-nio-45678-exec-6] [INFO ] [o.g.t.c.n.i.PerformanceController:40 ] - took 981 ms result QueryResult [results=[Result [series=[Series [name=m, tags=null, columns=[time, mean, min, max], values=[[2019-12-13T08:00:00Z, 5249.2468619246865, 21.0, 9992.0],...

In this case of aggregating data over a time range, we can see InfluxDB’s performance advantage. However, we certainly cannot treat InfluxDB as a general-purpose database, because:

InfluxDB does not support update operations. After all, time-series data only accumulates new points as time passes; past data is not modified.

From a data-model perspective, time-series data has no single primary key identifier. Every point must carry a timestamp, and data can only be keyed by timestamp, which makes it unsuitable for regular business data.

In addition, it is worth noting that even when using InfluxDB to store time-related metric data, we should be cautious about misusing tags.

InfluxDB’s tag feature lets you attach multiple tags to each measurement. Tags are indexed, so you can filter and group by them. However, tags should only hold a limited, enumerable set of values and must not store data such as URLs; otherwise, high series cardinality can consume a large amount of memory and even cause OOM errors. You can click here to view the relationship between series count and memory usage. For InfluxDB, we cannot save raw data like URLs to the database; we can only classify such data and store it under a bounded set of tags.
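
For illustration, here is a minimal sketch of the distinction, using the same influxdb-java client as in the earlier example; the "api_stats" measurement and its tag and field names are hypothetical. A bounded value such as an interface name works as a tag, while an unbounded value such as the raw URL should go into a field:

// Sketch only: "api_stats" and its tag/field names are made up for illustration
influxDB.write(Point.measurement("api_stats")
        // OK as a tag: a small, enumerable set of values that can be indexed cheaply
        .tag("interfaceName", "createOrder")
        // Unbounded values such as the raw URL belong in a field, not a tag;
        // as a tag, every distinct value would create a new series and blow up memory
        .addField("url", "/order/create?id=12345")
        .addField("value", ThreadLocalRandom.current().nextInt(10000))
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .build());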

In summary, for MySQL, when dealing with a large amount of data, using full table scans to aggregate and calculate metric data is very slow and can only be used as a temporary solution. In this case, it is necessary to introduce a time series database like InfluxDB. Time series databases can be used as the main storage for specific scenarios (such as monitoring and statistics) or as auxiliary data sources to store metric data from business systems.

Leveraging the strengths of Elasticsearch and MySQL #

Elasticsearch (hereafter referred to as ES) is a popular distributed search and analysis database that is particularly suitable for full-text searches due to its unique inverted index structure.

In simple terms, an inverted index can be thought of as a Map where the keys are the segmented keywords and the values are lists of document IDs/fragment IDs. By inputting the desired search term, we can directly retrieve a list of document IDs/fragment IDs from this Map that contains the keyword. We can then query the actual document content based on these document IDs/fragment IDs.
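
To make the idea concrete, here is a toy sketch of that Map-based view of an inverted index. It is only an illustration of the lookup pattern, not how Lucene actually stores its index:

public class ToyInvertedIndex {
    // keyword -> IDs of the documents that contain the keyword
    private final Map<String, Set<Long>> index = new HashMap<>();

    // "Indexing": record the document ID under every token the document contains
    public void add(long docId, List<String> tokens) {
        tokens.forEach(token -> index.computeIfAbsent(token, k -> new HashSet<>()).add(docId));
    }

    // "Searching": one Map lookup returns the IDs of all matching documents,
    // which can then be used to fetch the full document contents
    public Set<Long> search(String keyword) {
        return index.getOrDefault(keyword, Collections.emptySet());
    }
}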

Let’s test and compare the efficiency difference between performing full-text searches using ES and using the LIKE operator in MySQL.

First, let’s define an entity called News that includes fields such as the news category, title, and content. The same class serves as the entity for both Spring Data JPA and Spring Data Elasticsearch:

@Entity
@Document(indexName = "news", replicas = 0) //@Document annotation defines this as an ES index named "news" without data redundancy
@Table(name = "news", indexes = {@Index(columnList = "cateid")}) //@Table annotation defines this as a MySQL table named "news" with an index on the "cateid" column
@Data
@AllArgsConstructor
@NoArgsConstructor
@DynamicUpdate
public class News {
    @Id
    private long id;
    @Field(type = FieldType.Keyword)
    private String category; // News category name
    private int cateid; // News category ID
    @Column(columnDefinition = "varchar(500)")//@Column annotation defines a field in MySQL, for example, it defines the "title" column as a varchar(500)
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")//@Field annotation defines the format of the ES field and uses the IK tokenizer for tokenizing
    private String title; // News title
    @Column(columnDefinition = "text")
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String content; // News content
}

Next, let’s implement the main program. When the program starts, we will load 4,000 news data from a CSV file and make 100 copies to create a total of 400,000 data records, which will be written to both MySQL and ES:

@SpringBootApplication
@Slf4j
@EnableElasticsearchRepositories(includeFilters = @ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, value = NewsESRepository.class)) // Explicitly set which is the ES Repository
@EnableJpaRepositories(excludeFilters = @ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, value = NewsESRepository.class)) // The others are MySQL Repositories
public class CommonMistakesApplication {
    public static void main(String[] args) {
        Utils.loadPropertySource(CommonMistakesApplication.class, "es.properties");
        SpringApplication.run(CommonMistakesApplication.class, args);
    }
    @Autowired
    private StandardEnvironment standardEnvironment;
    @Autowired
    private NewsESRepository newsESRepository;
    @Autowired
    private NewsMySQLRepository newsMySQLRepository;
    @PostConstruct
    public void init() {
        // Use -Dspring.profiles.active=init to initialize the program
        if (Arrays.stream(standardEnvironment.getActiveProfiles()).anyMatch(s -> s.equalsIgnoreCase("init"))) {
            // The original data in the CSV file only contains 4,000 records
            List<News> news = loadData();
            AtomicLong atomicLong = new AtomicLong();
            news.forEach(item -> item.setTitle("%%" + item.getTitle()));
            // We simulate a data size 100 times larger, which is 400,000 records
            IntStream.rangeClosed(1, 100).forEach(repeat -> {
                news.forEach(item -> {
                    // Reset the primary key ID
                    item.setId(atomicLong.incrementAndGet());
                    // Modify the title field slightly each time by adding a digit in front to indicate the number of copies
                    item.setTitle(item.getTitle().replaceFirst("%%", String.valueOf(repeat)));
                });
                initMySQL(news, repeat == 1);
                log.info("init MySQL finished for {}", repeat);
                initES(news, repeat == 1);
                log.info("init ES finished for {}", repeat);
            });
        }
    }
    // Parse the original data from news.csv
    private List<News> loadData() {
        // Use jackson-dataformat-csv to convert CSV to POJO
        CsvMapper csvMapper = new CsvMapper();
        CsvSchema schema = CsvSchema.emptySchema().withHeader();
        ObjectReader objectReader = csvMapper.readerFor(News.class).with(schema);
        ClassLoader classLoader = getClass().getClassLoader();
        File file = new File(classLoader.getResource("news.csv").getFile());
        try (Reader reader = new FileReader(file)) {
            return objectReader.<News>readValues(reader).readAll();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
    // Save the data to ES
    private void initES(List<News> news, boolean clear) {
        if (clear) {
            // Delete historical data when calling for the first time
            newsESRepository.deleteAll();
        }
        newsESRepository.saveAll(news);
    }
    // Save the data to MySQL

    private void initMySQL(List<News> news, boolean clear) {
        if (clear) {
            // Delete historical data when called for the first time
            newsMySQLRepository.deleteAll();
        }
        newsMySQLRepository.saveAll(news);
    }
}

Since we are using Spring Data, we can define two Repositories and directly define query methods without implementing any logic to achieve querying. Spring Data will generate the corresponding SQL statements and ES query DSL based on the method names. The translation logic for ES can be found here.

Here, we define a method countByCateidAndContentContainingAndContentContaining, which expresses the query conditions: the category equals the cateid parameter and the content contains both keyword1 and keyword2. It returns the total number of news items that match.

@Repository
public interface NewsMySQLRepository extends JpaRepository<News, Long> {
    // JPA: count the news items whose category equals cateid and whose content contains both keyword1 and keyword2
    long countByCateidAndContentContainingAndContentContaining(int cateid, String keyword1, String keyword2);
}

@Repository
public interface NewsESRepository extends ElasticsearchRepository<News, Long> {
    // ES: count the news items whose category equals cateid and whose content contains both keyword1 and keyword2
    long countByCateidAndContentContainingAndContentContaining(int cateid, String keyword1, String keyword2);
}

For both ES and MySQL, we use the same search conditions: the category is 1 and the keywords are "社会" (society) and "苹果" (apple). We then output the search results and the time taken:

// Test MySQL search and output the time taken and the result
@GetMapping("mysql")
public void mysql(@RequestParam(value = "cateid", defaultValue = "1") int cateid,
                  @RequestParam(value = "keyword1", defaultValue = "社会") String keyword1,
                  @RequestParam(value = "keyword2", defaultValue = "苹果") String keyword2) {
    long begin = System.currentTimeMillis();
    Object result = newsMySQLRepository.countByCateidAndContentContainingAndContentContaining(cateid, keyword1, keyword2);
    log.info("took {} ms result {}", System.currentTimeMillis() - begin, result);
}

// Test ES search and output the time taken and the result
@GetMapping("es")
public void es(@RequestParam(value = "cateid", defaultValue = "1") int cateid,
               @RequestParam(value = "keyword1", defaultValue = "社会") String keyword1,
               @RequestParam(value = "keyword2", defaultValue = "苹果") String keyword2) {
    long begin = System.currentTimeMillis();
    Object result = newsESRepository.countByCateidAndContentContainingAndContentContaining(cateid, keyword1, keyword2);
    log.info("took {} ms result {}", System.currentTimeMillis() - begin, result);
}

By calling the respective interfaces, we can see that ES takes only 48 ms, while MySQL takes over 6 seconds, more than 100 times slower than ES. Although the news category ID has an index, that index only speeds up filtering by category; a B+ tree index is of no help for full-text search over the content.

Combining NoSQL and MySQL into a composite database architecture for high concurrency #

Now, through some cases, we have seen that NoSQL databases like Redis, InfluxDB, and Elasticsearch have their strengths and weaknesses in different scenarios. So, is there an all-round database available?

I believe there isn’t. Each storage system has its unique data structure, which determines its strengths and weaknesses in different scenarios.

For example, MySQL’s InnoDB engine’s B+ tree is friendly for sorting and range queries, and the cost of frequent data updates is not too high, making it suitable for OLTP (On-Line Transaction Processing).

Another example is Elasticsearch’s Lucene, which uses FST (Finite State Transducer) index + inverted index, has high space efficiency, and is suitable for indexing infrequently changing data and implementing full-text search. A storage system itself cannot use multiple data structures to store the same set of data, so it cannot be suitable for all scenarios.

Although MySQL’s performance is acceptable in most business scenarios, for large-scale, high-traffic, complex Internet applications, MySQL’s implementation of ACID (Atomicity, Consistency, Isolation, Durability) makes it relatively heavy. In addition, its limited ability to scale horizontally and its limited feature set mean it cannot absorb every data volume, every traffic level, or every functional requirement on its own. Therefore, at the architecture level, we need to combine multiple storage systems so that they complement each other’s shortcomings and achieve a "1+1>2" effect.

Let me give you an example. We have designed an architecture that uses multiple database systems within a single data service to handle a variety of high-concurrency scenarios. It consists of three parts: a synchronous write service, an asynchronous write service, and a query service, which handle primary database writes, auxiliary database writes, and query routing respectively.

Let’s analyze this architecture based on its services.

[Figure: composite architecture with a synchronous write service writing MySQL master data and publishing MQ messages, an asynchronous write service updating ES and InfluxDB auxiliary data, and a query service routing reads to the appropriate storage]

Firstly, it should be clear that important business master data can only be stored in relational databases like MySQL for three reasons:

RDBMS has been verified over decades and has become very mature.

RDBMS has a large user base, which enables fast bug fixing, stable versions, and high reliability.

RDBMS emphasizes ACID, which ensures data integrity.

Two types of query tasks stay in MySQL, because they play to MySQL’s strengths and achieve good performance:

  1. Query by primary key ID. Directly querying the clustered index will result in high performance. However, once the data volume of a single table exceeds billions, the performance will also degrade. Additionally, a single database cannot handle huge query concurrency. Therefore, we can shard the data tables and evenly distribute them to multiple database instances. We refer to this database cluster as a Sharding cluster.

  2. Range queries with various conditions that resolve to primary key IDs. These hit a secondary index to obtain the primary key, which only requires traversing a B+ tree and is very efficient. However, it is not a good idea to index large values, such as a varchar(1000) column; indexing foreign keys (usually int or bigint) performs much better. Therefore, we can create an "index table" in MySQL that mainly stores the foreign keys of the related tables and as few varchar fields as possible. Most columns of this index table can carry secondary indexes for simple searches, and the search result is a list of primary keys rather than complete rows. Since the index table has lightweight columns and few of them (usually within 10), it does not cause significant problems even without sharding; a rough sketch follows below.
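
As a rough sketch of what such an index table might look like for an order scenario (the table name, columns, and the userId/status parameters are hypothetical, not from the original project):

// Hypothetical order index table: only foreign keys and short, easily indexed columns,
// so secondary indexes stay small and queries return lists of primary keys
jdbcTemplate.execute("CREATE TABLE `order_index` (\n" +
        "  `order_id` bigint(20) NOT NULL,\n" +
        "  `user_id` bigint(20) NOT NULL,\n" +
        "  `merchant_id` bigint(20) NOT NULL,\n" +
        "  `status` tinyint NOT NULL,\n" +
        "  `created_at` timestamp NOT NULL,\n" +
        "  PRIMARY KEY (`order_id`),\n" +
        "  KEY `idx_user` (`user_id`) USING BTREE,\n" +
        "  KEY `idx_merchant_status` (`merchant_id`,`status`) USING BTREE,\n" +
        "  KEY `idx_created` (`created_at`) USING BTREE\n" +
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;");

// A conditional search hits a secondary index and returns only primary keys;
// the full rows are then fetched by ID from the Sharding cluster (or Redis)
List<Long> orderIds = jdbcTemplate.queryForList(
        "SELECT order_id FROM `order_index` WHERE user_id = ? AND status = ?",
        Long.class, userId, status);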

As shown by the blue lines in the figure, the three steps of writing to two types of MySQL data tables and sending MQ messages are completed by a synchronous write service. As mentioned in “Asynchronous Processing,” all asynchronous processes need to be compensated, and the same applies here. However, for simplicity’s sake, I have omitted the compensation process here.

Then, as shown by the green line in the figure, there is an asynchronous write service that listens to MQ messages and continues to perform the update operations on auxiliary data. Here, we have chosen ES and InfluxDB as the auxiliary databases. Therefore, the entire asynchronous write data operation consists of three steps:

  1. The MQ message may not contain complete data or may only contain the primary key ID of the latest data. We need to query the complete data based on the ID from the query service.

  2. The data written to InfluxDB can generally be aggregated based on time intervals and written to InfluxDB periodically. Therefore, a simple client-side aggregation is performed here, followed by writing to InfluxDB.

  3. ES is not suitable for performing join operations between different indexes. It is suitable for storing denormalized data. For example, we can embed information such as user, merchant, and product lists under an order into the entire order JSON as nested objects and store the flattened JSON directly in ES.
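
For example, a denormalized order document for ES might look roughly like the following sketch; the class and field names are hypothetical and only illustrate the idea of embedding related data instead of joining:

// Hypothetical denormalized order document for ES: related entities are copied
// into the order JSON at write time, so searches never need a cross-index join
@Document(indexName = "order")
@Data
public class OrderDoc {
    @Id
    private long id;
    @Field(type = FieldType.Keyword)
    private String buyerName;      // copied from the user record when the order is written
    @Field(type = FieldType.Keyword)
    private String merchantName;   // copied from the merchant record when the order is written
    @Field(type = FieldType.Nested)
    private List<OrderItem> items; // product list embedded as nested objects

    @Data
    public static class OrderItem {
        @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
        private String productTitle;
        private int quantity;
        private long price;
    }
}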

For writes, the master data is considered successfully written once the synchronous write service returns. The auxiliary data, however, is written asynchronously and cannot be guaranteed to be immediately up to date; it may lag for reasons such as:

  1. The loss of asynchronous messages, which requires compensation processing.

  2. The indexing operation itself in ES is relatively slow.

  3. The data written to InfluxDB requires client-side periodic aggregation.

Therefore, for the query service, as shown by the red line in the figure, we need to route the request to the appropriate database based on certain contextual conditions, such as query consistency requirements, timeliness requirements, search conditions, fields to be returned, search time range, etc., and perform some aggregation processing:

  1. For querying single data based on primary key, we can query from MySQL Sharding cluster or Redis. If real-time requirements are not high, we can also query from ES.

  2. In scenarios where orders need to be searched based on multiple conditions, we can query the list of primary keys from the MySQL index table, then retrieve the data details based on the primary keys from the MySQL Sharding cluster or Redis.

  3. Various backend systems may need to use more complex search conditions, such as full-text search, to query order data, or scheduled analysis tasks may need to query large amounts of data at once. These scenarios do not require high data real-time requirements and can be searched in ES. Furthermore, data in MySQL can be archived, and we can keep historical data in ES. Generally, the concurrency for querying historical data won’t be very high, so it can be uniformly routed to ES for queries.

  4. Monitoring systems or backend reporting systems may need to present business monitoring charts or tables, and requests can be routed to InfluxDB for queries.
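
A highly simplified sketch of this routing decision is shown below. The QueryRouter class and its parameters are made up for illustration; a real query service would also weigh the consistency and timeliness requirements carried in the request context:

public class QueryRouter {

    public enum Storage { MYSQL_SHARDING, REDIS, MYSQL_INDEX_TABLE, ELASTICSEARCH, INFLUXDB }

    // Pick the storage that should serve a query, mirroring the routing rules above
    public static Storage route(boolean byPrimaryKey, boolean needsRealtime,
                                boolean fullTextOrHistorical, boolean metrics) {
        if (metrics) {
            return Storage.INFLUXDB;       // monitoring charts and backend reports
        }
        if (fullTextOrHistorical) {
            return Storage.ELASTICSEARCH;  // complex or full-text searches, archived data
        }
        if (byPrimaryKey) {
            // single-row lookups: go to the Sharding cluster when freshness matters,
            // otherwise the Redis cache (or even ES) is good enough
            return needsRealtime ? Storage.MYSQL_SHARDING : Storage.REDIS;
        }
        // conditional searches: resolve primary keys via the index table first,
        // then load the row details from the Sharding cluster (or Redis) by ID
        return Storage.MYSQL_INDEX_TABLE;
    }
}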

Key points review #

Today, I compared the performance of cache database Redis, time series database InfluxDB, search database ES, and MySQL through three cases. Here’s what we observed:

Redis has much higher read performance for individual data compared to MySQL, but it is not suitable for range searches.

InfluxDB has much higher efficiency in aggregating time series data compared to MySQL, but it is not a general-purpose database because it lacks a primary key.

ES has much higher full-text search capability for keywords compared to MySQL, but the efficiency of updating fields is lower, making it unsuitable for storing frequently updated data.

Finally, we presented an architectural solution that combines MySQL + Redis + InfluxDB + ES, fully leveraging the strengths of various databases, and working together to form a storage architecture capable of handling complex queries and high-concurrency read and write operations.

The master data consists of two types of MySQL tables, where the index table is responsible for simple conditional searches to obtain the primary key, and the sharding table handles high-concurrency primary key queries. The master data is written by a synchronous write service, which sends out MQ messages after writing.

The auxiliary data can use a suitable NoSQL according to requirements, and can be asynchronously written by one or more separate asynchronous write services listening to the MQ.

A unified query service is used to handle all query requirements, routing queries to the appropriate storage based on different query needs, ensuring that each storage system can leverage its strengths according to the scenario, and distribute the query load across different database systems.

The code used today is all available on GitHub. You can click on this link to view it.

Reflection and Discussion #

We mentioned that InfluxDB cannot have too many tags. Can you write a piece of test code to simulate this problem and observe the memory usage of InfluxDB?

MongoDB, a document database, is also a popular NoSQL database. What do you think are the advantages and disadvantages of MongoDB? In what scenarios is it suitable to use MongoDB?

Regarding data storage, do you have any other insights? I am Zhu Ye, and I welcome you to leave a comment in the comment section to share your thoughts. You are also welcome to share today’s content with your friends or colleagues for further discussion.