17 How to Correctly Display Random Messages #

In my previous article, I explained several execution modes of the order by statement. While writing that article, I thought of a performance issue that a friend who was making an English learning app had encountered before. In today’s article, I will start with this performance issue and talk to you about another sorting requirement in MySQL, hoping to deepen your understanding of MySQL sorting logic.

This English learning app has a feature on its homepage that displays random words. In other words, based on the level of each user, there is a word list for that user, and every time the user visits the homepage, three random words will be displayed. They found that as the word list grew larger, the logic of selecting words became slower and even affected the speed of opening the homepage.

Now, if you were to design this SQL statement, how would you write it?

To make it easier to understand, I have simplified this example: I have removed the logic that each level of user has a corresponding word list, and directly select three words randomly from a word list. The create table statement and the initial data command for this table are as follows:

mysql> CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
 
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=0;
  while i<10000 do
    insert into words(word) values(concat(char(97+(i div 1000)), char(97+(i % 1000 div 100)), char(97+(i % 100 div 10)), char(97+(i % 10))));
    set i=i+1;
  end while;
end;;
delimiter ;

call idata();

For the sake of quantification, I have inserted 10,000 rows of records into this table. Next, let’s take a look at how to randomly select 3 words, what methods can be used, what problems exist, and how to improve them.
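
If the char()/div/% arithmetic in the stored procedure is hard to parse at a glance, here is the same word-generation rule written out in Python. This is purely an illustration of what idata() inserts, not anything MySQL runs:

# Each i from 0 to 9999 is turned into a four-letter word, one letter per
# decimal digit: 0 -> 'a', 1 -> 'b', ..., 9 -> 'j'. So the table holds the
# words 'aaaa' through 'jjjj'.
def make_word(i):
    digits = (i // 1000, i % 1000 // 100, i % 100 // 10, i % 10)
    return "".join(chr(97 + d) for d in digits)

print(make_word(0), make_word(123), make_word(9999))   # aaaa abcd jjjj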

Memory Temporary Tables #

First, you would probably think of using order by rand() to implement this logic.

mysql> select word from words order by rand() limit 3;

The meaning of this statement is straightforward: sort the rows in random order and take the first 3. Although the statement is simple to write, its execution process is rather complicated.

Let’s use the explain command to see how this statement is executed.

Figure 1: Using the explain command to view the execution of the statement.

The Extra field shows Using temporary, indicating that a temporary table is required, and Using filesort, indicating that a sorting operation is required.

Therefore, the meaning of this Extra is that a temporary table is required and it needs to be sorted on the temporary table.

Here, you can review the content of the [previous article] on full-column sorting and rowid sorting. I’m pasting the two flowcharts from the previous article here to make it easier for you to review.

Figure 2: Full-column sorting

Figure 3: Rowid sorting

Next, let me ask you a question. For sorting the temporary memory table, which algorithm do you think it will choose? Reviewing one of the conclusions from the previous article: For InnoDB tables, performing a full-column sorting reduces disk access, so it will be given priority.

I deliberately emphasized “InnoDB tables” there, and you may have guessed why: for a memory table, returning data only requires reading memory directly at the row’s position, with no additional disk access. The optimizer no longer has that concern, so what it prioritizes instead is keeping the rows used for sorting as small as possible. Therefore, MySQL chooses rowid sorting at this point.

After understanding the logic behind this algorithm selection, let’s take a look at the execution process of the statement. At the same time, through this example today, let’s try to analyze the number of scanned rows in the statement.

The execution process of this statement is as follows:

  1. Create a temporary table. This temporary table uses the MEMORY storage engine and has two fields. The first field is of type double; for ease of description, I will call it field R. The second field is of type varchar(64), which I will call field W. This table has no index.
  2. From the words table, retrieve all the word values in primary key order. For each word value, call the rand() function to generate a random decimal between 0 and 1, and store this random decimal and the word into fields R and W of the temporary table, respectively. At this point, the number of scanned rows is 10,000.
  3. Now the temporary table has 10,000 rows of data. Next, you need to sort this memory temporary table, which has no index, by field R.
  4. Initialize the sort_buffer. The sort_buffer has two fields, one is of type double, and the other is of type integer.
  5. Retrieve R value and position information (I will explain later why it is “position information”) from the memory temporary table, and store them in the two fields of the sort_buffer one by one. This process requires a full table scan of the memory temporary table. At this point, the number of scanned rows increases to 20,000.
  6. Sort the sort_buffer according to the value of R. Note that this process does not involve table operations, so it will not increase the number of scanned rows.
  7. After the sorting is completed, retrieve the position information of the top three results, use it to fetch the corresponding word values from the memory temporary table one by one, and return them to the client. During this step, three more rows of the table are accessed, so the total number of scanned rows becomes 20,003. (A rough simulation of this counting follows below.)
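
To make this scanned-row accounting concrete, here is a rough Python simulation of the same flow. It is only an illustration of the counting logic under the steps above, not MySQL's internal code:

import random

# Stand-in for the 10,000-row words table.
words = ["word%d" % i for i in range(10000)]
scanned = 0

# Steps 1-2: build the memory temporary table (R, W), scanning words once.
tmp_table = []
for w in words:
    scanned += 1
    tmp_table.append((random.random(), w))        # (R, W)

# Steps 4-5: rowid sorting reads (R, pos) from the temporary table into sort_buffer.
sort_buffer = []
for pos, (r, _w) in enumerate(tmp_table):
    scanned += 1
    sort_buffer.append((r, pos))

# Step 6: sort by R; no table access, so no extra scanned rows.
sort_buffer.sort()

# Step 7: fetch the word values of the top three positions from the temporary table.
result = []
for _r, pos in sort_buffer[:3]:
    scanned += 1
    result.append(tmp_table[pos][1])

print(result, scanned)                            # scanned == 20003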

Next, let’s verify the number of scanned rows we derived from our analysis through the slow query log.

# Query_time: 0.900376  Lock_time: 0.000347 Rows_sent: 3 Rows_examined: 20003
SET timestamp=1541402277;
select word from words order by rand() limit 3;

In this query, Rows_examined: 20003 indicates that this statement scanned 20003 rows during the execution process, which confirms our analysis.

As a side note, during the process of studying concepts, you can often do this - first analyze and calculate the number of scanned rows based on the principles, and then verify your conclusions by checking the slow query log. I often do this myself, and it is very interesting. I feel happy when my analysis is correct, and I also feel happy when I analyze incorrectly but understand the reasons.

Now, let me draw the complete diagram of the random sorting execution process.

Figure 4: Complete flowchart of random sorting 1

The “pos” in the figure represents the position information. You may find this strange: what exactly does “position information” mean here? After all, in the previous article, when we sorted an InnoDB table, we located rows by their ID field.

At this point, we need to go back to a basic concept: How does MySQL locate a “row of data” in a table?

In the fourth and fifth articles, which introduced indexes, some students asked: if the primary key of an InnoDB table is deleted, does that mean the table has no primary key and there is no way to go back to the table to fetch a full row?

Actually, that’s not the case. If you create a table without a primary key, or delete the primary key of a table, InnoDB will automatically generate a rowid of 6 bytes in length as the primary key.

This is where the name “rowid” comes from in the sorting mode. In fact, it represents the information that each engine uses to uniquely identify a data row.

  • For an InnoDB table with a primary key, this rowid is the primary key ID;
  • For an InnoDB table without a primary key, this rowid is generated by the system;
  • The MEMORY engine is not an index-organized table. In this example, you can consider it an array. Therefore, this rowid is actually the index of the array.

At this point, let me briefly summarize: order by rand() uses an in-memory temporary table, and the rowid sorting method is used when sorting the in-memory temporary table.

Disk-based Temporary Table #

So, are all temporary tables in memory?

Actually, that’s not the case. The tmp_table_size setting limits the size of in-memory temporary tables, with a default value of 16MB. If the size of the temporary table exceeds tmp_table_size, the in-memory temporary table will be converted to a disk-based temporary table.

The default engine used for disk-based temporary tables is InnoDB, controlled by the internal_tmp_disk_storage_engine parameter.

When using a disk-based temporary table, it corresponds to the sorting process of an InnoDB table without explicit indexes.

To reproduce this process, I set tmp_table_size to 1024, sort_buffer_size to 32768, and max_length_for_sort_data to 16.

set tmp_table_size=1024;
set sort_buffer_size=32768;
set max_length_for_sort_data=16;
/* Enable optimizer_trace, only valid for this session */
SET optimizer_trace='enabled=on'; 
 
/* Execute the statement */
select word from words order by rand() limit 3;
 
/* View OPTIMIZER_TRACE output */
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G

Figure 5: Partial result of OPTIMIZER_TRACE

Then, let’s take a look at the result of OPTIMIZER_TRACE.

Since max_length_for_sort_data is set to 16, which is smaller than the length of the “word” field, we see in the sort_mode that rowid sorting is used. This is expected, as the sorting process involves rows composed of random values from the “R” field and the rowid field.

At this point, you may have done a rough calculation in your mind and found that something is not right. The “R” field stores random values of 8 bytes, and the rowid is 6 bytes (as for why it is 6 bytes, I’ll leave that for you to contemplate later). The total number of rows is 10,000, which adds up to a total of 140,000 bytes, exceeding the sort_buffer_size of 32,768 bytes. However, the value of number_of_tmp_files is unexpectedly 0. Does this mean that no temporary file is needed?

Indeed, the sorting of this SQL statement did not use a temporary file. Instead, it adopted a new sorting algorithm introduced in MySQL 5.6 - the priority queue sorting algorithm. Next, let’s see why the temporary file-based sorting algorithm, also known as the merge sort algorithm, was not used in this case, and why the priority queue sorting algorithm was used instead.

In fact, for our current SQL statement, we only need to retrieve the three rowids with the smallest R values. However, if the merge sort algorithm were used, although we would eventually obtain the top 3 values, this algorithm would sort all 10,000 rows of data.

In other words, the remaining 9,997 rows would also be sorted. However, our query does not require these remaining rows to be sorted. So, if you think about it, it becomes clear that this approach wastes a significant amount of computational work.

On the other hand, the priority queue algorithm can obtain exactly the three smallest values. The execution process is as follows:

  1. For the 10,000 (R, rowid) pairs to be sorted, take the first three rows and construct them into a heap. (For students who are unfamiliar with data structures, you can imagine this as an array of three elements for now.)
  2. Take the next pair (R’, rowid’) and compare R’ with the largest R currently in the heap. If R’ is smaller, remove that (R, rowid) from the heap, replace it with (R’, rowid’), and adjust the heap structure accordingly.
  3. Repeat step 2 until the 10,000th (R’, rowid’) has been compared.

The advantage of this algorithm is that it only needs to keep the smallest k values (in this case, k = 3) in the heap, without sorting all of the data. A small code sketch of this process follows.
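
To make the heap manipulation concrete, here is a minimal sketch of the same top-3 selection in Python. It is only an illustration of the idea, not MySQL's implementation; Python's heapq is a min-heap, so the R values are negated to simulate the max-heap described above:

import heapq
import random

def top_k_smallest(pairs, k=3):
    """Return the k pairs with the smallest R from an iterable of (R, rowid)."""
    heap = []                                     # holds (-R, rowid), so heap[0] carries the largest kept R
    for r, rowid in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (-r, rowid))     # step 1: fill the heap with the first k rows
        elif r < -heap[0][0]:                     # step 2: compare with the largest R in the heap
            heapq.heapreplace(heap, (-r, rowid))  # replace it and re-adjust the heap
    return sorted((-neg_r, rowid) for neg_r, rowid in heap)

# Simulate the 10,000 (R, rowid) pairs produced while executing order by rand().
pairs = [(random.random(), rowid) for rowid in range(10000)]
print(top_k_smallest(pairs))                      # the three rows with the smallest R values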

Here is a simple diagram illustrating the process of sorting in a priority queue.

Figure 6: Example of priority queue sorting algorithm.

Figure 6 simulates the process of finding the three rows with the smallest R values among six (R, rowid) rows using priority queue sorting. Throughout the sorting process, in order to quickly obtain the maximum value in the current heap, the maximum value is always maintained at the top of the heap, making it a max heap.

In the OPTIMIZER_TRACE result from Figure 5, if chosen=true in the filesort_priority_queue_optimization section, it means the priority queue sorting algorithm is used, which does not require temporary files. Therefore, the corresponding number_of_tmp_files is 0.

After this process is completed, the heap we construct will contain the three rows with the smallest R values among the 10,000 rows. Then, we take out their rowids one by one and retrieve the word field from the temporary table, which is the same process as the rowid sorting process in the previous article.

Let’s take a look at the SQL query statement in the previous article:

select city, name, age from t where city='Hangzhou' order by name limit 1000;

You may wonder why the priority queue sorting algorithm is not used here even though the statement also has a limit clause. The reason is that this statement uses limit 1000: if the priority queue algorithm were used, the heap to be maintained would hold the (name, rowid) of 1,000 rows, which exceeds the sort_buffer_size I set, so the merge sort algorithm has to be used instead.

In conclusion, regardless of the type of temporary table used, the ‘order by rand()’ syntax will make the computation process very complex and require a large number of scanned rows. Therefore, the resource consumption of the sorting process will also be considerable.

Returning to the question at the beginning of our article, how can we correctly randomize the order?

Random Sorting Methods #

Let’s simplify the problem first: how can we randomly select only one word value? The idea is as follows:

  1. Obtain the maximum value M and minimum value N of the primary key id in this table.
  2. Generate a random number X between the maximum and minimum values using a random function: X = (M-N)*rand() + N.
  3. Retrieve the first row with an ID greater than or equal to X.

We can temporarily call this algorithm Random Algorithm 1. Here is the sequence of execution statements:

mysql> select max(id), min(id) into @M, @N from t;
set @X = floor((@M-@N+1)*rand() + @N);
select * from t where id >= @X limit 1;

This method is efficient because retrieving the max(id) and min(id) does not require scanning the index, and the third step of select can also be quickly located using an index, so it can be assumed that only 3 rows are scanned. However, this algorithm itself does not strictly meet the randomness requirement of the problem since there may be gaps in the IDs, and therefore, the probabilities of selecting different rows are not the same, which is not truly random.

For example, if you have 4 IDs, namely 1, 2, 4, and 5, then according to the above method, the probability of selecting the row with id=4 is twice as likely as selecting any other row.

What if the four rows have IDs 1, 2, 40000, and 40001? In this case, this algorithm can be considered a bug.
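
If you want to see this unevenness for yourself, here is a quick Python simulation (purely illustrative) of Random Algorithm 1 running against the four ids 1, 2, 4 and 5:

import collections
import random

ids = [1, 2, 4, 5]
M, N = max(ids), min(ids)
counts = collections.Counter()

for _ in range(100000):
    x = int((M - N + 1) * random.random()) + N    # same formula as @X above
    chosen = next(i for i in ids if i >= x)       # first row with id >= X
    counts[chosen] += 1

print(counts)   # id=4 is picked roughly 40% of the time, the other ids about 20% each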

Therefore, to obtain strictly random results, you can use the following process:

  1. Obtain the number of rows in the entire table and denote it as C.
  2. Obtain Y = floor(C * rand()). The role of the floor function here is to take the integer part.
  3. Use limit Y,1 to retrieve one row.

We can call this algorithm Random Algorithm 2. The following code is the sequence of execution statements for the above process:

mysql> select count(*) into @C from t;
set @Y = floor(@C * rand());
set @sql = concat("select * from t limit ", @Y, ",1");
prepare stmt from @sql;
execute stmt;
DEALLOCATE prepare stmt;

Since the parameter after limit cannot directly accept variables, I used the prepare+execute method in the code above. You can also write the method of concatenating SQL statements in the application, which would be simpler.

Random Algorithm 2 solves the obvious problem of uneven probabilities in Algorithm 1.

The way MySQL handles limit Y,1 is to read one by one in order, discard the first Y rows, and return the next record as the result. Therefore, this step requires scanning Y+1 rows. In addition, the first step scans C rows, so a total of C+Y+1 rows need to be scanned, which is more expensive than the cost of Algorithm 1.

Of course, Random Algorithm 2 has much lower execution costs compared to directly using order by rand().

You might wonder: if we calculate based on this table having 10,000 rows, so that C = 10,000, then when the randomly chosen Y is large, the number of scanned rows is close to 20,000, which is similar to order by rand(). Why, then, do we say that Random Algorithm 2 costs much less? I’ll leave this question for you to think about after class.

Now, let’s look at how to randomly select 3 word values following the approach of Random Algorithm 2. Here’s how you can do it:

  1. Get the number of rows in the entire table and store it as C.
  2. Use the same random method to obtain Y1, Y2, and Y3.
  3. Execute three “limit Y, 1” statements to retrieve three rows of data.

We refer to this algorithm as random algorithm 3. The following code represents the sequence of statements for executing the above process.

mysql> select count(*) into @C from t;
set @Y1 = floor(@C * rand());
set @Y2 = floor(@C * rand());
set @Y3 = floor(@C * rand());
select * from t limit @Y1, 1; // Retrieve the value of Y1, Y2, and Y3 in the application code, concatenate SQL, and execute
select * from t limit @Y2, 1;
select * from t limit @Y3, 1;
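
As a sketch of what that application-side code might look like, here is a Python version of Random Algorithm 3. The pymysql driver, the connection parameters, and the table name t are assumptions made for the example only; any DB-API-style MySQL driver would be used the same way:

import random
import pymysql   # assumption: any DB-API-compatible MySQL driver would do

conn = pymysql.connect(host="127.0.0.1", user="test", password="test", db="test")
try:
    with conn.cursor() as cur:
        cur.execute("select count(*) from t")
        (c,) = cur.fetchone()

        rows = []
        for _ in range(3):
            y = int(c * random.random())                   # Y = floor(C * rand())
            cur.execute("select * from t limit %s, 1", (y,))
            rows.append(cur.fetchone())                    # note: the three Y values may occasionally collide
    print(rows)
finally:
    conn.close()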

Summary #

In today’s article, I introduced the execution process of temporary table sorting in MySQL by using the requirement of random sorting.

If you directly use “order by rand()”, this statement will require “Using temporary” and “Using filesort”, which often incurs a large query execution cost. Therefore, you should avoid using this approach when designing your system.

In the example today, we didn’t just solve the problem internally within the database, but also let the application code help concatenate SQL statements. In actual application scenarios, it is common to adhere to the practice of placing business logic in the application code, and letting the database focus on “reading and writing data” only. Therefore, the application of these methods is still quite extensive.

Lastly, I’ll leave you with a challenge.

The total number of scanned rows for the random algorithm 3 described above is C + (Y1+1) + (Y2+1) + (Y3+1). In fact, it can still be further optimized to reduce the number of scanned rows.

My question is, as a developer working on this requirement, how would you reduce the number of scanned rows? Share your solution and explain the number of scanned rows it requires in the comment section. I will discuss this question with you at the end of the next article. Thank you for reading, and please feel free to share this article with more friends to read together.

Previous Question #

In the last article, the question I left you with was about the SQL statement “select * from t where city in (‘Hangzhou’,‘Suzhou’) order by name limit 100;”. Does this SQL statement need sorting? Are there any solutions to avoid sorting?

Although there is a composite index (city, name), and the names within each single city are in increasing order, this SQL statement does not query just one city. It queries both “Hangzhou” and “Suzhou” at the same time, so the names of all rows that meet the condition are not necessarily in increasing order overall. Therefore, this SQL statement does require sorting.

How do we avoid sorting?

Here, we can make use of the characteristics of the composite index (city, name) and split this statement into two separate ones. The execution process is as follows:

  1. Execute “select * from t where city=‘Hangzhou’ order by name limit 100;”. This statement does not require sorting, and the client-side can store the results in a memory array A with a length of 100.
  2. Execute “select * from t where city=‘Suzhou’ order by name limit 100;”. Using the same method, assume the results are stored in a memory array B.
  3. Now, A and B are two sorted arrays. Then, using the merge-sort algorithm, you can obtain the smallest 100 values of names, which is the desired result.
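
As a sketch of step 3, here is what the client-side merge might look like in Python. The row shape and field names are assumptions for illustration; only the merge logic matters:

import heapq
from itertools import islice

def rows_by_name(a, b, count=100, skip=0):
    """Merge two result sets that are each already sorted by name, skip the
    first `skip` merged rows, and return the next `count` rows."""
    merged = heapq.merge(a, b, key=lambda row: row["name"])
    return list(islice(merged, skip, skip + count))

# A and B stand in for the results of the two single-city queries.
A = [{"city": "Hangzhou", "name": "name%05d" % i, "age": 30} for i in range(0, 200, 2)]
B = [{"city": "Suzhou", "name": "name%05d" % i, "age": 30} for i in range(1, 200, 2)]
top100 = rows_by_name(A, B)                       # the 100 smallest names across both cities

The same skip parameter also covers the “limit 10000, 100” variant discussed below: fetch 10,100 rows per city, merge the two sorted streams, skip the first 10,000 merged rows, and keep the next 100.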

If we change the “limit 100” in this SQL statement to “limit 10000, 100”, the handling method is similar. We need to modify the two statements as follows:

select * from t where city='Hangzhou' order by name limit 10100;

and

select * from t where city='Suzhou' order by name limit 10100.

When dealing with a large amount of data, we can simultaneously start two connections to read the results row by row. Then, using the merge-sort algorithm, we can obtain the names from the 10,001st to the 10,100th position in order, which will be the desired result.

Of course, this solution has an obvious drawback, which is that the amount of data returned from the database to the client-side increases.

Therefore, if the size of the data per row is large, consider modifying these two SQL statements to the following format:

select id, name from t where city='Hangzhou' order by name limit 10100;

and

select id, name from t where city='Suzhou' order by name limit 10100.

Then, use the merge-sort method to obtain the name and id values from the 10,001st to the 10,100th position in name order, and finally use these 100 ids to go back to the table and fetch the complete rows by id.