30 Application Development Analysis of Beijing Minibus Oil Tanker's Shaking Number Trend

30 Application Development - Analysis of Beijing Minibus Oil Tanker’s Shaking Number Trend #

Hello, I’m Wu Lei.

If you are also living in Beijing, the car license plate lottery is probably something that concerns you as well. Many people around me have been complaining, saying, “The car license plate lottery is too difficult, it seems hopeless and unpredictable. It feels even less reliable than buying lottery tickets.”

To be honest, I have been participating in the car license plate lottery myself, even with 8 times the probability, but I still haven’t been selected. Therefore, many people, including myself, want to know why the lottery is so difficult. How many times does an average person have to participate in the lottery to be selected? Is the change in the selection rate really consistent with the official announcement? Can increasing the probability actually improve the chances of being selected?

We can answer these questions by developing an application that analyzes the car license plate lottery trend in Beijing. I will guide you through the development of this application in two sessions. In this process, we can practice most of the knowledge we have learned in the previous chapters on principles, general optimization, and Spark SQL optimization. Don’t you just feel excited to hear that?

Without further ado, let’s get started!

Preparing for the Class #

Since we’re doing development, we need to do some preparation work. The preparation work is divided into three parts: preparing the data, preparing the development environment, and preparing the runtime environment.

Preparing the Data #

The data required for the application has already been prepared for you and uploaded to the cloud storage. You can click this link and enter the extraction code ajs6 to download it.

The filename of the data is “2011-2019_automobile_lottery_data.tar.gz”. After extracting it, the directory structure will be as shown in the following diagram. Under the root directory, there are two subdirectories: “apply” and “lucky”. The “apply” directory contains application numbers for each lottery batch from 2011 to 2019, while the “lucky” directory contains the application numbers that were selected in each batch. For convenience, we will refer to the people who applied for the lottery as “applicants” and the people who were selected as “lucky winners”. The next level subdirectories under “apply” and “lucky” correspond to the lottery batches, and each lottery batch directory contains Parquet-format data shards.

The “apply” and “lucky” subdirectories logically correspond to the fact table and dimension table, which can also be called the “applicant table” and “lucky winner table”. The schemas for both tables are (batchNum, carNum), which represents the lottery batch number and application number. In summary, both the fact table and dimension table are stored with partitioned designs, and the partition key is batchNum.

Preparing the Development Environment #

After downloading and extracting the data, let’s prepare the development environment. First, let’s talk about the programming language. To complete the development of the “Trend Analysis Application”, you can use any of the three languages: Python, Java, or Scala, based on your personal preference. Since I am accustomed to using Scala for development, the entire project’s code is implemented in Scala. If you are a Java or Python developer, there is no need to worry. With the explanation of the application logic and reference implementation in Scala, I believe you can complete the development of the application quickly.

The “Trend Analysis Application” is very lightweight, and the reference implementation in Scala is no more than 200 lines of code. Therefore, we can quickly implement it using a simple text editor like Sublime or Vi alone. However, to improve development efficiency and facilitate packaging and deployment of the application, I still recommend using an integrated development environment (IDE), such as IntelliJ IDEA, Eclipse, or IntelliJ PyCharm. The choice of IDE follows the same principle as the programming language - just choose the one that you are most comfortable with.

Preparing the Runtime Environment #

Lastly, we need to prepare the runtime environment. Since our application is lightweight and the data size is small, with the total size of the Parquet files after extraction being less than 4GB, you can even run the application from start to finish on your laptop or desktop PC. Choosing a “lightweight” approach is mainly to consider that many students may have difficulty setting up a distributed physical cluster, and we want to ensure that these students can participate in the hands-on practice without being limited by hardware.

However, the original intention and focus of these two classes are on performance tuning practices, and optimizing network overhead is a key part of it. Therefore, for students who have the conditions, I still encourage you to set up a distributed physical cluster or use a cloud-native distributed environment. Firstly, such a distributed environment is closer to the real situation in actual work, and secondly, the performance differences before and after tuning will be more significant, which will help you deepen your understanding of the effects and effectiveness of different tuning techniques.

I have selected three physical nodes for myself, and their resource configurations are as follows. In fact, you do not need such powerful machine configurations to run the application and perform performance comparisons. I did this mainly for the sake of execution efficiency. Because I only need to obtain the comparative results before and after tuning to demonstrate the effects and effectiveness of different tuning techniques, this configuration can reduce my waiting time.

Application Development #

After preparing the data, development environment, and execution environment, we can start developing the “Trend Analysis Application”. In order to answer various confusions about the car license plate lottery, in this application, we mainly analyze the following cases:

How many people participated in the lottery from 2011 to the end of 2019?
What is the distribution of lottery times, and what are the distributions of applicants and winners?
What is the trend of winning rate?
Has the winning rate changed significantly? How can we gain insight into it?
Are people with higher multiplier more likely to win?

Next, we will clarify the calculation logic for these cases one by one, and implement the code.

Case 1: Participant Count #

First, we need to have a basic understanding of the data. Let’s start with a simple count, which is to count the total number of participants and the number of lucky winners by the end of 2019. This requirement is straightforward, we only need to use the Parquet API to read the source file, create DataFrame, and then call count().

val rootPath: String = _

// Applicant data (due to the multiplier, the same person may have multiple numbers in each period)
val hdfs_path_apply = s"${rootPath}/apply"
val applyNumbersDF = spark.read.parquet(hdfs_path_apply)
applyNumbersDF.count
// Lucky winners data
val hdfs_path_lucky = s"${rootPath}/lucky"
val luckyDogsDF = spark.read.parquet(hdfs_path_lucky)
luckyDogsDF.count

By running this code in spark-shell or deploying it to a distributed environment, we can quickly obtain the calculation results. By the end of 2019, the total number of participants in the lottery was 381,972,118, which is 380 million people; the number of winners was 1,150,828, which is 1.15 million people. You may be curious, “Why is the total number of participants in the lottery so high?”

This is not surprising. First, the same person may participate in multiple rounds of lottery. For example, I have participated in at least 60 rounds of lottery (so hard!). Moreover, since 2016, the car license plate lottery has introduced the concept of multiplier. The design intention of the multiplier is to proportionally increase the winning probability of applicants with the increase of the number of participating rounds. That is to say, a person who has participated in 60 rounds of lottery has a higher winning probability than a person who has only participated in 10 rounds of lottery. However, the official implementation of the multiplier is relatively simple and straightforward. If you observe the batch files under the apply directory after 2016, you will find that the so-called multiplier is actually the number of copies of the application number.

It is precisely because of the above two reasons that the total number of people participating in the lottery is 380 million. If we exclude the multiplier factor, what would be the actual scale of the lottery?

val applyDistinctDF = applyNumbersDF.select("batchNum", "carNum").distinct
applyDistinctDF.count

By deduplicating and counting based on (batchNum, carNum), we can get the actual scale of the lottery, which is 135,009,819, or 135 million people. This means that from 2011 to 2019, within these 9 years, 135 million people participated in a “lottery game”, but only 1.15 million people were lucky enough to win the lottery. This shows how difficult it is to win in the lottery.

Case 2: Distribution of Lottery Times #

Next, we will further drill down and explore the distribution of lottery times for different groups, that is, how many times all applicants participated in the lottery, and how many times the winners had to participate in the lottery to win. For these two statistics, we need to eliminate the influence of the multiplier. In other words, the same application number should only keep one copy in the same batch. Therefore, we need to use the deduplicated “applicant table”: applyDistinctDF.

Scenario 1: Participants in the Lottery #

First, let’s analyze the distribution of all applicants, including the winners. According to the just introduced “business requirements”, we can quickly write the corresponding query statement.


val result02_01 = applyDistinctDF
  .groupBy(col("carNum"))
  .agg(count(lit(1)).alias("x_axis"))
  .groupBy(col("x_axis"))
  .agg(count(lit(1)).alias("y_axis"))
  .orderBy("x_axis")

result02_01.write.format("csv").save("_")


Executing the code above, we will get the calculation results shown in the following figure. 
In the figure, the x-axis represents the number of times the applicant has participated in the lottery, and the y-axis represents the corresponding number of participants. 
From 2011 to 2013, the lottery was held once a month. 
From 2014 onwards, however, the lottery was held once every two months. 
Therefore, by the end of 2019, there have been a total of 72 (12 * 3 + 6 * 6) lotteries. 
So, we can see that the values of the x-axis range from 1 to 72, with 1 representing those who have participated in the lottery once, and 72 representing those unfortunate individuals who have participated in the lottery 72 times.

![](../images/8ae349904a5343259c3cd44240d69896.jpg)

From the figure, we can easily observe that as the number of lottery draws increases, the distribution of participants gradually decreases. 
What does this mean? This means that every two months, new graduates from driving schools join the large army of lottery participants. 
If we observe the left half of the graph carefully, we will find that the number of participants in lottery draws that are multiples of 3 is often double or even more than that of their "neighbors." 
Why is that?

As mentioned earlier, starting from 2014, the lottery was held once every two months. 
Therefore, a difference of 3 in the number of lottery draws means that there is a time gap of approximately half a year between two draws. 
For example, a person who has participated in the lottery 3 times joined the forces half a year later than a person who has participated 6 times in the lottery. 
So, what does half a year mean? Let's speculate a bit. Although there are driving school graduates every month, the peak period of producing large numbers of students is during winter and summer vacations, which happen to be around half a year apart. Does my speculation make sense to you?

#### Scenario 2: Lucky Winners

Next, let's take a look at how lucky the lucky winners really are. 
To obtain the number of lottery draws for the lucky winners, we need to perform an inner join between the `applyDistinctDF` and `luckyDogsDF` tables. 
Afterwards, we can perform grouping and aggregation. The code implementation is shown in the following table.

val result02_02 = applyDistinctDF .join(luckyDogsDF.select(“carNum”), Seq(“carNum”), “inner”) .groupBy(col(“carNum”)).agg(count(lit(1)).alias(“x_axis”)) .groupBy(col(“x_axis”)).agg(count(lit(1)).alias(“y_axis”)) .orderBy(“x_axis”)

result02_02.write.format(“csv”).save("_")

Executing the above code, we will obtain the calculation results shown in the following figure. The meanings of the x-axis and y-axis are the same as in Scenario 1, representing the number of lottery draws and the corresponding distribution of participants, respectively. We can observe that as the number of lottery draws increases, the distribution of participants decreases monotonically. In other words, the more lottery draws a person has, the fewer lucky winners there are. One reason I can think of is that as the number of lottery draws increases, the number of participants decreases. This was already confirmed in Scenario 1. This is not difficult to understand – those who persistently participate in more than 60 draws are truly hardcore players. Therefore, the number of participants is naturally reduced, resulting in fewer lucky winners.

However, if we assume that applicants participate in the lottery once every two months, then we will come to a painful conclusion: lottery winners usually do not have to wait for too long. The majority of them are selected within 2-3 years, as the first half accounts for the vast majority. On the other hand, those who have to wait for more than 3 years to be selected become the minority among the lucky winners. This reminds me of a joke we used to make: “If you have enough luck, you would have been selected long ago. If it takes you more than 3 years to be selected, it means you have insufficient luck, and you shouldn’t count on the lottery in the future.”

Case 3: Change in the Winning Rate #

From the distribution of lottery draws, we can see that the trends of applicants and lucky winners are consistent. Does this mean that the ratio of the two is stable? The ratio is actually the winning rate. Next, let’s explore the change in the winning rate. To calculate the winning rate, we need to separately count the number of applicants and lucky winners for each lottery draw. We can then join and aggregate these counts. The code implementation is shown below.


// Count the number of applicants for each batch
val apply_denominator = applyDistinctDF
  .groupBy(col("batchNum"))
  .agg(count(lit(1)).alias("denominator"))

// Count the number of lucky winners for each batch
val lucky_molecule = luckyDogsDF
  .groupBy(col("batchNum"))
    .filter(col("batchNum").like("2016%"))
    .groupBy(col("applyNum"))
    .agg(max(col("luckyDrawNum")).alias("maxMultiplier"))
    .groupBy(col("maxMultiplier"))
    .agg(count(lit(1)).alias("count"))
    .orderBy("maxMultiplier")
    
    result05_01.write.format("csv").save("_")
    

![](../images/4e44e0c7e7b849b2809a5e9c08eef633.jpg)

从图中可以看出，大多数人都是以10倍的倍率中签，其他倍率的中签人数逐渐减少。可以看到，中签率虽然很低，但是幸运的中签者中，有一些人中签的倍率达到了50倍以上。在这些高倍率的中签者中，是不是大家都是“暴富”啊？我们回头再看。

#### 场景2：不同倍率下的中签率

这次我们研究的是，不同倍率下的中签率。我们可以算一下这样一个中签率表：对于每个大于等于1倍的倍率，统计一下对应倍率的中签者个数和控制总人数。这样我们就可以得到，不同倍率下的“中签个数”的占比，也就是中签率。

根据这样的逻辑，我们的代码实现如下。
    
    
    // 获取不同倍率下的中签个数和总人数，计算中签率
    val result05_02 = applyNumbersDF
    .filter(col("batchNum").like("2016%"))
    .groupBy(col("maxMultiplier"))
    .agg(count(lit(1)).alias("luckyCount"))
    .withColumn("total", sum(col("luckyCount")).over(Window.orderBy(col("maxMultiplier"))))
    .select(col("maxMultiplier"), col("total"), col("luckyCount"), round(col("luckyCount")/col("total"), 5).alias("ratio"))
    
    result05_02.write.format("csv").save("_")
    

![](../images/15c04f21e04142778f8e1f053e9b54ef.jpg)

从图中可以看出，不同倍率下的中签率呈现出明显的非线性趋势。当倍率不大于10倍时，中签率随倍率的增加而显著增加。当倍率大于10倍时，中签率在比较高水平上稳定分布。总的来说，倍率在1到10倍之间的人群中签率在0.07%到1.4%之间，而倍率在10倍以上的人群中签率更加高达2%-4%。

小得可怜的倍率，却展现出如此巨大的效应，这就是股票市值一天涨几百亿的老哥们的力量。所谓“真金不怕火炼”，理想和信念，总会有一天发光发热，硬核加油。
    .join(luckyDogsDF.filter(col("batchNum") >= "201601")
    .select("carNum"), Seq("carNum"), "inner")
    .groupBy(col("batchNum"),col("carNum"))
    .agg(count(lit(1)).alias("multiplier"))
    .groupBy("carNum")
    .agg(max("multiplier").alias("multiplier"))
    .groupBy("multiplier")
    .agg(count(lit(1)).alias("cnt"))
    .orderBy("multiplier")
     
    result05_01.write.format("csv").save("_")
    

The distribution of winners' multipliers is shown in the following figure. The x-axis represents the multipliers of the winners, more precisely, it is the maximum number of copies that a winner has in the lottery batches they participated in. The y-axis represents the distribution of the number of people. By observing the execution results, we can easily find that the multipliers of the winners show an obvious normal distribution. Therefore, from this graph we can draw a preliminary conclusion: to win the lottery, you do not need a very high multiplier. In other words, the role and contribution of the multiplier in winning the lottery is not linearly increasing.

![](../images/35dab7926d2f47e9b2aeea451de765e1.jpg)

However, similar to case 2, there is also an issue of population size here. That is to say, there are naturally fewer people with high multipliers, and therefore, the number of winners among them is also small. Therefore, we need to combine the multiplier distribution of the applicants, calculate the percentage of winners at different multipliers, in order to draw a more comprehensive conclusion about the contribution of the multiplier.

#### Scenario 2: Winning Ratio at Different Multipliers

After having a preliminary understanding of the multiplier distribution, let's calculate the winning ratio among different groups of people with different multipliers, and explore how much the multiplier itself contributes to winning. With the multiplier distribution of the winners obtained in scenario 1, we just need to calculate the multiplier distribution of the applicants, and then join and aggregate the two datasets to get the desired result.
    
    
    // Step01: Filter out the data of the applicants from 2016 to 2019, calculate the multipliers of each applicant for each round, and select the maximum multiplier in all batches as the final multiplier of the applicant. Finally, calculate the number of applicants at each multiplier.
    val apply_multiplier_2016_2019 = applyNumbersDF
    .filter(col("batchNum") >= "201601")
    .groupBy(col("batchNum"), col("carNum"))
    .agg(count(lit(1)).alias("multiplier"))
    .groupBy("carNum")
    .agg(max("multiplier").alias("multiplier"))
    .groupBy("multiplier")
    .agg(count(lit(1)).alias("apply_cnt"))
    
    // Step02: Left join the number of applicants at each multiplier with the number of winners at each multiplier, and calculate the winning ratio at each multiplier.
    val result05_02 = apply_multiplier_2016_2019
    .join(result05_01.withColumnRenamed("cnt", "lucky_cnt"), Seq("multiplier"), "left")
    .na.fill(0)
    .withColumn("ratio", round(col("lucky_cnt")/col("apply_cnt"), 5))
    .orderBy("multiplier")
     
    result05_02.write.format("csv").save("_")
    

The winning ratio at different multipliers is shown in the following figure: The x-axis represents the multipliers, and there are two y-axes. The blue bar represents the number of winners, which has the same distribution as in scenario 1; the green bar represents the winning ratio, which represents the ratio between the number of winners and the number of applicants at the same multiplier.

![](../images/835acb677d554f3baf195fea3ff07ddd.jpg)

Similar to the number of winners, the winning ratio also shows a normal distribution at different multipliers. With this additional data, we can solidify the conclusion obtained in scenario 1. That is, the contribution of the multiplier to winning is extremely limited. This conclusion explains well why it is difficult for people with high multipliers and long waiting times to win the lottery.

By analyzing the above scenarios, we have obtained answers regarding the distribution of lottery rounds, the trend of winning ratio changes, and the multiplier distribution and winning ratios.
## Summary

In today's lecture, we focused on developing a trend analysis application to answer various questions about the Beijing car lottery. This application mainly covers 5 cases, including lottery frequency distribution, changes in winning rate, significant changes in winning rate, and distribution of multipliers and winning ratio. To facilitate understanding, I summarized the problems they solve, answers, and main implementation ideas in the mind map below, you can take a look.

![](../images/18e048ce7dbf41aa9ece9fb3254e94ba.jpg)

As for the code implementation and execution results of these 5 cases, I have uploaded them to a public GitHub repository. You can get the complete content from [this address](https://github.com/wulei-bj-cn/potatoes.git).

Of course, the current code can definitely be optimized, and as for how to optimize it, I will tease you a bit and talk about it in more detail in the next lecture.
## Daily Exercise

1. If you were to implement a multiplier mechanism for the car lottery, how would you do it in a more rigorous way?
2. Based on the car lottery data from 2011 to 2019, what other interesting insights, perspectives, and cases can you think of that are worth further exploration?
3. Why do you think the contribution and impact of the multiplier on being selected are negligible?

Looking forward to seeing your thoughts and answers in the comments section. See you in the next lesson!