
31 Performance Tuning Hands-On: A Guide to Improving the Execution Performance of Your Application #

Hello, I’m Wu Lei.

In the previous lesson, we developed an application for analyzing car lottery trends and worked through 5 cases together. In today’s lesson, we will optimize the performance of these 5 cases one by one, putting the knowledge and techniques we have learned in this column to practical use.

Since the trend analysis application contains multiple cases, to make it easy to compare performance before and after optimization for each one, let’s first agree on a unified methodology for the performance comparison tests.

First of all, the performance comparison tests are conducted at case granularity, which is often called “Case By Case”. For each case, we have a baseline for comparison: the running time obtained by executing the code directly, without applying any optimization. Then, for each case, we apply one or more optimization methods, each of which has its own running time. Finally, we compare the running times of the different optimization methods horizontally against the baseline, observe the performance difference before and after optimization, and analyze the underlying reasons for the improvement (or deterioration).

Without further ado, let’s start today’s lesson directly!

Runtime Environment #

Since the tuning effect is mainly reflected by the execution time, it is necessary to explain the hardware resources and configuration settings used in the performance testing before starting the tuning. The hardware resources are shown in the table below.

To avoid waiting too long for the experiments themselves, I used fairly powerful machine resources. In fact, a laptop is enough to run the application and complete the performance comparison tests. Moreover, to leave enough headroom for the subsequent tuning, all configuration options other than the necessary runtime resource settings are kept at their default values. The specific resource configuration is shown in the table below.

In addition, since the tuning methods involve Spark 3.0 features such as AQE (Adaptive Query Execution) and DPP (Dynamic Partition Pruning), I recommend deploying the runtime environment with Spark 3.0 or above. The version I used here is Spark 3.1.1.

Next, we will review the code implementation case by case, analyze the optimization space, possible tuning methods, the effectiveness of the methods, and the performance differences with the baseline for each of the 5 cases.

Performance Optimization of Case 1: Number Statistics #

First, let’s review Case 1. The intention of Case 1 is to count the total number of applicants, the number of lucky winners, and the total number of applicants after removing the effect of the lottery multiplier. The code is shown below.

val rootPath: String = _

// Applicant data (due to the multiplier, the same person may have multiple numbers for each round)
val hdfs_path_apply = s"${rootPath}/apply"
val applyNumbersDF = spark.read.parquet(hdfs_path_apply)
applyNumbersDF.count

// Lucky winners data
val hdfs_path_lucky = s"${rootPath}/lucky"
val luckyDogsDF = spark.read.parquet(hdfs_path_lucky)
luckyDogsDF.count

// Applicant data (without the effect of the multiplier)
val applyDistinctDF = applyNumbersDF.select("batchNum", "carNum").distinct
applyDistinctDF.count

From the code above, we can see that these few lines contain 3 Actions: count operations on 3 different datasets, which trigger 3 Spark Jobs. The first 2 Jobs simply read the data sources and count immediately, leaving no room for optimization. The third Job deduplicates applyNumbersDF and then counts. From our earlier walkthrough of the cases, we know that these 3 datasets (applyNumbersDF, luckyDogsDF, and applyDistinctDF) are referenced repeatedly in the subsequent cases.

Because these 3 datasets are referenced so frequently, we can conclude, even without computing the “percentage of runtime cost”, that using Cache will definitely help improve execution performance.

General principles of using Cache:

  • If the reference count of an RDD/DataFrame/Dataset in the application is 1, do not use Cache.
  • If the reference count is greater than 1 and the percentage of runtime cost exceeds 30%, consider enabling Cache.

Therefore, for the third Job, we can use the Cache mechanism to improve execution performance. The optimization is simple: just add the single line applyNumbersDF.cache before applyNumbersDF.count, as sketched below.
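Here is a minimal sketch of the optimized Case 1; everything except the one added line is the original code, and the full baseline/optimized pairs are in Table 1 below.

applyNumbersDF.cache   // the added line: mark the dataset for in-memory caching
applyNumbersDF.count   // this action materializes the cache

// The dedup-and-count Job now reads applyNumbersDF from memory instead of Parquet
val applyDistinctDF = applyNumbersDF.select("batchNum", "carNum").distinct
applyDistinctDF.count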

Since the focus of this case’s performance comparison is the third Job, to facilitate a horizontal comparison we need to strip out the irrelevant Jobs and code. The cleaned-up baseline and optimized code can be found in the table below.

Table 1

Next, we package, deploy, and execute these two sets of code separately, and record the execution time of the Job applyDistinctDF.count to complete the performance comparison test. The execution results are recorded in the table below.

Table 2

From the table, we can see that the optimized execution performance is 20% better than the baseline. Honestly, this improvement is expected: the baseline count Job pays for disk I/O, while the optimized count Job fetches the data directly from memory.

Performance Optimization of Case 2: Distribution of Lottery Times #

Next, let’s analyze Case 2, which is divided into two scenarios. The first scenario analyzes the distribution of the number of lottery batches that applicants have participated in. The second is similar but focuses on the lottery winners, answering questions like “How many rounds does a winner typically participate in before winning?”.

Scenario 1: Applicants Participating in the Lottery #

Let’s review the code of Scenario 1. Studying it carefully, we can see that Scenario 1 is a typical single-table Shuffle workload with two Shuffles in total: the first groups and counts by the “carNum” column, and the second groups and counts again by the “x_axis” column.

val result02_01 = applyDistinctDF
    .groupBy(col("carNum"))
    .agg(count(lit(1)).alias("x_axis"))
    .groupBy(col("x_axis"))
    .agg(count(lit(1)).alias("y_axis"))
    .orderBy("x_axis")

result02_01.write.format("csv").save("_")

Therefore, the computation in Scenario 1 is essentially a Word Count, except that the “word” of the first count is “carNum” and that of the second is “x_axis”. So, what optimization ideas do we have for this “Word Count”?

In the lecture on configuration optimization, we introduced common methods for Shuffle optimization, such as adjusting the read/write buffer sizes and bypassing the sort operation. In addition, keep in mind that a Shuffle is, in essence, a redistribution of data, so wherever there is a Shuffle we should pay attention to the data distribution and consciously merge data partitions that are too small. Furthermore, we noted in Case 1 that applyNumbersDF, luckyDogsDF, and applyDistinctDF are referenced repeatedly in the subsequent cases, so adding Cache to applyDistinctDF is also a logical move.

With so many optimization ideas, and in order to demonstrate the performance gain of each method, I will optimize Scenario 1 along three directions: common Shuffle optimizations, data partition merging, and adding Cache. Note that the purpose of doing this is, first, to broaden our optimization thinking and, second, to review the optimization techniques we learned earlier.

Of course, in actual work we generally don’t have the time and energy to try one method after another like this. The most efficient approach is to follow the optimization methodology we have always emphasized: find the shortest plank of the “bucket” first, eliminate that bottleneck to resolve the main contradiction, and then, if time and resources permit, deal with the next-shortest plank.

So the question is: which plank of the “bucket” does each of the three optimization ideas address? Which of these planks is the shortest? Which optimization technique would you prioritize? With these questions in mind, let’s optimize Scenario 1 idea by idea.

Idea 1: Common Shuffle Optimizations #

Just now, we mentioned two types of common Shuffle optimizations: bypassing the sort operation and adjusting the read/write buffers. Bypassing the sort has two prerequisites: the computation logic must involve neither aggregation nor sorting, and the Reduce-phase parallelism must be lower than the parameter spark.shuffle.sort.bypassMergeThreshold. Scenario 1 obviously does not qualify, since its computation includes both aggregation and sorting. So we can only adjust the read/write buffers.

In fact, optimizing the read/write buffers also has a prerequisite: the buffers consume the Execution Memory area, so increasing their size requires that Execution Memory has room to spare. Since we are using powerful hardware and the overall car lottery dataset is relatively small, we do have some “resources” left for enlarging the read/write buffers. Specifically, we need to adjust the following two configuration parameters:

  • spark.shuffle.file.buffer, buffer size for writing data in the Map phase
  • spark.reducer.maxSizeInFlight, buffer size for reading data in the Reduce phase

Since the read/write buffers are allocated per task, we need to be careful when increasing these two parameters. Generally, a 50% increase over the defaults is a good start. The table below compares the baseline and the optimized settings.
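For concreteness, here is one way these settings might be passed in. This is only a sketch assuming a 50% bump over the defaults; the app name is hypothetical, and the exact values used in the experiment are the ones in the table.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lottery-trend-analysis") // hypothetical app name
  .config("spark.shuffle.file.buffer", "48k")     // Map-phase write buffer, default 32k
  .config("spark.reducer.maxSizeInFlight", "72m") // Reduce-phase read buffer, default 48m
  .getOrCreate()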

I recorded the running times of the two comparative experiments in the table below. From the table we can see that adjusting these two parameters has little effect on the end-to-end execution performance of the job. This situation, where parameter tuning fails to noticeably improve execution efficiency, may feel familiar. The best response is to keep looking for the shortest plank of the “bucket”: identify the bottleneck and, following our optimization methodology, try other ideas.

Idea 2: Data Partition Merging #

Next, let’s discuss the second idea, data partition merging. First, let’s analyze whether Scenario 1 actually suffers from data partitions that are too small. To facilitate the analysis, let’s review the code again. Since the computation in Scenario 1 is based on the dataset applyDistinctDF, answering that question requires two inputs: the storage size of applyDistinctDF and the Reduce-phase parallelism of the Shuffle computation.

val result02_01 = applyDistinctDF
    .groupBy(col("carNum"))
    .agg(count(lit(1)).alias("x_axis"))
    .groupBy(col("x_axis"))
    .agg(count(lit(1)).alias("y_axis"))
    .orderBy("x_axis")

result02_01.write.format("csv").save("_")

The parallelism is determined by the configuration parameter spark.sql.shuffle.partitions, which has a default value of 200, meaning there are 200 data partitions. To estimate the storage size of the dataset, we need to use the following function.

import org.apache.spark.sql.{DataFrame, SparkSession}

def sizeNew(func: => DataFrame, spark: => SparkSession): String = {
  val result = func
  // Take the logical plan and let the optimizer estimate the in-memory size
  val lp = result.queryExecution.logical
  val size = spark.sessionState.executePlan(lp).optimizedPlan.stats.sizeInBytes
  "Estimated size: " + size / 1024 + "KB"
}

Given a DataFrame, the sizeNew function returns Spark’s estimate of the dataset’s size in memory. Calling sizeNew with applyDistinctDF as the argument yields an estimate of 2.6 GB. Dividing the dataset size by the parallelism gives the storage size of each Reduce-phase data shard: about 13 MB (2.6 GB / 200). Generally speaking, data shards should be around 200 MB, so a 13 MB shard is obviously too small.
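As a quick usage sketch (the figures are the ones reported above):

println(sizeNew(applyDistinctDF, spark)) // ~2.6 GB per the estimate above
val parallelism = spark.conf.get("spark.sql.shuffle.partitions").toInt // 200 by default
// 2.6 GB / 200 partitions ≈ 13 MB per Reduce-phase shard, well below the ~200 MB sweet spot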

As mentioned in the lecture on scheduling systems (lecture 5), if the data shards to be processed are too small, the task scheduling overhead will become significant compared to the data processing, leading to reduced CPU utilization and degraded execution performance. Therefore, in order to improve CPU utilization and overall execution efficiency, we need to merge the excessively small data shards. At this point, the automatic partition merging feature of AQE can help us with this task.

However, to make full use of AQE’s automatic partition merging, we also need to adjust the relevant configuration settings. The table below shows how they are set for Scenario 1.

Once the AQE mechanism is enabled, automatic partition merging is activated along with it. Two settings in the table deserve special attention: the minimum number of partitions (minPartitionNum) and the target size after merging (advisoryPartitionSizeInBytes).

Let’s first look at the minimum number of partitions: the number of partitions after merging cannot fall below this setting. Since the cluster configuration has 6 Executors, to ensure that every CPU stays busy we can set minPartitionNum to 6.

Next is the target size after merging. As just mentioned, the empirical value for partition size is around 200 MB, so we can set advisoryPartitionSizeInBytes to 200 MB. To compare the impact of different partition sizes on execution performance, though, we can run several experiments. A sketch of these settings follows.
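Here is a sketch of the AQE settings discussed above; the values follow the text (6 Executors, a ~200 MB target), and the keys are the standard Spark 3.x configuration items.

spark.conf.set("spark.sql.adaptive.enabled", "true") // enabling AQE activates partition merging
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "6") // never merge below the Executor count
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "200MB")   // target size after merging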

The comparison of the experimental results before and after the configuration adjustment is shown in the table below. The runtime after tuning is somewhat lower, indicating that partition merging does help CPU utilization and overall execution performance. Observing the table carefully, we can draw at least 3 insights.

  • If the degree of parallelism is too high and the data shards are too small, the CPU scheduling overhead increases and execution performance worsens.
  • When the shard size is around 200 MB, execution performance is often optimal.
  • If the degree of parallelism is too low and the data shards are too large, the CPU cost of processing each shard grows and execution performance also declines.

Idea 3: Adding Cache #

The last idea is to add Cache. This optimization technique is very simple to use, and we already demonstrated it in Case 1, so here I directly provide the optimized code and the results.

![](../images/2f5efbdb3b934dd293a83928ac9227fa.jpg)
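For reference, here is a minimal sketch of the cached variant, assuming the same query as the Scenario 1 baseline:

import org.apache.spark.sql.functions._

applyDistinctDF.cache  // cache the dataset that both Shuffles read from
applyDistinctDF.count  // an action to materialize the cache

val result02_01 = applyDistinctDF
    .groupBy(col("carNum")).agg(count(lit(1)).alias("x_axis"))
    .groupBy(col("x_axis")).agg(count(lit(1)).alias("y_axis"))
    .orderBy("x_axis")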

As we can see, using the Cache mechanism for optimization significantly improves the performance of task execution.

[Table: running times before and after adding Cache]

So far, we have tried three optimization methods on Scenario 1: adjusting the Shuffle read/write buffers, merging data partitions, and adding Cache. The first targets the number of disk and network requests during the Shuffle; the second improves the CPU utilization of the Reduce phase; the third addresses the repeated scanning and recomputation of the on-disk dataset.

In fact, no quantitative analysis is needed: even qualitatively, we can see that repeatedly scanning and recomputing the dataset carries the highest cost. Therefore, in practical work, for similar “multiple-choice questions”, we naturally prioritize the third method, the one that removes the real bottleneck.

Scenario 2: Lucky Winners #

Having optimized the single-table Shuffle in Scenario 1, let’s now look at Scenario 2, whose business goal is to obtain the distribution of draw frequencies among the lucky winners. First, let’s review the code for Scenario 2. Its computation involves a data association (join), two group-and-aggregate operations, and a final sort. Except for the join, the steps are identical to Scenario 1’s. Therefore, the optimization of Scenario 2 focuses on the first step, the data association; the subsequent steps can be tuned with the same methods used in Scenario 1.

val result02_02 = applyDistinctDF
    .join(luckyDogsDF.select("carNum"), Seq("carNum"), "inner")
    .groupBy(col("carNum")).agg(count(lit(1)).alias("x_axis"))
    .groupBy(col("x_axis")).agg(count(lit(1)).alias("y_axis"))
    .orderBy("x_axis")
     
result02_02.write.format("csv").save("_")

The two tables involved in the join are applyDistinctDF and luckyDogsDF. applyDistinctDF holds the deduplicated application data, and luckyDogsDF holds the application numbers and batch numbers of the lucky winners. applyDistinctDF contains about 135 million records, while luckyDogsDF contains only about 1.15 million. Clearly, this join is the “big table joins small table” scenario commonly seen in data warehouses.

When we encounter a “big table joins small table” scenario, the first optimization we should consider is broadcast variables. After all, we have repeatedly emphasized the advantages and benefits of Broadcast Joins. Let me stress again: you must master the technique of using broadcast variables to optimize data associations. It is no exaggeration to say that broadcast variables are the most cost-effective tuning technique there is.

To use broadcast variables to optimize the association calculation between applyDistinctDF and luckyDogsDF, we need to do two things. The first is to estimate the storage size of the luckyDogsDF table in memory. The second is to set the broadcast threshold configuration item spark.sql.autoBroadcastJoinThreshold.

To estimate the size of the distributed dataset, we again use the sizeNew function. Passing luckyDogsDF as the argument, the estimate returned is 18.5 MB. With this reference value we can set the broadcast threshold: to convert the join between applyDistinctDF and luckyDogsDF into a Broadcast Join, the threshold just needs to exceed 18.5 MB, so we can set the parameter to 20 MB, as sketched below.
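The two steps, in a sketch (the 18.5 MB estimate is the one reported above):

println(sizeNew(luckyDogsDF, spark)) // step 1: estimate the small table, ~18.5 MB
// Step 2: raise the broadcast threshold (default 10 MB) above the estimate,
// so the planner compiles the join into a Broadcast Join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "20MB")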

[Table: broadcast threshold setting before and after tuning]

I recorded the experimental results before and after adjusting the configuration item in the following table. Clearly, compared to the default Shuffle Sort Merge Join implementation mechanism, Broadcast Join performs better in terms of execution performance.

[Table: running times of Shuffle Sort Merge Join vs. Broadcast Join]

Performance Optimization for Case 3: Trend of Winning Rate #

The business goal of Case 3 is to understand the trend of the winning rate. Let’s review the code first. Calculating the winning rate takes two steps. In the first step, we group applyDistinctDF and luckyDogsDF by batchNum and count the applicants and winners in each group separately. In the second step, we join the two result sets on batchNum and divide the winner count by the applicant count to obtain the winning rate of each batch.

// Count the number of applicants per batch
val apply_denominator = applyDistinctDF
    .groupBy(col("batchNum"))
    .agg(count(lit(1)).alias("denominator"))

// Count the number of winners per batch
val lucky_molecule = luckyDogsDF
    .groupBy(col("batchNum"))
    .agg(count(lit(1)).alias("molecule"))

val result03 = apply_denominator
    .join(lucky_molecule, Seq("batchNum"), "inner")
    .withColumn("ratio", round(col("molecule")/col("denominator"), 5))
    .orderBy("batchNum")

result03.write.format("csv").save("_")

Since there are a total of 72 lottery batches from 2011 to 2019, the result sets of the first step, apply_denominator and lucky_molecule, each have 72 records. Obviously, there is no room for optimization for such small datasets.

Therefore, for Case 3, the key to optimization lies in the two single-table Shuffles involved in the first step. We have conducted a detailed analysis and explanation of optimization ideas and techniques for single-table Shuffles in Scenario 1 of Case 2. Therefore, the optimization of the Shuffle for applyDistinctDF and luckyDogsDF tables is left as an exercise for you to practice after class.

Performance Optimization of Case 4: Local Insights into Lottery Winning Rate #

Unlike Case 3, Case 4 only focuses on the change trend of lottery winning rate in 2018. Let’s first review the code implementation of Case 4.

// Filter out the lottery winning data in 2018 and count the number of winners by batch
val lucky_molecule_2018 = luckyDogsDF
    .filter(col("batchNum").like("2018%"))
    .groupBy(col("batchNum"))
    .agg(count(lit(1)).alias("molecule"))

// Join the filtered winning data with the applicant counts by batch and calculate the winning rate for each period
val result04 = apply_denominator
    .join(lucky_molecule_2018, Seq("batchNum"), "inner")
    .withColumn("ratio", round(col("molecule")/col("denominator"), 5))
    .orderBy("batchNum")

result04.write.format("csv").save("_")

From the code, the only change in Case 4 compared to Case 3 is the added filter on the lottery batch number, filter(col("batchNum").like("2018%")), applied to luckyDogsDF before the counting. You might say, “The change in Case 4 seems minor; its optimization potential and methods should be similar to Case 3’s.” Well, not exactly. Do you remember the DPP feature introduced in Spark 3.0? The seemingly insignificant filter predicate added to the luckyDogsDF table in Case 4 is exactly what makes DPP applicable.

In the DPP lecture, we introduced three prerequisites for enabling DPP:

  • The fact table must be a partitioned table, and the partitioning field(s) (can be multiple) must include the join key.
  • DPP only supports equi-joins and does not support inequality relationships such as greater than or less than.
  • The data set after filtering the dimension table must be smaller than the broadcast threshold. Therefore, you need to adjust the spark.sql.autoBroadcastJoinThreshold configuration.
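For reference, here is a sketch of the settings that matter for DPP in Spark 3.x; DPP is on by default, and the threshold value below is illustrative:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") // default: true
// The filtered dimension table (6 records for 2018) must fit under the broadcast
// threshold; the 10 MB default is already ample here.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")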

So, how do these three prerequisites affect the performance optimization of Case 4?

First, in the previous lecture, we introduced the directory structure of the lottery data. The data in the apply and lucky directories are stored in partitions based on the batchNum column. Therefore, both applyDistinctDF and luckyDogsDF, which are involved in the join calculation, are partitioned tables, and the partition key batchNum happens to be the join key. Secondly, the join calculation in Case 4 is clearly an equi-join.

Finally, we just need to ensure that the result set of lucky_molecule_2018 is smaller than the broadcast threshold to trigger the DPP mechanism. In 2018, there were only 6 lottery draws, which means that the lucky_molecule_2018 resulting from the group counting only has 6 records. Such a small “data set” can easily fit into a broadcast variable.

As a result, Case 4 meets all the prerequisites of DPP. By utilizing the DPP mechanism, we can reduce the amount of data scanned in applyDistinctDF and thus improve the overall execution performance of the job.

Performance Comparison

The core effect of DPP is to reduce the disk scan volume of the fact table applyDistinctDF. Therefore, the optimization for Case 4 is very simple: just remove the Cache added to applyDistinctDF in the original case, as shown on the right side of the table above. At the same time, to be fair, the benchmark should include not only the test case with DPP disabled but also the test case where applyDistinctDF is cached. Directly comparing DPP’s disk-read efficiency with the Cache’s memory-read efficiency also deepens our understanding of the DPP mechanism.

Executing these two test cases yields the results below. As we can see, compared to the benchmark, Case 4 achieves an end-to-end performance improvement of nearly 5x with the help of the DPP mechanism. This shows that scanning only the pruned partitions from disk under DPP can be far more efficient than reading the entire cached dataset from memory.

Execution Performance Comparison

Performance Optimization of Case 5: MPR Analysis #

Case 5 also includes two scenarios. The business goal of Scenario 1 is to count the number of winners at each MPR (multiplier). Scenario 2 is slightly different: its purpose is to calculate the winning ratio at each multiplier.

Although the calculation logic of the two scenarios is different, the optimization ideas and methods are the same. Therefore, in Case 5, we only need to discuss, analyze, and compare the performance optimization of scenario 1. Let’s first review the code implementation of scenario 1.

val result05_01 = applyNumbersDF
    .join(luckyDogsDF.filter(col("batchNum") >= "201601")
    .select("carNum"), Seq("carNum"), "inner")
    .groupBy(col("batchNum"), col("carNum"))
    .agg(count(lit(1)).alias("multiplier"))
    .groupBy("carNum")
    .agg(max("multiplier").alias("multiplier"))
    .groupBy("multiplier")
    .agg(count(lit(1)).alias("cnt"))
    .orderBy("multiplier")
 
result05_01.write.format("csv").save("_")

After carefully studying the code, we find that the calculation in Scenario 1 consists of the following steps:

  • A join between a large table and a small table, with a filter condition applied to the small table.
  • A grouped count over the batchNum and carNum columns, yielding each person’s multiplier.
  • Taking the maximum multiplier for each carNum.
  • A grouped count over the multiplier column.

Among these four steps, the join calculation involves the largest amount of data scanning and processing. Therefore, this step is the key to the efficiency of Case 5 execution. In addition, apart from the join calculation step, the other three steps fall into the category of optimization for single-table Shuffle, and the optimization techniques used in scenario 1 of Case 2 can be applied here as well.

Therefore, for the performance optimization of Case 5, we focus on the first step, which is the join calculation between applyNumbersDF and luckyDogsDF.

Upon careful observation of the join in the first step, we find that the join key is carNum, which is not the partition key of either applyNumbersDF or luckyDogsDF. Therefore, we cannot use the DPP mechanism to optimize this join query.

However, the inner join between applyNumbersDF and luckyDogsDF is a typical “big table joins small table” scenario, and in such cases we have at least two ways to turn the inefficient SMJ (Shuffle Sort Merge Join) into the efficient BHJ (Broadcast Hash Join).

The first method is to verify that the in-memory storage size of the original luckyDogsDF dataset is below the broadcast threshold, so that Spark SQL’s static optimization converts the SMJ into a BHJ at planning time. The second method is to ensure only that the filtered luckyDogsDF is below the threshold, in which case Spark SQL’s AQE mechanism dynamically demotes the SMJ to a BHJ at runtime.

Next, we will optimize with these two methods and compare their performance against each other and against the baseline. In Scenario 2 of Case 2, we measured the in-memory storage size of luckyDogsDF at 18.5 MB. Therefore, by adjusting spark.sql.autoBroadcastJoinThreshold appropriately, we can switch flexibly between the two optimization methods, as sketched below.
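A sketch of how the threshold toggles between the two methods (the values are illustrative; luckyDogsDF is ~18.5 MB in memory):

// Method 1 (static SMJ to BHJ): threshold above the unfiltered 18.5 MB,
// so Spark SQL plans a Broadcast Hash Join at compile time.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "20MB")

// Method 2 (dynamic SMJ to BHJ): keep the threshold below 18.5 MB but enable AQE;
// at runtime, AQE sees that the filtered side is small enough and re-plans the join.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")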

Comparison of Join Execution Strategies

Running the three test cases, SMJ is undoubtedly the worst in execution efficiency, while AQE’s dynamic optimization falls between SMJ and Spark SQL’s static transformation. After all, AQE’s join strategy adjustment is a “better late than never” mechanism: by the time it demotes the SMJ to a BHJ, the Shuffle of the two participating tables has already been partly executed, so its efficiency is necessarily worse than Spark SQL’s static optimization. Nevertheless, the dynamically adjusted BHJ is still far more efficient than the default SMJ, which is precisely the value of the AQE optimization mechanism.

Comparison of Execution Efficiency

Conclusion #

In today’s lecture, we combined the knowledge and optimization techniques we learned earlier and, using the car lottery application as an example, performed performance optimization case by case. The techniques involved include adjusting the Shuffle read/write buffers, adding Cache, estimating the storage size of a dataset, Spark SQL static optimization, AQE dynamic optimization (automatic partition merging and join strategy adjustment), and the DPP mechanism. To facilitate comparison, I have summarized them in a mind map. Still, I hope you will summarize them yourself; that way you will remember them better.

Finally, I must say it is a pity that we could not put every optimization technique from the column into practice in this case study. This is mainly because the car lottery application is relatively simple and cannot cover all the computation and optimization scenarios. The techniques that did not get exercised here, you will have to practice in your daily work.

However, the comment section of the column and our reader group will always be open to you. Although I may not be able to reply immediately, I can promise that for your comments, I will only be late, never absent!

Daily Exercise #

  1. Can you refer to scenario 1 of Case 2 and optimize the single-table Shuffle for the applyDistinctDF and luckyDogsDF tables in Case 3?
  2. Can you refer to scenario 1 of Case 5 and apply optimization techniques such as AQE and Broadcast Join to optimize the performance of scenario 2 in Case 5?

Looking forward to seeing your optimization results in the comments, and feel free to ask any questions. See you in the next lesson!