13 User Login: How to Judge the Cause of a Block in a Thread #

Hello, I am Gao Lou.

In this lesson, let’s continue to “play” with the user login feature. In Lesson 10, we used the login function as an example to illustrate some key points of the benchmark scenario. However, we haven’t finished optimizing it, so we will keep tinkering with it in this lesson.

User login may seem like a very ordinary feature, but its logic is by no means simple. The login process compares and verifies personal information, and that verification calls the corresponding encryption algorithms. Encryption is a class of logic with high performance demands: complex algorithms offer strong security but poor performance, while simple algorithms perform well but are less secure. This is a trade-off.

In addition, there is session storage and synchronization. In a large-scale system, whichever subsystem you access, if the other systems you call need to authenticate your identity, session information has to be synchronized to them. Business operations must also carry the corresponding session information, otherwise authentication will fail.

As you can see, the login function actually involves many business aspects, and it is not simple at all. Therefore, in this lesson, I will analyze the user login function with you and help you understand the relationship between the business logic chain and the overall TPS under pressure. At the same time, I hope you can learn to determine the cause of blocking in a thread.

Modifying the Encryption Algorithm #

Do you remember the test results of the login business in Lesson 10? TPS reached around 100 with 10 stress threads.

At the same time, in Lesson 10, after we found that the BCrypt encryption algorithm was inefficient, we discussed two optimization methods: use a faster encryption method, or remove encryption altogether. At that time, I chose to remove BCrypt directly. In this lesson, let’s try the first method and switch to MD5. There are two specific actions to take:

  • Change the encryption algorithm. Although the previous BCrypt encryption algorithm was secure, it had poor performance, so it is recommended to change it to MD5.
  • Load all users into Redis.
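To make the first action concrete, here is a minimal sketch of what an MD5-based password matcher might look like. The names here (`Md5PasswordEncoder`, `toMD5`, `matches`) are illustrative, not the project’s actual code — we only see that code later through stack traces (`MD5Util.toMD5`, `MyPasswordEncoder.matches`). The point is that a single unsalted MD5 digest is orders of magnitude cheaper per call than BCrypt’s deliberately slow key stretching, which is exactly the security-versus-performance trade-off discussed above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an MD5-based password matcher.
public class Md5PasswordEncoder {

    // Hash the raw password with MD5 and hex-encode the 16-byte digest.
    public static String toMD5(String raw) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Compare a raw password against the stored MD5 hash.
    public static boolean matches(String raw, String storedHash) {
        return toMD5(raw).equals(storedHash);
    }

    public static void main(String[] args) {
        // "abc" is the well-known RFC 1321 test vector.
        System.out.println(toMD5("abc")); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Note that in a real production system a fast unsalted hash like this is a security concession, not a recommendation; the lesson makes that trade-off deliberately for performance analysis.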

Let’s run the stress scenario again. Note that at this point we have only changed the encryption algorithm and have not yet loaded the cache. I want to make only one change at a time so that the result can be attributed to it (but keep reading, because we will eventually perform both actions above). The result is as follows:

From the above results, performance has improved slightly, but not to the level I was hoping for. I want to see a significant jump, not this mediocre gain, which is quite frustrating. So we still need to keep “tidying up” this interface and see how caching affects its performance.

Testing the Effectiveness of Caching #

In order to determine if caching has had an effect on performance optimization, we can test it using two methods:

  • Reduce the amount of parameterized data and test with only a small amount of data (please note that we are just trying it out, we are not suggesting that running the scenario with a small amount of data is correct);
  • Load the entire cache directly.

We obtained the following results:

Looking at the curve, the login interface reached about 300 TPS. However, I still don’t think that’s good enough: given the hardware resources and my past experience, it should be able to go higher.

During this analysis, we did not see the hardware resource issues described in Lesson 11 and Lesson 12; instead, the global monitoring data shows high CPU usage, which indicates that it is the business logic itself consuming CPU. Therefore, we have to start optimizing the login logic.

Update Login Logic #

After reading the source code, I have summarized the original login logic of this system:

This logic looks rather convoluted. It is understandable that the Member service calls the Auth service, but why does the Auth service have to fetch the username from the Member service? Wouldn’t it be better to query the cache or the database directly? From an architectural perspective, this design presumably exists to avoid sharing the database between services. In our optimization work, however, we have to make judgments based on the actual environment.

In our environment, we need to share the database, so the Auth service can directly use the database instead of going through the Member service. Therefore, let’s modify the login logic to the following new version, which will reduce one call.

After the modification, the TPS of the login is as follows:

From the results, TPS has indeed increased and reached over 700. Very good.

Does this mean we can end the analysis now? Not yet. We still need to identify the current bottleneck, because according to my performance philosophy, every business has a bottleneck point: no matter how much we optimize, a new bottleneck will surface somewhere until we finally exhaust the hardware resources. So, let’s continue the analysis.

Viewing the Architecture Diagram #

As usual, before analyzing performance bottlenecks, let’s first take a look at the architecture diagram to understand which services and technology components are involved in the user login interface.

From this architecture diagram, we can see that the login operation spans across three services: Gateway, Member, and Auth, and connects to two components: Redis and MySQL. Although MongoDB is shown in the diagram, it is actually not used for the login process.

After understanding this information, we can follow the analysis logic to systematically analyze the problem.

Splitting Time #

As mentioned earlier, the modified login logic results in the following TPS:

As you can see, the response time has increased to about 100ms. Therefore, we now need to determine where this time is being consumed. You may have noticed that the number of users in the image has increased to 150. This is to magnify the response time for analysis. Now let’s split this response time to find out where the problem lies.

  • Time on the Gateway service

  • Time on the Member service

  • Time on the Auth service

As we can see, the Member service consumes about 150ms and the Auth service about 60ms. We will focus on the Member service because its response time is longer. The Auth service’s 60ms is not extreme, but in my experience it is still on the long side; it would be better averaged below 50ms, so we should pay some attention to it as well.

Global Monitoring #

Our analysis logic remains unchanged: first look at global monitoring, then look at targeted monitoring. Looking at the data from the global monitoring graph below, the CPU usage of worker-7 and worker-8 is higher than the others.

Since the CPU usage of worker-7/8 is slightly higher, we need to check what services are running on these two nodes. So let’s take a look at the distribution of the pods to understand which worker node each pod is running on, in order to analyze the impact between the pods later:

It turns out that on worker-7 and worker-8, there are services running with slightly higher response times: the Auth service and the Member service. We need to analyze both of these services, but one at a time. Let’s start with the Auth service.

You might ask: why start with the Auth service? There’s no specific reason, except that its CPU usage is slightly higher. You might also wonder: other areas in the graph are also highlighted, why not pay attention to them? Let me analyze each one for you.

  • Worker-1 and worker-2 in the graph have relatively high memory usage, exceeding 70%. From experience, I rarely pay much attention to Linux memory usage unless there are a large number of page faults. After Linux hands memory to applications, much of it shows up as cache. From the operating system’s perspective that cached memory counts as used, but it may not actually be in active use, so the OS includes it when calculating available memory. Therefore, the amount of free memory at the OS level is not, by itself, enough to diagnose a problem.

  • On worker-2, we can see that the number of TCP Time Waits is close to 30,000. However, this is not an area of concern for me, as Time Wait is a normal TCP state. I only pay attention to it if there is a lack of available ports or severe memory shortage.

  • As for the upload and download bandwidth of worker-1 and worker-2, it doesn’t seem to be significant. In our internal network structure, we have reached several Gbps of bandwidth during testing, so this level of bandwidth is not enough to grab our attention.

So, we still need to focus on dealing with worker-7 and worker-8.

Since the Auth service is on worker-7 and the Member service is on worker-8, as mentioned earlier, we might as well start with the Auth service.

Targeted Analysis of the Auth Service #

Where should we start analyzing the Auth service? Well, we can take it step by step. Since it is the Auth service that is causing the high CPU usage on worker-7, we can follow the evidence chain below:

According to this evidence chain, we should start by looking at the processes. However, being a brave little fool (commonly known as: foolhardy), I went straight to the thread status, hoping to make a lucky guess based on experience. So, I opened the thread page in Spring Boot Admin:

Do you feel the sense of devastation? Such is life, full of surprises.

Trembling, I clicked on some of the red areas and saw information similar to this:

As you can see, the number of blocked threads is very large, reaching 842. In addition, the lock owner ID is 86676 and the lock owner name is thread 823.
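Incidentally, the same information that Spring Boot Admin displays here — which threads are BLOCKED, on what lock, and which thread owns it — can be obtained programmatically from the JVM via `ThreadMXBean`. The sketch below is my own illustration, not part of the system under test; the `demo` method deliberately manufactures one BLOCKED thread so the scanner has something to find.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadScanner {

    // One "thread blocked on lock held by owner" line per BLOCKED thread in this JVM.
    public static List<String> findBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                out.add(info.getThreadName() + " blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return out;
    }

    // Demo: deliberately create one BLOCKED thread, then scan for it.
    public static List<String> demo() throws InterruptedException {
        final Object monitor = new Object();
        Thread worker = new Thread(() -> {
            synchronized (monitor) { /* runs only after the monitor is released */ }
        }, "worker");
        List<String> blocked;
        synchronized (monitor) {          // hold the monitor so "worker" must wait
            worker.start();
            while (worker.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);         // wait until "worker" parks on the monitor
            }
            blocked = findBlocked();
        }
        worker.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

This is also roughly what monitoring dashboards do under the hood, which is why the blocked count and lock-owner fields above line up with what a thread dump shows.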

Let’s take a look at the stack traces of two instances and see the relationship between the locks:

-- First instance:
"http-nio-8401-exec-884" #86813 daemon prio=5 os_prio=0 tid=0x00007f2868073000 nid=0x559e waiting for monitor entry [0x00007f2800c6d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.security.Provider.getService(Provider.java:1035)
	- waiting to lock <0x000000071ab1a5d8> (a sun.security.provider.Sun)
	at sun.security.jca.ProviderList.getService(ProviderList.java:332)
.....................
	at com.dunshan.mall.auth.util.MD5Util.toMD5(MD5Util.java:11)
	at com.dunshan.mall.auth.config.MyPasswordEncoder.matches(MyPasswordEncoder.java:23)
.....................
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2(AuthController.java:46)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2$accessor$jl0WbQJB(AuthController.java)
	at com.dunshan.mall.auth.controller.AuthController$auxiliary$z8kF9l34.call(Unknown Source)
.....................
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken(AuthController.java)
.....................

-- Second instance:
"http-nio-8401-exec-862" #86728 daemon prio=5 os_prio=0 tid=0x00007f28680d6000 nid=0x553a waiting for monitor entry [0x00007f2802b8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.security.rsa.RSACore$BlindingParameters.getBlindingRandomPair(RSACore.java:404)
	- waiting to lock <0x000000071ddad410> (a sun.security.rsa.RSACore$BlindingParameters)
	at sun.security.rsa.RSACore.getBlindingRandomPair(RSACore.java:443)
.....................
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2(AuthController.java:46)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2$accessor$jl0WbQJB(AuthController.java)
	at com.dunshan.mall.auth.controller.AuthController$auxiliary$z8kF9l34.call(Unknown Source)
.....................
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken(AuthController.java)
.....................

These two stack traces were not captured at the same time, which means this BLOCKED state has been persistent. In any case, both stacks are performing encryption (MD5 hashing in the first, RSA in the second) as part of the Token logic. The thread http-nio-8401-exec-884 is in the BLOCKED state, which means some other thread is holding its lock, so naturally we look at the “waiting to lock <0x000000071ab1a5d8>” entry in the stack. With enough experience, you might immediately know what the problem is. However, as people doing performance analysis, we still need to lay out the logic.

Let me digress for a moment. When you encounter a lock problem and don’t know the specific cause, your instinct should be to print a complete stack dump and examine it, rather than clicking around in Spring Boot Admin. Why don’t I recommend the latter? Several reasons:

  • There are too many threads, visually following the logical relationships is tiring.
  • Constantly refreshing, causing eye strain.
  • I don’t like it.

Therefore, for the lock problem we encountered, the first thing to do is run jstack in the container, print the stack, download the dump, and open it with a tool.

You may ask, why not use tools like Arthas to directly view it in the container? The main reason is that the Arthas dashboard looks really exhausting when there are a lot of threads.

The following image is the stack printed by jstack, opened with a tool after downloading.

Doesn’t it feel like buying a lottery ticket? There are up to 842 threads in the BLOCKED state, yet not one of them is what I was looking for! I had hoped to catch BLOCKED threads with the description “Waiting on monitor”, but according to the thread-state statistics above, there are none. Hmph, I’m really angry.

At this point, as someone doing performance analysis, remember to pour a cup of tea, calm your mind, and quietly run jstack again and again. Here, I executed it 10 times in a row and then searched each dump for the states we care about.

Finally, the “Waiting on monitor” appears:

Looks like there’s something to play with! Next, let’s take a look at who is blocking the above threads.

First, in the corresponding stack, let’s find the stack that holds the lock. The following are the blocking relationships in the stack.

  • First stack

  • Second stack

Note that these are two separate stack files, so we need to find the corresponding waiting relationships in each of them. The following stack traces are the holders of the locks that the threads above are waiting for.

-- From the first stack file:
"http-nio-8401-exec-890" #86930 daemon prio=5 os_prio=0 tid=0x00007f28680a5800 nid=0x561d waiting for monitor entry [0x00007f2800263000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.security.Provider.getService(Provider.java:1035)
	- locked <0x000000071ab1a5d8> (a sun.security.provider.Sun)
	at sun.security.jca.ProviderList.getService(ProviderList.java:332)
	.....................
	at com.dunshan.mall.auth.util.MD5Util.toMD5(MD5Util.java:11)
	at com.dunshan.mall.auth.config.MyPasswordEncoder.matches(MyPasswordEncoder.java:23)
.....................
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2(AuthController.java:46)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2$accessor$jl0WbQJB(AuthController.java)
	at com.dunshan.mall.auth.controller.AuthController$auxiliary$z8kF9l34.call(Unknown Source)
	at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken(AuthController.java)


-- From the second stack file:
"http-nio-8401-exec-871" #86739 daemon prio=5 os_prio=0 tid=0x00007f28681d6800 nid=0x5545 waiting for monitor entry [0x00007f2801a7b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.security.rsa.RSACore$BlindingParameters.getBlindingRandomPair(RSACore.java:404)
	- locked <0x000000071ddad410> (a sun.security.rsa.RSACore$BlindingParameters)
	at sun.security.rsa.RSACore.getBlindingRandomPair(RSACore.java:443)
.....................	
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2(AuthController.java:46)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken$original$sWMe48t2$accessor$jl0WbQJB(AuthController.java)
	at com.dunshan.mall.auth.controller.AuthController$auxiliary$z8kF9l34.call(Unknown Source)
	at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
	at com.dunshan.mall.auth.controller.AuthController.postAccessToken(AuthController.java)

You can see the lock ID on the line that says “locked”. Since we have found the stack traces holding the locks in these two instances, we can find the code for these two stacks from the source code:

-- Synchronized block in the first instance:

public synchronized Service getService(String type, String algorithm) {
    checkInitialized();
    // avoid allocating a new key object if possible
    ServiceKey key = previousKey;
    if (!key.matches(type, algorithm)) {
        key = new ServiceKey(type, algorithm, false);
        previousKey = key;
    }

    if (serviceMap != null) {
        Service service = serviceMap.get(key);
        if (service != null) {
            return service;
        }
    }
    
    ensureLegacyParsed();
    return (legacyMap != null) ? legacyMap.get(key) : null;
}

-- Synchronized block in the second instance:

// return null if need to reset the parameters
BlindingRandomPair getBlindingRandomPair(
        BigInteger e, BigInteger d, BigInteger n) {

    if ((this.e != null && this.e.equals(e)) ||
        (this.d != null && this.d.equals(d))) {

        BlindingRandomPair brp = null;
        synchronized (this) {
            if (!u.equals(BigInteger.ZERO) &&
                !v.equals(BigInteger.ZERO)) {

                brp = new BlindingRandomPair(u, v);
                if (u.compareTo(BigInteger.ONE) <= 0 ||
                    v.compareTo(BigInteger.ONE) <= 0) {
                    // need to reset the random pair next time
                    u = BigInteger.ZERO;
                    v = BigInteger.ZERO;
                } else {
                    u = u.modPow(BIG_TWO, n);
                    v = v.modPow(BIG_TWO, n);
                }
            } // Otherwise, need to reset the random pair.
        }
        
        return brp;
    }
    
    return null;
}

From here, we can see that the first lock is taken inside the JDK’s getService method, which is synchronized, causing contention whenever threads look up a security provider service. This behavior is described in the JDK bug list as JDK-7092821. So it’s a known limitation rather than a bug we can patch directly; if you want to change it, consider using a different library.

The second lock is taken by the RSA method in the JDK, which is designed to prevent timing attacks. RSA involves calculations with large prime numbers, and for thread safety, locks are added. You can find the RSA logic in the source code of /sun/security/rsa/RSACore.java.

However, RSA is an inefficient encryption method, and when there is high load, the synchronized blocks like these will inevitably lead to blocking. Regarding the second lock, there is a comment in the source code suggesting calculating u and v first to improve efficiency.

Now that we know the reasons for these two locks causing blocking, the next steps are relatively simple:

  • For the first lock, implement your own method or consider using a distributed lock.
  • For the second lock, consider using a more efficient implementation.
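For the first lock, one mitigation worth knowing (short of swapping libraries) is simply not to call `MessageDigest.getInstance` on every request, because each call goes through the synchronized `Provider.getService` lookup shown above. Caching one instance per thread means the contended lookup happens only once per thread. This is my own sketch, not the project’s code; the class name `CachedMd5` is illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CachedMd5 {

    // One MessageDigest per thread: getInstance (and thus the synchronized
    // Provider.getService lookup) runs only on each thread's first use.
    private static final ThreadLocal<MessageDigest> MD5 =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("MD5");
                } catch (NoSuchAlgorithmException e) {
                    throw new IllegalStateException("MD5 not available", e);
                }
            });

    // Digest the input with the cached, reusable per-thread instance.
    public static String digestHex(String input) {
        MessageDigest md = MD5.get();
        md.reset(); // clear any previous state before reuse
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(32);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Since `MessageDigest` instances are not thread-safe, `ThreadLocal` gives each worker thread its own copy, eliminating the shared monitor without adding any locking of our own.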

At this point, we have located the BLOCKED sections in the application. Since this is a performance analysis, I will not carry out the remaining fixes here. On a real project, you could hand the problem to the development team and take a break, leaving them to figure it out. Just kidding! You shouldn’t really do that. As performance analysts, we need to present a logical, evidence-based analysis process, which makes communication with the rest of the team much more effective.

Targeted Analysis of the Member Service #

After analyzing the Auth service, let’s take a look at the performance of the Member service. Since we have already shown the global monitoring data before, we won’t repeat it here. Let’s directly analyze the response time when invoking the Member service.

  • Response time on Gateway

  • Response time on Member

  • Response time on Auth

From the information above, we can see how the time is distributed across the sections: around 80 milliseconds on the Member service and around 60 milliseconds on the Auth service, both of which are on the high side.

Let’s log in to the Member service and take a look at the overall resource usage. Here, I used the classic and traditional top command:

%Cpu0  : 63.8 us, 12.4 sy,  0.0 ni,  9.2 id,  0.0 wa,  0.0 hi, 14.2 si,  0.4 st
%Cpu1  : 60.3 us, 11.7 sy,  0.0 ni, 11.0 id,  0.0 wa,  0.0 hi, 16.6 si,  0.3 st
%Cpu2  : 59.4 us, 12.0 sy,  0.0 ni, 14.1 id,  0.0 wa,  0.0 hi, 13.8 si,  0.7 st
%Cpu3  : 59.8 us, 12.1 sy,  0.0 ni, 11.7 id,  0.0 wa,  0.0 hi, 15.7 si,  0.7 st

Looking at the distribution of CPU usage, all counters are normal except si (soft interrupts), which is a bit high; this points to a network interrupt problem. Although there is room for optimization here, the underlying infrastructure limits how much performance we can gain. This is also one reason many companies have abandoned virtualization and moved directly to containerization.

I will carefully explain this network interruption issue in future lessons, but we will not discuss it in detail in this lesson.

Summary #

In this lesson, I walked you through a complete performance analysis scenario using the login feature.

In the earlier part of the code modifications, the performance analysis process was relatively fast. We simply looked at which parts of the code logic consumed more time. This approach is what I mentioned before as the “chain of evidence” for CPU usage.

Next, when analyzing the Auth service, we started from the time breakdown and gradually moved into the code. The most important step was going from CPU to stack, and then determining whether threads were BLOCKED. When you see BLOCKED in the stack, remember to print the full stack information. However, because some locks are acquired and released very quickly, a lock that shows up in one dump may be missing from another; in such cases, keep printing the stack several times in a row until you capture the corresponding lock.

This is a key aspect of analyzing locks in a stack, because we often see the stack of the waiting thread but not of the thread holding the lock. By printing the stack several times in succession, we make sure both the held and the waiting sides are captured; otherwise we cannot analyze the logic properly.

Next, once we have identified the stack for the held lock, we need to trace back through the call hierarchy of our business code to find where the lock was acquired. Whether or not this lock acquisition is reasonable depends on the business logic. As a performance analyst, this is the time to bring together the development, business, and architecture teams for discussion. Whether or not to change the lock should be decided collectively, rather than solely by the performance analyst.

Through the above analysis, I believe you can see that in my performance analysis logic, it is important to understand both the phenomenon and the underlying principles.

Homework #

Finally, I have a few questions for you to consolidate what we have learned today.

  1. Why do we need to repeatedly display the stack information when we see the word “BLOCKED”?
  2. Why do we need to go from phenomena to principles in performance analysis?
  3. What are some optimization approaches for inefficient code?

Remember to discuss and exchange your thoughts with me in the comments section. Each reflection will help you progress further.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lecture!