Practice Sample to Run Hot Issue Q&A 3rd Phase #

Hello, I am Sun Pengfei. It’s time for Q&A again. Today, I will discuss the implementation of two samples, one from the 6th volume and one from the supplementary part, both focused on frame-freezing optimization.

The sample in the 6th volume is based entirely on Profilo, Facebook’s performance-analysis framework. Its main job is to collect atrace logs from online users. We are all familiar with atrace: the systrace tool we usually use wraps the atrace command to enable ftrace events and generates a visual HTML report by reading the ftrace buffer. By the way, ftrace is a commonly used kernel tracing and debugging tool on Linux; if you are not familiar with it, you can refer to the introduction to ftrace at the end of the 6th volume. Android’s atrace extends ftrace with some categories and tags of its own, and the sample captures the synchronous events obtained through atrace.

The implementation of the sample is actually quite simple, and there are two solutions.

The first solution is to hook the family of log-writing methods in atrace. Taking the code of Android 9.0 as an example, the code that writes ftrace logs lives in trace-dev.cpp. Since the code differs slightly between versions, some differentiation needs to be made based on the system version.

The second solution, which is the one used in the sample, relies on the fact that all atrace events are written through /sys/kernel/debug/tracing/trace_marker. During initialization, atrace stores this file’s descriptor (fd) in the global variable atrace_marker_fd, so we can easily obtain its value through dlsym. Regarding the trace_marker file, some ftrace background is needed: ftrace was originally an internal tracer designed to help kernel developers and system designers find out what is going on inside the kernel.
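The dlsym lookup can be sketched as below. This is a minimal illustration, not the sample’s exact code: on a device, `atrace_marker_fd` is a global exported by libcutils; off-device (or before atrace initializes) the symbol is absent, so the function falls back to -1.

```cpp
#include <dlfcn.h>

// Look up libcutils' global `atrace_marker_fd` via the dynamic linker.
// On an Android device this symbol holds the fd of trace_marker once
// atrace has initialized; if the symbol cannot be found (e.g. we are
// not running on Android), fall back to -1.
static int lookup_atrace_marker_fd() {
    void* sym = dlsym(RTLD_DEFAULT, "atrace_marker_fd");
    if (sym == nullptr) {
        return -1;  // symbol not present in any loaded library
    }
    return *static_cast<int*>(sym);
}
```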

From the documentation, it can be seen that the main purpose of the ftrace tool is to investigate performance issues outside of user space. However, in many scenarios we need to correlate the sequence of user-space calls with kernel events. Ftrace therefore provides a solution: the file trace_marker. Writing content to this file generates an ftrace record, so our events can be interleaved with kernel logs. But this design has a downside: every write to the file is a system call, with a context switch from user space to kernel space. Although this method is not as efficient as the kernel writing records directly, the ftrace tool is still very useful in many cases.
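For reference, the synchronous-event records that end up in trace_marker follow a simple text format: `B|<pid>|<name>` opens an event and `E` closes it (newer Android versions also append the pid to the end record; the plain `E` shown here is the older form). A small sketch of building and writing such records:

```cpp
#include <string>
#include <unistd.h>

// Build the strings that Trace.traceBegin/traceEnd ultimately write to
// trace_marker: "B|<pid>|<name>" opens a synchronous event, "E" closes it.
static std::string begin_record(int pid, const std::string& name) {
    return "B|" + std::to_string(pid) + "|" + name;
}

static std::string end_record() {
    return "E";
}

// Each call below is the user-to-kernel system call discussed above;
// on a real device, fd would be the trace_marker file descriptor.
static ssize_t emit(int fd, const std::string& rec) {
    return write(fd, rec.data(), rec.size());
}
```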

Therefore, user-space event data is written through trace_marker, or more precisely, through the write interface. So we only need to hook the write interface and filter for the content written to this fd. This solution is highly portable and can easily be implemented with a PLT Hook.
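A sketch of the replacement function such a PLT hook would install for `write()`: it records payloads aimed at the trace_marker fd and forwards everything to the original. The fd variable and the way the original pointer is obtained are assumptions here; a real hook would get them from dlsym and the PLT-hook library.

```cpp
#include <cstring>
#include <string>
#include <unistd.h>
#include <vector>

// Illustrative hook state; a real implementation fills these in at init.
static int g_trace_marker_fd = -1;                  // fd of trace_marker
static ssize_t (*g_orig_write)(int, const void*, size_t) = ::write;
static std::vector<std::string> g_captured;         // collected atrace records

// Replacement for write(): capture atrace payloads, then forward.
static ssize_t hooked_write(int fd, const void* buf, size_t count) {
    if (fd == g_trace_marker_fd) {
        g_captured.emplace_back(static_cast<const char*>(buf), count);
    }
    return g_orig_write(fd, buf, count);  // always delegate to the real write
}
```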

The next problem we will encounter is that, in order to obtain atrace logs, we need to enable atrace’s category tags. The code tells us that whether a tag is enabled is determined by atrace_enabled_tags & tag: a non-zero result means enabled, zero means disabled. Below are some of the atrace_tag values. As you can see, each tag occupies a single bit, so checking whether a tag is enabled just means checking whether the same bit is set in atrace_enabled_tags. This means that if we set every bit of atrace_enabled_tags to 1, any atrace tag will match.

```c
#define ATRACE_TAG_NEVER            0
#define ATRACE_TAG_ALWAYS           (1<<0)
#define ATRACE_TAG_GRAPHICS         (1<<1)
#define ATRACE_TAG_INPUT            (1<<2)
#define ATRACE_TAG_VIEW             (1<<3)
#define ATRACE_TAG_WEBVIEW          (1<<4)
#define ATRACE_TAG_WINDOW_MANAGER   (1<<5)
#define ATRACE_TAG_ACTIVITY_MANAGER (1<<6)
#define ATRACE_TAG_SYNC_MANAGER     (1<<7)
#define ATRACE_TAG_AUDIO            (1<<8)
#define ATRACE_TAG_VIDEO            (1<<9)
#define ATRACE_TAG_CAMERA           (1<<10)
#define ATRACE_TAG_HAL              (1<<11)
#define ATRACE_TAG_APP              (1<<12)
```
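The enabled-tag check is a one-line bit test, and the all-bits-set trick follows directly from it. A minimal sketch (the function name is mine, not atrace’s):

```cpp
#include <cstdint>

// Mirrors the check atrace performs: a tag is enabled iff its bit is
// also set in atrace_enabled_tags, i.e. (enabled_tags & tag) != 0.
static bool is_tag_enabled(uint64_t enabled_tags, uint64_t tag) {
    return (enabled_tags & tag) != 0;
}
```

With `enabled_tags` set to all ones (`~0ull`), every tag from ATRACE_TAG_ALWAYS to ATRACE_TAG_APP matches, which is exactly the "set all bits to 1" idea described above.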

Below is part of the log that I captured using atrace.

Some of you may wonder how “Begin” and “End” are matched. To answer this question, let’s first understand the scenario in which these records are generated. This log is generated in Java using the Trace.traceBegin and Trace.traceEnd methods, and there is a rigid requirement when using them: the two methods must appear in pairs, otherwise the log will contain mismatched records. Please take a look at the following system code example.

```java
void assignWindowLayers(boolean setLayoutNeeded) {
    Trace.traceBegin(Trace.TRACE_TAG_WINDOW_MANAGER, "assignWindowLayers"); // Attention: the beginning of this event.
    assignChildLayers(getPendingTransaction());
    if (setLayoutNeeded) {
        setLayoutNeeded();
    }
    // ...
    scheduleAnimation();
    Trace.traceEnd(Trace.TRACE_TAG_WINDOW_MANAGER); // Event ends here.
}
```

So we can treat the E that immediately follows a B as that event’s end marker. In many cases, however, the log contains two consecutive Bs, and it is not obvious which B each of the two following Es corresponds to. In that case we need to check which CPU and which task pid generated each event; this can be seen from the prefix at the start of each line, such as “InputDispatcher-1944”, which allows us to match them up.
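The per-thread matching just described can be sketched as a stack per task pid: push the name on each B, and pop on each E, so the popped name is the event that E closes. The input here is pre-parsed tuples; a real parser would first extract the tid and the `B|...`/`E` marker from raw ftrace lines such as `InputDispatcher-1944 ... tracing_mark_write: B|1944|deliverInputEvent` (that line is illustrative, not from the original capture).

```cpp
#include <map>
#include <stack>
#include <string>
#include <utility>
#include <vector>

struct Record { int tid; char marker; std::string name; };

// Pair B/E records per thread: one stack per task pid, push on 'B',
// pop on 'E'. Returns completed (tid, event name) pairs in the order
// the events finished.
static std::vector<std::pair<int, std::string>> match_events(
        const std::vector<Record>& records) {
    std::map<int, std::stack<std::string>> open;    // per-tid open events
    std::vector<std::pair<int, std::string>> done;  // completed events
    for (const Record& r : records) {
        if (r.marker == 'B') {
            open[r.tid].push(r.name);
        } else if (r.marker == 'E' && !open[r.tid].empty()) {
            done.emplace_back(r.tid, open[r.tid].top());
            open[r.tid].pop();
        }
    }
    return done;
}
```

Two consecutive Bs from the same thread are thereby resolved naturally: the first E closes the inner (most recent) B.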

Next, let’s take a look at the supplementary Sample. Its purpose is to let you practice monitoring thread creation and printing the Java method that created the thread. The implementation is relatively simple, mainly relying on a PLT Hook of pthread_create, the underlying function used to create threads. To complete this Sample, you need to understand how Java threads are created and how they are executed. It is worth noting that this Sample has a flaw. From the virtual machine’s perspective, threads can be divided into two types: Attached threads, which I prefer to call Managed threads following .NET terminology, and Unattached threads, which are unmanaged. Both are implemented on top of POSIX threads, so from inside pthread_create it is impossible to tell whether a thread is managed. A thread may also be created directly from native code, in which case there is no corresponding Java stack at creation time.
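The wrapper a PLT hook would install for pthread_create can be sketched as below. This is only the bookkeeping skeleton (here, a counter; the real Sample also captures the creating Java stack), and obtaining the original function pointer is an assumption in this sketch: a real hook gets it from the PLT-hook library rather than taking the symbol’s address directly.

```cpp
#include <atomic>
#include <pthread.h>

// Illustrative hook state.
static std::atomic<int> g_threads_created{0};
static int (*g_orig_pthread_create)(pthread_t*, const pthread_attr_t*,
                                    void* (*)(void*), void*) = ::pthread_create;

// Replacement for pthread_create: record the creation, then delegate.
static int wrapped_pthread_create(pthread_t* t, const pthread_attr_t* attr,
                                  void* (*fn)(void*), void* arg) {
    g_threads_created.fetch_add(1);  // bookkeeping before the real call
    return g_orig_pthread_create(t, attr, fn, arg);
}

static void* noop_thread(void*) { return nullptr; }  // trivial entry point
```

Note that, as the text explains, this interception fires for every POSIX thread, managed or not, which is exactly why the Java stack may be missing for natively created threads.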

Regarding threads, in daily monitoring we usually do not care much about thread creation itself, and threads can be distinguished simply by setting the thread name in advance. For example, if an OOM occurs during pthread_create, it usually means the current number of threads is too high; in that case we typically collect the current thread count and thread stacks at the moment of the OOM, and meaningful thread names allow us to identify the problem quickly.
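The thread count mentioned above can be read from procfs on Linux/Android. A small sketch (the function name is mine):

```cpp
#include <fstream>
#include <string>

// Read the process's current thread count from the "Threads:" field of
// /proc/self/status -- the kind of data worth dumping when pthread_create
// fails with OOM. Returns -1 if the field cannot be read.
static int current_thread_count() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("Threads:", 0) == 0) {  // line starts with the key
            return std::stoi(line.substr(8));
        }
    }
    return -1;
}
```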

For mobile apps, we are mostly concerned with the execution status of the main thread, because any time-consuming operation there affects the smoothness of the user interface, so we tend to push everything that looks slow into child threads. Although this sometimes works, it can also cause exceptional situations we rarely encounter in daily work. For example, I have seen a large number of threads waiting on I/O because of the poor I/O performance of a user’s phone, and cases where too many threads were created, resulting in heavy context switching. There are also cases where a slow method holds a lock for too long, preventing other threads from acquiring it and leading to a chain of anomalies.

Although thread monitoring is not easy, it is not impossible to achieve; it is just more complex to implement and requires attention to compatibility. For example, we may want to know how many threads are waiting on a particular lock. We first obtain the object’s mirror Object, then construct a MonitorInfo, and then read its list of waiters, which holds the threads waiting for the lock. As you can see, the process itself is not complicated, but some care is needed when computing address offsets.

Of course, there are more detailed optimizations. For example, we all know that Java converts lightweight locks into heavyweight locks; in the ART virtual machine these states are called ThinLocked and FatLocked, and the conversion is performed by the functions Monitor::Inflate and Monitor::Deflate. By monitoring the Object a monitor points to when Monitor::Inflate is called, we can determine which code is converting “thin locks” into “fat locks”, and then optimize it.

To perform that optimization, you need to understand ART’s lock-conversion mechanism. If the current lock is a thin lock and the thread already holding it attempts to acquire it again, the lock count is incremented without changing the lock state; however, if the lock count exceeds 4096, the thin lock is converted into a fat lock. If the thread entering MonitorEnter is not the one currently holding the lock, contention occurs. To reduce the creation of fat locks, ART optimizes this path: the virtual machine first calls sched_yield to give up the current thread’s execution slot, and the operating system schedules the thread to run again some time later. In the window between the sched_yield call and the subsequent execution, the thread occupying the lock may release it, so the calling thread tries to acquire the lock again. If it succeeds, the lock moves directly from the Unlocked state to the ThinLocked state, without entering the FatLocked state. This retry loop runs up to 50 times; if the lock still cannot be acquired after 50 iterations, the thin lock is inflated into a fat lock.
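The contended path just described can be modeled as a toy retry loop. This is an illustration of the mechanism, not ART’s actual code: the lock is a plain atomic flag, and the constant 50 and the result names are taken from the description above.

```cpp
#include <atomic>
#include <sched.h>

enum class LockResult { kThinLocked, kInflatedToFat };

// Toy model of ART's contended acquire: retry the thin-lock CAS up to
// 50 times, calling sched_yield() between attempts; only if all retries
// fail is the lock "inflated" to a fat lock.
static LockResult contended_acquire(std::atomic<bool>& held) {
    for (int i = 0; i < 50; ++i) {
        bool expected = false;
        if (held.compare_exchange_strong(expected, true)) {
            return LockResult::kThinLocked;  // owner released in the window
        }
        sched_yield();  // give the owner a chance to run and release
    }
    return LockResult::kInflatedToFat;       // still contended: inflate
}
```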
If we are sensitive to the performance of a particular piece of multithreaded code and want its lock to stay in the thin state for as long as possible, we can reduce the granularity of the synchronized block and minimize simultaneous contention from many threads. By monitoring calls to the Inflate function, we can measure the effect of the optimization.
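The granularity advice can be illustrated in C++ terms (the text talks about Java synchronized blocks; the function and names here are mine): do the expensive work outside the lock and hold the mutex only for the shared update, shortening the contention window.

```cpp
#include <mutex>
#include <vector>

static std::mutex g_mu;
static std::vector<int> g_shared;  // state shared between threads

// Narrow critical section: compute without the lock, then lock only
// around the shared-state update.
static int append_squared(int x) {
    int value = x * x;                        // expensive work: no lock held
    std::lock_guard<std::mutex> guard(g_mu);  // lock only the shared update
    g_shared.push_back(value);
    return value;
}
```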

Finally, some students are interested in obtaining the Java thread stacks when a crash occurs. Here, I will only explain it briefly; there will be a dedicated article on this topic later.

One solution is to indirectly achieve it using the ThreadList::ForEach interface. The specific logic can be found here. Another solution is the Unwinder mechanism in Profilo, which is implemented here, and it simulates the logic of StackVisitor to achieve it.

There were not many questions in the feedback from these two sessions, and the answers can be regarded as supplements to the main text. If you want to learn more about virtual machine mechanisms or other performance-related topics, please feel free to leave a message, and I will discuss them with you in future articles. For example, some students have asked about the detailed logic of GC under ART, and I will address this in a separate article.

Goodies #

Today, for the students who have diligently submitted their assignments and completed the exercises, we are giving away the second wave of “Study Boost Packages.” Student @Seven has submitted the assignment for Chapter 5, and we are gifting them a copy of the “Geek Calendar.” Other students, if you have completed the exercises, don’t forget to submit them through a Pull Request.

Feel free to click on “Share with friends” and share today’s content with your friends, inviting them to study together.