35 Native Hook Technology: Angel or Demon #

Readers who have followed this column should be familiar with the term “hook”; I have mentioned it many times in previous issues. Still, many students may be unsure what a hook really is. So today, let’s start from scratch and learn what a hook is.

As the name suggests, a hook “hooks” onto a call. It intercepts the invocation of a certain API function in a process and redirects the execution flow of that API to a code snippet we implement, thus achieving the functionality we want. This functionality can range from monitoring and fixing system vulnerabilities to hijacking or other malicious activities.

I believe that many beginners may find hooking technology quite mysterious and feel that it can only be mastered by a few experts or hackers. Is hooking really that difficult to grasp? I hope that today’s article can dispel your concerns.

Different Schools of Native Hook #

For the Native Hook technology, we are familiar with three different schools: GOT/PLT Hook, Trap Hook, and Inline Hook. Now let’s discuss the implementation principles and pros and cons of these Hook technologies one by one.

1. GOT/PLT Hook

In Chapter06-plus, we used PLT Hook technology to retrieve the stack when a thread is created. Let’s review the entire process. We replace the external function pthread_create in libart.so with our own function pthread_create_hook.

As you can see, the main purpose of GOT/PLT Hook is to replace external calls of a certain shared library with our target function. GOT/PLT Hook can be considered a very classic Hook method. It is highly stable and can meet the standards for deployment in production environments.

But what is the underlying principle of the GOT/PLT Hook implementation? First, you need to have an understanding of the ELF file format of the shared library and the dynamic linking process.

ELF Format

ELF (Executable and Linkable Format) is the executable and linking format used by most executable files on various UNIX systems. Although ELF files come in three different types (relocatable, executable, and shared object), they have a unified structure, as shown in the following diagram.

There are numerous articles online discussing the ELF file format. You can refer to ELF File Format Explanation (in Chinese) for more information. For the sake of the GOT/PLT Hook, we mainly focus on the following sections in ELF files:

.plt. This section contains the Procedure Linkage Table (PLT).
.got. This section contains the Global Offset Table (GOT).

We can also use readelf -S to check the specific information of an ELF file.
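
To make the three file types a little more concrete, here is a minimal sketch in C that checks the ELF magic bytes and reads the type field from a header held in memory. It assumes a Linux or Android toolchain that ships <elf.h> and a file whose byte order matches the host; the helper name elf_kind is my own and not part of any library.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Return a label for the ELF type of an image in memory, or NULL if the
 * buffer does not start with a valid ELF identification. */
static const char *elf_kind(const unsigned char *image, size_t size) {
    assert(image != NULL);
    if (size < sizeof(Elf32_Ehdr) || memcmp(image, ELFMAG, SELFMAG) != 0)
        return NULL;

    /* e_type follows the 16-byte e_ident array in both the 32-bit and the
     * 64-bit header, so one read works for both layouts. */
    Elf32_Half type;
    memcpy(&type, image + EI_NIDENT, sizeof type);

    switch (type) {
    case ET_REL:  return "relocatable";
    case ET_EXEC: return "executable";
    case ET_DYN:  return "shared object";
    default:      return "other";
    }
}
```

A .so file read into a buffer would normally be reported as "shared object" by this sketch.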

Dynamic Linking Process

Next, let’s take a look at the dynamic linking process. When we need to use a shared library (a .so file), we need to call dlopen("libname.so") to load the library.

After we call dlopen("libname.so"), the system first checks the list of ELF files that are already loaded. If the library is not in the list, the loading process runs. If it is already loaded, its reference count is increased by one and no further loading is done. When loading, the system reads the libraries that libname.so depends on from its dynamic section and adds them to the loading list, applying the same logic to each dependency that is not already loaded.
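
The bookkeeping described above can be sketched as a toy model in C. This is not the real linker code: the structures lib_spec and lib_entry, the function toy_dlopen, and the fixed table sizes are all hypothetical names I made up to show the "already loaded, bump the count" and "otherwise load it and its dependencies" logic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LIBS 16

/* What a library "file" declares: its name and its NEEDED entries. */
struct lib_spec  { const char *name; const char *needed[4]; };
/* What the toy loader remembers about a loaded library. */
struct lib_entry { const char *name; int refcount; };

static struct lib_entry g_loaded[MAX_LIBS];
static size_t g_loaded_count = 0;

static struct lib_entry *find_loaded(const char *name) {
    for (size_t i = 0; i < g_loaded_count; i++)
        if (strcmp(g_loaded[i].name, name) == 0)
            return &g_loaded[i];
    return NULL;
}

/* Returns the new reference count, or -1 if the library cannot be loaded. */
static int toy_dlopen(const struct lib_spec *specs, size_t n, const char *name) {
    assert(specs != NULL && name != NULL);

    struct lib_entry *hit = find_loaded(name);
    if (hit != NULL)
        return ++hit->refcount;            /* already loaded: count only */

    const struct lib_spec *spec = NULL;
    for (size_t i = 0; i < n; i++)
        if (strcmp(specs[i].name, name) == 0)
            spec = &specs[i];
    if (spec == NULL || g_loaded_count == MAX_LIBS)
        return -1;

    /* Register first so a dependency cycle cannot recurse forever. */
    g_loaded[g_loaded_count].name = spec->name;
    g_loaded[g_loaded_count].refcount = 1;
    g_loaded_count++;

    /* Apply the same logic to every dependency. */
    for (size_t i = 0; i < 4 && spec->needed[i] != NULL; i++)
        if (toy_dlopen(specs, n, spec->needed[i]) < 0)
            return -1;
    return 1;
}
```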

You can use the following command to view the dependencies of a library:

readelf -d <library> | grep NEEDED

Now let’s briefly understand how the system loads ELF files.

Read the program header table of the ELF file and mmap all the segments of type PT_LOAD into memory.
Read various information items from the .dynamic section, calculate and save the virtual addresses of all sections, and perform relocation operations.
Finally, the ELF file is successfully loaded and the reference count is increased by one.

However, there is a key point here. The addresses recorded in an ELF file are not the addresses the code will finally run at; they only describe the file’s own layout. To run the code in a process, relocation is required. This is actually a complex problem, because the final values depend on the CPU architecture and on where and in what order libraries are loaded, so they have to be calculated at runtime. Fortunately, the dynamic loader ("/system/bin/linker") solves this problem for us.
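
To see the PT_LOAD segments and the "base plus file address" arithmetic from inside a running process, here is a small sketch in C. It assumes a Linux or Android libc that provides dl_iterate_phdr in <link.h>; the struct segment_probe and the two helper functions are names of my own choosing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

struct segment_probe {
    uintptr_t addr;          /* address we want to locate */
    size_t    load_segments; /* PT_LOAD segments seen in all objects */
    int       found;         /* 1 if addr lies inside one of them */
};

/* Called once per loaded ELF object (executable, shared libraries). */
static int visit_object(struct dl_phdr_info *info, size_t size, void *data) {
    (void)size;
    assert(data != NULL);
    struct segment_probe *probe = data;

    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        probe->load_segments++;
        /* Runtime address = load base chosen by the loader
         *                 + address recorded in the file. */
        uintptr_t start = (uintptr_t)info->dlpi_addr + (uintptr_t)ph->p_vaddr;
        if (probe->addr >= start && probe->addr - start < ph->p_memsz)
            probe->found = 1;
    }
    return 0; /* 0 means: keep iterating */
}

static struct segment_probe probe_address(const void *p) {
    struct segment_probe probe = { (uintptr_t)p, 0, 0 };
    dl_iterate_phdr(visit_object, &probe);
    return probe;
}
```

Passing the address of any global variable of the program to probe_address should report it as found, since global data lives in one of the mapped PT_LOAD segments.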

If you understand the dynamic linking process, let’s go back and think about the specific meanings of the .got and .plt sections.

The Global Offset Table (GOT). It is a table in the data section. Suppose some instructions in the code section refer to the address of a variable. Instead of embedding an absolute address, the compiler makes them go through the GOT, because absolute addresses cannot be known at compile time and can only be obtained after relocation. The GOT also holds the absolute addresses of referenced functions.
The Procedure Linkage Table (PLT). Unlike the Global Offset Table (GOT), the PLT is located in the code segment. Each external function of a dynamic library has a record in the PLT, and each PLT record is a small piece of executable code. Generally, external code calls the records in the PLT, and the corresponding record in the PLT is responsible for calling the actual function. We generally refer to this setup as a “trampoline”.

PLT records correspond one-to-one with GOT records, and once a GOT record has been resolved for the first time, it contains the actual address of the called function. In that case, what is the significance of the PLT table? In a sense, the PLT gives us lazy-loading capabilities. When a dynamic library is first loaded, none of its function addresses have been resolved yet. Let’s analyze the first function call in detail with the help of a diagram. Please note that the black arrows in the diagram represent jumps, and the purple color represents pointers.

When we call func in the code, the compiler will translate it to func@plt and insert a record in the PLT table.
The first (or zeroth) record in the PLT table, PLT[0], is a special record used to help us resolve addresses. Usually, in Linux-like systems, this implementation is located in the dynamic loader, which is mentioned in the previous articles as /system/bin/linker.
The remaining PLT records all contain the following information:
- An instruction to jump to the GOT table (jmp *GOT[n]).
- Arguments prepared for the address-resolution function reached through the zeroth record.
- A call to PLT[0], where the actual address of the resolver is stored in GOT[2].
Before resolution, GOT[n] directly points to the next instruction after jmp *GOT[n]. After resolution is complete, we obtain the actual address of func, and the dynamic loader fills in this address in GOT[n], then calls func.

If you have any questions about the above calling process, you can refer to the article, “GOT Table and PLT Table”, which contains a clear diagram.

Once the first call occurs, subsequent calls to the func function become more efficient and straightforward. We first call PLT[n], and then execute jmp *GOT[n]. GOT[n] directly points to func, thus efficiently completing the function call.
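
The first-call and later-call behavior can be imitated in plain C with a function pointer standing in for GOT[n]. This is only a simulation under my own naming (got_slot, func_plt, resolve_and_call); the real PLT is machine code and the real resolver lives in the dynamic loader.

```c
#include <assert.h>

typedef int (*func_t)(int);

/* The "real" external function that callers ultimately want. */
static int func(int x) { return x + 1; }

static int resolve_and_call(int x);

/* GOT[n]: before the first call it leads to the resolver path. */
static func_t got_slot = resolve_and_call;
static int resolver_runs = 0;

/* Stand-in for PLT[0] plus the loader's resolver: look up the real
 * address, store it in the GOT slot, then continue into func. */
static int resolve_and_call(int x) {
    resolver_runs++;
    got_slot = func;
    return got_slot(x);
}

/* Stand-in for PLT[n]: an indirect jump through the GOT slot. */
static int func_plt(int x) {
    assert(got_slot != 0);
    return got_slot(x);
}
```

Calling func_plt twice runs the resolver only once; from the second call on, the slot already points at func.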

To summarize, because many functions may not be used after the program is executed, such as error handling functions or rarely used user modules, it is a waste to link all functions initially. To improve the performance of dynamic linking, we can use the PLT to achieve lazy binding.

The actual address of the function for execution is still obtained through the GOT table. The simplified process is as follows:

Now that you have reached this point, I believe you already have a preliminary idea of how to hack this process. In the industry, people usually differentiate between GOT Hooks and PLT Hooks based on whether the PLT record or the GOT record is modified, but the underlying principles are very similar.
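
The core of a GOT Hook is a single pointer write. Here is a minimal sketch in C of that write, using a table of function pointers as a stand-in for the GOT. Two things are my own assumptions rather than something stated above: that the table's page may be read-only and is made writable with mprotect first, and all the names (len_fn, patch_slot, len_hook). Locating the real slot inside a loaded library is left out entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef size_t (*len_fn)(const char *);

static len_fn g_original_len = NULL; /* saved so the hook can call through */
static int g_hook_calls = 0;

/* Stand-in for the real imported function. */
static size_t real_len(const char *s) { return strlen(s); }

/* Replacement: observe the call, then forward to the original. */
static size_t len_hook(const char *s) {
    g_hook_calls++;
    return g_original_len(s);
}

/* Overwrite one table slot, first making its page writable.
 * The page is left writable afterwards to keep the sketch short. */
static int patch_slot(len_fn *slot, len_fn replacement, len_fn *saved) {
    assert(slot != NULL && replacement != NULL);
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)slot & ~(page - 1);
    if (mprotect((void *)start, (size_t)page, PROT_READ | PROT_WRITE) != 0)
        return -1;
    if (saved != NULL)
        *saved = *slot;   /* remember the original target */
    *slot = replacement;  /* every later call through the slot is hooked */
    return 0;
}
```

Saving the old pointer before overwriting it is what lets the hook function behave like a transparent wrapper around the original.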

GOT/PLT Hook Practice

Although GOT/PLT Hook seems simple, there are still some pitfalls in its implementation, and compatibility needs to be considered. Generally, it is recommended to use mature solutions in the industry.

The ELF Hook library in WeChat’s Matrix open-source library uses GOT Hooks primarily for performance monitoring.
The xHook library open-sourced by iQIYI also uses GOT Hooks.
Facebook’s PLT Hook.

If you do not want to delve into the internals of these solutions, you can directly use these excellent open-source frameworks. Because this Hook method is very mature and stable, there are many other usage examples besides hooking thread creation.

In “I/O optimization”, matrix-io-canary hooks file operations.
In the “Network optimization,” Socket operations are hooked. You can refer to Chapter17 for more details.

This Hook method is not a cure-all, because it can only redirect calls that a library makes through its imported functions. Sometimes there is no such external call to replace. If you want to hook calls that stay inside a library, you need to use Trap Hook or Inline Hook.

2. Trap Hook

For internal function hooking, you can think about it for a moment and you will find that the debugger has all the capabilities of a Hook framework. It can interrupt the program before the target function, modify memory and program segments, and then continue execution. I believe many students will use debuggers, but they know very little about how debuggers work. Let’s first understand how software debuggers work.

ptrace

Typically, software debuggers use the ptrace system call and SIGTRAP to perform breakpoint debugging. First, let’s understand what ptrace is and how it interrupts program execution and modifies related execution steps.

A qualified low-level programmer first uses the “man” command to view the system documentation for unfamiliar knowledge.

The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

Let’s also briefly understand how debuggers (GDB/LLDB) use ptrace. First, the debugger will determine whether to use fork or attach to the target process based on whether the process to be debugged has started. Once the debugger is bound to the target program, any signal from the target program (except SIGKILL) will be intercepted by the debugger, and the debugger will have a chance to handle the relevant signal before handing over the execution permission to the target program to continue execution. You may have already noticed that this has actually achieved the purpose of Hooking.
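
The fork variant of this flow can be sketched in a few lines of C. This is a minimal illustration, assuming a Linux-like system where ptrace is permitted for child processes; the function name observe_child_trap and its return convention are my own. The child asks to be traced and raises SIGTRAP, and the parent, acting as the debugger, sees the stop before the child can continue.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of SIGTRAP stops the tracer observed,
 * 0 if the child could not enable tracing, or -1 on other errors. */
static int observe_child_trap(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* tracee */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(2);                     /* tracing not permitted here */
        raise(SIGTRAP);                   /* stops; the tracer is notified */
        _exit(0);
    }

    int traps = 0;                        /* tracer */
    for (;;) {
        int status = 0;
        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (!WIFSTOPPED(status))
            break;                        /* child exited or was killed */
        int sig = WSTOPSIG(status);
        if (sig == SIGTRAP) {
            traps++;                      /* our chance to inspect the tracee */
            sig = 0;                      /* swallow the signal */
        }
        if (ptrace(PTRACE_CONT, pid, NULL, (void *)(long)sig) != 0)
            return -1;
    }
    assert(traps >= 0);
    return traps;
}
```

Between the stop and PTRACE_CONT a real debugger would read or modify the tracee's registers and memory; the sketch only counts the stop.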

How to Hook

However, if we further consider that we do not need to modify memory or do complex interactions like a debugger, we can completely rely on receiving the relevant signals. This is where we think of the signal handler. Yes! We can actively raise a signal and then use the signal handler to achieve similar Hooking effects.

In the industry, Trap Hooking is also called breakpoint Hooking. Its principle is to trigger a breakpoint at the place we need to Hook and then capture the resulting exception. Generally, we use either the SIGTRAP or the SIGILL (illegal instruction exception) signal. Taking SIGTRAP as an example, the specific implementation steps are as follows.

Register the signal handler. Different architectures may choose different signals. Here we use SIGTRAP.
Insert a Trap instruction at the part where we need to Hook.
When execution reaches the Trap instruction, the CPU enters kernel mode, and the kernel calls the signal handler that we registered at the beginning.
Execute our signal handler. Here, please note that all code executed in the signal handler needs to ensure that it is async-signal-safe. Here, we can simply treat the signal handler as a trampoline and use longjmp to jump out of the environment that needs to be async-signal-safe (as I mentioned in the “Crash Analysis”, some functions used in signal callbacks are not safe), and then execute our Hook code.
After executing the Hook function, we need to restore the context. If we want to continue calling the original function A, we can directly write back the original instructions of function A and restore the register status.

Trap Hook Practice

Trap Hook has very good compatibility and can be used on a large scale in production environments. However, its biggest problem is that it is relatively inefficient and is not suitable for functions that are called very frequently.

For the practical aspects of Trap Hooking, in “Performance Optimization for Stuttering (Part 2)”, I mentioned Facebook’s Profilo, which uses periodic sending of SIGPROF signals to achieve stutter monitoring.

3. Inline Hook

Like Trap Hooking, Inline Hooking is also a way to hook internal function calls. It directly replaces the instructions at the beginning (Prologue) of the function with a jump instruction, so that the original function directly jumps to the target function of the Hook, and preserves the calling interface of the original function to achieve the purpose of subsequent re-calling.

Compared with GOT/PLT Hooking, Inline Hooking can hook almost any function without the limitations of the GOT/PLT table. However, its implementation is very complex, and I have not seen any implementation that can be used in production environments so far. And on ARM architectures, it is impossible to hook leaf functions and very short functions.

Before diving into the “evil” details, we need to have a basic understanding of the general process of Inline Hooking.

As shown in the figure, the basic idea of Inline Hooking is to insert a jump instruction in the existing code segment to redirect the execution flow of the code to our implemented Hook function, and then perform instruction patching and jump back to the original function to continue execution. Doesn’t this description look very simple and clear? For Trap Hook, we only need to insert special instructions before the target address and write back the original instructions after execution ends. But for Inline Hook, it directly performs instruction-level overwriting and repair. How should we understand this? It is a bit like rewriting bytecode with ASM, except that it happens at runtime and on machine instructions.

Of course, Inline Hook is much more complex than ASM operations because it also involves instruction set adaptation issues brought by different CPU architectures. We need to perform instruction overwriting and jumping based on different instruction sets.

Now let me briefly explain the common CPU architectures and instruction sets for Android:

x86 and MIPS architectures: These two architectures have very few users left, so we can ignore them. Generally, we only need to focus on the mainstream ARM architecture.
ARMv5 and ARMv7 architectures: The instruction set in ARMv7 consists of 4-byte aligned fixed-length ARM instructions and 2-byte aligned variable-length Thumb/Thumb-2 instructions. Although Thumb-2 instructions are 2-byte aligned, the instruction set itself has both 16-bit and 32-bit instructions. ARMv5 uses 16-bit Thumb16, while ARMv7 uses 32-bit Thumb32. However, ARMv5 has barely any users left, so we can abandon the adaptation of Thumb16 instruction set.
ARMv8 architecture: The 64-bit ARMv8 architecture is compatible with 32-bit, so it adds the ARM64 instruction set on the basis of ARM32 and Thumb32 instruction sets. For specific differences, you can refer to ARM’s official documentation.

I haven’t adapted ARM64 yet, but Google Play requires all apps to support 64-bit by August 1, 2019, so we also need to deal with it in the first half of this year. However, their principles are basically the same. Now, let me explain Inline Hook using the most mainstream ARMv7 architecture as an example.

ARM32 Instruction Set

In ARMv7, there is a widely circulated rule that $PC=$PC+8. It is explained with ARMv7’s three-stage pipeline (fetch, decode, execute): the $PC register always points to the instruction being fetched, not the one being executed, and the fetch stage is always 2 instructions ahead of the execution stage. In the ARM32 instruction set, 2 instructions are 8 bytes long, so the value read from the $PC register is always the address of the current instruction plus 8.

ARMv7 Three-stage Pipeline

This may feel a bit complicated, but it is needed to understand the jump sequence commonly used in the ARM instruction set:

LDR PC, [PC, #-4] ;0xE51FF004
$TRAMPOLIN_ADDR

After understanding the three-stage pipeline, you should have no doubts about PC-4.
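
If you want to check the arithmetic yourself, here is a small sketch in C. It computes the address that a PC-relative load at a given location reads from, and builds the 8-byte jump sequence shown above as little-endian bytes. The function names are my own, and the sketch only produces bytes; it does not write them into executable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM_LDR_PC_PC_MINUS_4 0xE51FF004u  /* LDR PC, [PC, #-4] */

/* Address read by `LDR PC, [PC, #offset]` located at insn_addr.
 * In ARM state, PC reads as the instruction address plus 8. */
static uint32_t arm_pc_relative_addr(uint32_t insn_addr, int32_t offset) {
    return insn_addr + 8u + (uint32_t)offset;
}

/* Emit: LDR PC, [PC, #-4] followed by the 32-bit target address. */
static void arm_write_jump(uint8_t out[8], uint32_t target) {
    assert(out != NULL);
    const uint32_t words[2] = { ARM_LDR_PC_PC_MINUS_4, target };
    for (int w = 0; w < 2; w++)
        for (int b = 0; b < 4; b++)
            out[w * 4 + b] = (uint8_t)(words[w] >> (8 * b));
}
```

With an offset of -4, the load reads from the instruction address plus 4, which is exactly the word that follows the instruction. That is why the target address is stored right after it.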

Following the basic steps of Inline Hook as described earlier, we first insert a jump instruction to enter our trampoline, where we execute our Hook function. There is an “evil” detail here. Since instruction execution depends on the current runtime environment, i.e., the values of all registers, the inserted instructions may change the register states. Therefore, we need to save the current states of all registers to the stack, use the BLX instruction to jump to and execute the Hook function, restore all registers from the stack after execution ends, and finally continue executing the original function as if it were not hooked.

Inline Hook Trampoline and Original Function Execution

After executing the Hook function, we need to jump back to the original function to continue execution. Here, don’t forget the LDR instruction we covered at the beginning. We need to execute the instructions that we have overwritten first, and then use the following instructions to continue executing the original function:

LDR PC, [PC, #-4]
HOOKED_ADDR+8

Do you feel a sense of accomplishment? Actually, there is a huge pitfall waiting for us here, which is instruction repair. I mentioned earlier that we saved and restored the original register states to ensure that we can continue executing like the original program. But is it enough to only restore registers? Obviously not. Although the registers have been perfectly restored, the two backed-up instructions have been moved to new addresses. When they are executed, the value of the $PC register is different from before. If these instructions use the value of $PC, they will produce completely different results.
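
To show why a moved instruction misbehaves, here is a minimal sketch in C that recognizes one PC-dependent case, an ARM load with an immediate offset whose base register is PC, and computes the address it would read from. It covers only this single encoding family and only detection, not the repair itself; the function names are my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for an ARM-state load of the form LDR{B} Rt, [PC, #+/-imm12]. */
static bool arm_is_pc_relative_ldr(uint32_t insn) {
    bool conditional = (insn >> 28) != 0xFu;              /* skip 0b1111 space */
    bool load_imm = (insn & 0x0E100000u) == 0x04100000u;  /* bits 27:25 = 010, L = 1 */
    bool base_is_pc = ((insn >> 16) & 0xFu) == 0xFu;      /* Rn == PC */
    return conditional && load_imm && base_is_pc;
}

/* Address such a load reads from when it sits at insn_addr. */
static uint32_t arm_ldr_literal_addr(uint32_t insn_addr, uint32_t insn) {
    assert(arm_is_pc_relative_ldr(insn));
    uint32_t imm12 = insn & 0xFFFu;
    bool add = ((insn >> 23) & 1u) != 0;   /* U bit: add or subtract */
    uint32_t pc = insn_addr + 8u;          /* PC reads as address + 8 */
    return add ? pc + imm12 : pc - imm12;
}
```

For example, 0xE59F0008 (LDR r0, [PC, #8]) at address 0x1000 reads from 0x1010, but the same word copied to 0x2000 reads from 0x2010. A repair step has to rewrite such an instruction so that it still reaches 0x1010.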

I won’t go into more details about instruction repair here. If you are interested, you can discuss it in the comment section.

Inline Hook Practice

Although Inline Hook is very powerful and efficient, there is currently no completely stable and reliable open-source solution in the industry. Inline Hook is usually used in automation testing or troubleshooting difficult problems in production environments. For example, in the “UI optimization” case mentioned before, we used Inline Hook to collect system information related to the crash of libhwui.so.

There are also some good reference solutions in the industry:

Cydia Substrate. We used it to hook system memory allocation functions in Chapter 3.
adbi. A Hook framework used by Alipay in the GC suppression case, but it hasn’t been updated for several years.

Comparison of advantages and disadvantages of different schools #

Finally, let’s summarize the advantages and disadvantages of different hooking methods:

GOT/PLT Hook is a relatively moderate solution with good performance and moderate implementation difficulty. However, it can only hook functions called between dynamic libraries and cannot hook unexported private functions. Moreover, it only has two states: installed and uninstalled. Once installed, it will hook all function calls.
Trap Hook is the most stable, but it has poor performance because it requires switching between user mode and kernel mode and relies on the kernel’s signal mechanism.
Inline Hook is a very aggressive solution with very good performance and no PLT scope limitation. It can be said to be a very flexible and nearly perfect solution. However, its implementation difficulty is extremely high, and I have not seen any Inline Hook solution that can be deployed in a production environment so far, because it involves instruction repair and has to cope with various compiler optimizations.

But it is important to note that no matter which hooking method is used, it can only hook the processes of the application itself. We cannot replace the execution of functions in the system or other application processes.

Summary #

In summary, Native Hook is a very low-level technology that involves various aspects such as library file compilation, loading, and linking. Furthermore, much of the underlying knowledge is unrelated to Android or even mobile platforms.

In this field, those who specialize in security may have more say, and I might just be stating the obvious. However, I hope that through this article, you have gained a general understanding of the seemingly mysterious technology of Hooking. I hope that you can use Hooking in your daily work to accomplish tasks that may seem impossible, such as fixing system bugs or monitoring native memory allocation.

Homework #

Wasn’t the amount of information today a bit overwhelming? What are your thoughts on Native Hook and what questions do you have? Feel free to leave a comment and discuss with me and other classmates.

Native Hook technology is indeed very complex. Even if we don’t understand its internal principles, we should still learn how to use mature open-source frameworks to implement certain functionalities. Of course, for students who are interested in further research, I recommend studying the following materials.

If you are also very interested in debugger research, I highly recommend the blog written by Eli Bendersky. It has a series of excellent articles on low-level topics, some of which are about debuggers. Interested students can read them and implement a simple debugger themselves.