70 Strategies for Solving Deadlock Issues

70 Strategies for Solving Deadlock Issues #

In this lesson, we mainly introduce the strategies to solve deadlocks.

What to do when a deadlock occurs online #

If a deadlock occurs in the online environment, the adverse consequences have already occurred. The best time to fix a deadlock is to “prevent it from happening in the first place” rather than trying to remedy it afterwards. It’s like a fire, once it’s already burning, it’s almost impossible to put out and avoid losses. Deadlocks are the same. If a deadlock occurs online, the best way to minimize losses is to save JVM information, logs, and other “crime scene” data, and then restart the service immediately to try to fix the deadlock. Why is it said that restarting the service can solve this problem? Because there are often many prerequisites for deadlocks to occur, and deadlocks are more likely to occur when the concurrency is high enough. So the probability of a deadlock occurring immediately after a restart is not very high. After we restart the server, we can temporarily ensure the availability of the online service, and then use the saved crime scene information to investigate the deadlock, modify the code, and ultimately re-release.

Common repair strategies #

What are the common repair strategies for deadlocks? Below, we will introduce three main repair strategies:

Avoidance strategy
Detection and recovery strategy
Ostrich algorithm

Each of them has different focuses. Let’s start with the avoidance strategy.

Avoidance strategy #

How to avoid

The main idea of the avoidance strategy is to optimize the code logic to fundamentally eliminate the possibility of deadlocks. Generally speaking, one of the main reasons for deadlocks is acquiring different locks in reverse order. Hence, we will demonstrate how to avoid deadlocks by adjusting the order of lock acquisition.

Avoiding deadlocks during money transfer

Let’s first look at the situation where a deadlock occurs during money transfer. This example is illustrative and is written for the purpose of learning about deadlocks, so it is very different from the design of a real bank system. But it doesn’t matter because our focus is on how to avoid deadlocks, not the business logic of transfers.

(1) Deadlock occurs

In our money transfer system, to ensure thread safety, we need to first acquire two locks (two lock objects) before transferring funds: one for the account from which money is being transferred, and the other for the account to which money is being transferred. Without this restriction, other threads may modify the balance during the period when a thread is modifying it, which may lead to thread safety issues. Therefore, the balance cannot be operated on until these two locks are acquired; and only after acquiring these two locks can the actual money transfer operation be performed. Of course, if the amount to be transferred is greater than the balance of the account, the transfer is not allowed because it is not allowed for the balance to become negative.

However, there is a potential for deadlock during this process. Let’s take a look at the code:

public class TransferMoney implements Runnable {

    int flag;

    static Account a = new Account(500);

    static Account b = new Account(500);

    static class Account {

        public Account(int balance) {

            this.balance = balance;

        }

        int balance;

    }

    @Override

    public void run() {

        if (flag == 1) {

            transferMoney(a, b, 200);

        }

        if (flag == 0) {

            transferMoney(b, a, 200);

        }

    }

    public static void transferMoney(Account from, Account to, int amount) {

        // Get the two locks first, and then start the money transfer

        synchronized (to) {

    public static void transferMoney(Account from, Account to, int amount) {
    
        synchronized (to) {
    
            try {
    
                Thread.sleep(500);
    
            } catch (InterruptedException e) {
    
                e.printStackTrace();
    
            }
    
            synchronized (from) {
    
                if (from.balance - amount < 0) {
    
                    System.out.println("余额不足，转账失败。");
    
                    return;
    
                }
    
                from.balance -= amount;
    
                to.balance += amount;
    
                System.out.println("成功转账" + amount + "元");
    
            }
    
        }
    
    }

再次运行代码，发现程序并未正常执行。在控制台输出中并没有打印任何信息，程序可能发生了死锁。出现这种情况是因为线程t1和t2分别持有了账户a和b的锁，并且它们都在等待对方的锁，因此形成了死锁。

要解决死锁问题，我们可以通过按照相同的顺序获取锁来避免。在本例中，我们可以先获取a的锁，再获取b的锁。我们可以修改transferMoney方法如下：

    public static void transferMoney(Account from, Account to, int amount) {
    
        Account left = from;
    
        Account right = to;
    
        //根据账号的hashCode决定锁的获取顺序
        if (from.hashCode() > to.hashCode()) {
    
            left = to;
    
            right = from;
    
        }
    
        synchronized (left) {
    
            try {
    
                Thread.sleep(500);
    
            } catch (InterruptedException e) {
    
                e.printStackTrace();
    
            }
    
            synchronized (right) {
    
                if (from.balance - amount < 0) {
    
                    System.out.println("余额不足，转账失败。");
    
                    return;
    
                }
    
                from.balance -= amount;
    
                to.balance += amount;
    
                System.out.println("成功转账" + amount + "元");
    
            }
    
        }
    
    }

通过上述修改，我们保证了无论哪个线程先获取锁，它们都是按照相同的顺序去获取锁的。这样就可以避免发生死锁了。

你可以尝试用该方法修改你的代码，以避免死锁的情况发生。 ```java synchronized (to) {

        try {

            Thread.sleep(500);

        } catch (InterruptedException e) {

            e.printStackTrace();

        }

        synchronized (from) {

            if (from.balance - amount < 0) {

                System.out.println("Insufficient balance, transfer failed.");

                return;

            }

            from.balance -= amount;

            to.balance += amount;

            System.out.println("Successfully transferred " + amount + " yuan");

        }

    }
    ```
    
    As you can see, the change in `transferMoney` is that, between the two `synchronized` blocks, which means after acquiring the first lock and before acquiring the second lock, we added a 500 milliseconds sleep statement. Now, when running the program, there is a high probability of a deadlock occurring, causing **no statement to be printed in the console and the program will not stop**.
    
    Let's analyze why deadlock occurs. The main reason is that the two different threads obtain the two locks in reverse order (the two accounts obtained by the first thread are exactly the accounts obtained by the second thread, and the "transfer out account" of the first thread is the "transfer in account" of the second thread), so we can start from the perspective of this "reverse order" to solve the deadlock problem.
    
    **(2) Actually don't care about the order of obtaining locks**
    
    After thinking, we can find that the order in which the two locks are obtained actually doesn't matter when transferring money. Whether we first obtain the transfer out account lock object or the transfer in account lock object, as long as we can eventually obtain the two locks, we can perform the operation safely. So let's adjust the order of obtaining locks so that the account obtained first is not related to whether it is "transfer in" or "transfer out", but is determined by the value of HashCode, thereby ensuring thread safety.
    
    Here is the fixed `transferMoney` method:
    
    ```java
    public static void transferMoney(Account from, Account to, int amount) {

        int fromHash = System.identityHashCode(from);

        int toHash = System.identityHashCode(to);

        if (fromHash < toHash) {

            synchronized (from) {

                synchronized (to) {

                    if (from.balance - amount < 0) {

                        System.out.println("Insufficient balance, transfer failed.");

                        return;

                    }

                    from.balance -= amount;

                    to.balance += amount;

                    System.out.println("Successfully transferred " + amount + " yuan");

                }

            }

        } else if (fromHash > toHash) {

            synchronized (to) {

                synchronized (from) {
    ```

if (from.balance - amount < 0) {

    System.out.println("Insufficient balance, transfer failed.");

    return;

}

from.balance -= amount;

to.balance += amount;

System.out.println("Transfer successful: " + amount + " yuan");

}

As you can see, we will calculate the HashCode of these two Accounts separately, and then decide the order of obtaining the lock based on the size of the HashCode. In this way, no matter which thread executes first, no matter whether it is transferred out or transferred in, the order of obtaining the lock will be strictly determined according to the value of the HashCode, so that everyone will obtain the lock in the same order, and there will be no situation where the order of obtaining the lock is reversed, thus avoiding deadlock.

(3) It is safer and more convenient to have a primary key

Let’s take a look at the way to determine the lock acquisition order using the primary key, which is safer and more convenient. Just now we used HashCode as the sorting criterion because HashCode is more universal and every object has it, but there is still a very small probability that the same HashCode will occur. In practical production, what needs to be sorted is often an entity class, and an entity class generally has a primary key ID, which has the characteristics of unique and non-repetitive. Therefore, if our class contains the primary key attribute, it will be more convenient. We don’t need to calculate the HashCode. We can directly use its primary key ID to sort and determine the order of obtaining the lock, which can ensure the avoidance of deadlock.

Above, we have introduced deadlock avoidance strategies.

Detection and recovery strategies #

Now let’s look at the second strategy, which is the detection and recovery strategy.

What is a deadlock detection algorithm?

It is different from the previous deadlock avoidance strategy. The deadlock avoidance strategy is to prevent deadlock from occurring through logic, while the detection and recovery strategy here allows deadlock to occur first and then release it. For example, the system can record the call information every time it calls a lock, forming a “lock call chain diagram”. Then, at regular intervals, a deadlock detection algorithm can be used to check whether there is a cycle in this graph. Once a deadlock occurs, a deadlock recovery mechanism can be used, such as depriving a resource to unlock the deadlock and perform recovery. Therefore, its approach is very different from the previous deadlock avoidance strategy.

How to unlock a deadlock after it is detected? Method 1: Thread Termination

The first method to resolve deadlocks is to terminate the threads (or processes). In this case, the system will terminate the threads that are already in a deadlock one by one. When a thread is terminated, it releases the resources it holds, which resolves the deadlock.

Of course, the order of termination should be considered carefully. Generally, the following factors are taken into account:

(1) Priority

Normally, the priority of the thread or process is considered during termination. Threads with lower priority are terminated first. For example, foreground threads are involved in user interface display, which is important for users. Therefore, foreground threads often have higher priority than background threads.

(2) Resources Currently Held and Resources Needed

The resources held by a thread and the resources it still needs are also taken into account. If a thread has already acquired many resources and only needs a few more to complete its task successfully, the system may not prioritize terminating that thread. Instead, it may choose to terminate other threads to facilitate the completion of the one with fewer remaining resource needs.

(3) Elapsed Running Time

Another consideration factor is the elapsed running time. For example, if a thread has been running for several hours or even days and is close to completing its task, terminating that thread may not be a wise choice. We can terminate the threads that have just started and restart them later, which is a lower cost solution.

There are various algorithms and strategies that can be used, and they can be adjusted according to the actual business needs.

Method 2: Resource Preemption

The second method to resolve deadlocks is resource preemption. Instead of terminating the entire thread, we only need to revoke the resources already acquired by the thread. For example, the thread can be rolled back several steps and release the resources. This way, we don’t need to terminate the entire thread, resulting in lower consequences and costs.

However, this approach has a drawback. If the algorithm is not well-designed, the thread being preempted may be the same thread continuously, leading to thread starvation. This means that the thread is constantly deprived of the resources it has acquired, resulting in long-term inability to run.

These are the strategies for deadlock detection and recovery.

Ostrich Algorithm #

Now let’s also take a look at the ostrich algorithm. The ostrich algorithm is named after the behavior of ostriches. When faced with danger, an ostrich buries its head in the sand, so it can’t see the danger anymore.

The meaning of the ostrich algorithm is that if the likelihood of deadlock in our system is not high and the consequences are not very serious when it occurs, we can choose to ignore it at first. We can manually fix it when the deadlock actually occurs, for example, by restarting the service. This approach is not unacceptable. If our system is used by a small number of users, such as an internal system, and the concurrency is extremely low, it may not experience deadlocks for years. In this case, considering the cost-effectiveness, there is no need to handle deadlock issues specially. This needs to be reasonably chosen based on our business scenario.

Summary #

In this lesson, we mainly introduced various strategies to solve deadlocks. Firstly, we mentioned that when a deadlock occurs in a production environment, the priority should be to save important data and then restore the online services. Then, we introduced three specific recovery strategies: avoidance strategy, which aims to change the order of acquiring locks to prevent the occurrence of reverse order lock acquisition; detection and recovery strategy, which allows deadlocks to occur but has solutions to resolve them; and the ostrich algorithm.