14 Retry Mechanism Is the Basic Guarantee of Network Operations

In real-world microservice systems, service discovery components such as ZooKeeper and etcd are generally deployed as a separate cluster. Business services connect to these service discovery nodes over the network to perform registration and subscription operations. However, even on a stable network inside a data center, there is no guarantee that requests between two nodes will always succeed. RPC frameworks like Dubbo therefore face significant challenges in stability and fault tolerance, and a retry mechanism is essential for keeping services reliable.

The so-called “retry mechanism” means that when a request fails, the client reissues an identical request to the same server or a different one in an attempt to complete the corresponding business operation. A business interface may use the retry mechanism only if it is “idempotent”: no matter how many times the same request is sent, the outcome is the same. A query operation is a typical example.
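The idea of reissuing an identical request can be illustrated with a small sketch. The helper below is hypothetical and is not part of Dubbo: SimpleRetry, callWithRetry, and the maxAttempts parameter are names invented for illustration, and the sketch assumes the wrapped call is idempotent.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper (not Dubbo code): retries an idempotent call a bounded number of times. */
final class SimpleRetry {
    private SimpleRetry() {}

    /** Invokes the call until it succeeds or maxAttempts is used up, then rethrows the last failure. */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get(); // the identical request is reissued on every attempt
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // A query that fails twice (simulated network error) and then succeeds.
        String result = callWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated network failure");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Retrying immediately in a loop is the simplest possible policy; it is shown here only to make the definition concrete.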

Core design #

In the previous lesson, we introduced the core operations in AbstractRegistry, such as register()/unregister(), subscribe()/unsubscribe(), and notify(), and analyzed the fault tolerance functionality implemented through local caching. In fact, these core methods are also the focus of the retry mechanism.

dubbo-registry puts the implementation of the retry mechanism in a subclass of AbstractRegistry called FailbackRegistry. As shown in the diagram below, the Registry implementations that integrate with open-source service discovery components like ZooKeeper and etcd inherit from FailbackRegistry, which gives them the ability to retry failed operations.

[Figure: Registry inheritance relationship]

The core design of FailbackRegistry is as follows: it overrides the five core methods of AbstractRegistry, namely register()/unregister(), subscribe()/unsubscribe(), and notify(), and combines them with the timer wheel introduced earlier to retry failed operations. The actual interaction with the service discovery component is delegated to five abstract methods, doRegister()/doUnregister(), doSubscribe()/doUnsubscribe(), and doNotify(), which are implemented by concrete subclasses. This is a typical application of the template method pattern.
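The template method structure described above can be sketched as follows. This is a simplified stand-in rather than the Dubbo source: String replaces Dubbo's URL type, failures are only recorded instead of being scheduled on a timer wheel, and SketchFailbackRegistry and SketchFlakyRegistry are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for FailbackRegistry; String replaces Dubbo's URL type. */
abstract class SketchFailbackRegistry {
    // URLs whose registration failed and is waiting for a retry.
    final Set<String> failedRegistered = ConcurrentHashMap.newKeySet();

    /** Template method: fixed failure handling wrapped around a subclass-specific step. */
    final void register(String url) {
        failedRegistered.remove(url);   // a fresh attempt supersedes any pending retry
        try {
            doRegister(url);            // the real interaction, supplied by the subclass
        } catch (RuntimeException e) {
            failedRegistered.add(url);  // remember the failure so it can be retried later
        }
    }

    /** Hook implemented once per service discovery component. */
    protected abstract void doRegister(String url);
}

/** Hypothetical subclass whose backend can be switched on and off for the demo. */
class SketchFlakyRegistry extends SketchFailbackRegistry {
    boolean available = false;
    final Set<String> remote = ConcurrentHashMap.newKeySet();

    @Override
    protected void doRegister(String url) {
        if (!available) {
            throw new IllegalStateException("registry unreachable");
        }
        remote.add(url);
    }

    public static void main(String[] args) {
        SketchFlakyRegistry registry = new SketchFlakyRegistry();
        registry.register("dubbo://10.0.0.1:20880/DemoService");
        System.out.println("pending retries: " + registry.failedRegistered);
        registry.available = true;
        registry.register("dubbo://10.0.0.1:20880/DemoService");
        System.out.println("pending retries: " + registry.failedRegistered);
    }
}
```

The subclass only knows how to talk to its backend, while the base class owns the bookkeeping for failures.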

Description of core fields #

The first step in analyzing an implementation class is to understand its core fields. So, what are the core fields of FailbackRegistry?

retryTimer (HashedWheelTimer type): The timer wheel used to schedule retry operations.
retryPeriod (int type): The interval between retry operations, in milliseconds.
failedRegistered (ConcurrentMap type): The URLs that failed to be registered. The key is the URL that failed to be registered and the value is the corresponding retry task.
failedUnregistered (ConcurrentMap type): The URLs that failed to be unregistered. The key is the URL that failed to be unregistered and the value is the corresponding retry task.
failedSubscribed (ConcurrentMap type): The subscriptions that failed. The key is the URL + Listener combination that failed to be subscribed and the value is the corresponding retry task.
failedUnsubscribed (ConcurrentMap type): The unsubscriptions that failed. The key is the URL + Listener combination that failed to be unsubscribed and the value is the corresponding retry task.
failedNotified (ConcurrentMap type): The notifications that failed. The key is the URL + Listener combination that failed to be notified and the value is the corresponding retry task.

The constructor of FailbackRegistry first calls the constructor of the parent class AbstractRegistry to initialize the local cache, then reads the retry period parameter (retry.period) from the URL to initialize the retryPeriod field, and finally initializes the retryTimer timer wheel. The code for this process is relatively simple and will not be shown here.

Analysis of core method implementations #

The specific implementations of the register()/unregister() methods and the subscribe()/unsubscribe() methods in FailbackRegistry are very similar. Therefore, we will only introduce the specific implementation process of the register() method here.

Based on the matching pattern specified by the accepts parameter in registryUrl, decide whether to accept the current Provider URL to be registered.
Call the register() method of the parent class AbstractRegistry to add the Provider URL to the registered collection.
Call the removeFailedRegistered() and removeFailedUnregistered() methods to remove the Provider URL from the failedRegistered and failedUnregistered collections, and stop the corresponding retry tasks.
Call the doRegister() method to interact with the service discovery component. This method is implemented by subclasses, with each subclass responsible for integrating with a specific service discovery component.
If doRegister() throws an exception, the handling depends on the URL parameters and on the type of the exception. The exception is rethrown directly when all three of the following conditions hold: the check parameter of the registryUrl is true (the default), the check parameter of the URL being registered is true (the default), and the URL being registered does not use the consumer protocol. It is also rethrown directly when it is a SkipFailbackWrapperException. Otherwise, a retry task is created and added to the failedRegistered collection.
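The decision in the last step can be isolated as a small predicate, which makes the combinations easy to check one by one. This is an illustrative sketch only: RegisterFailurePolicy and shouldRethrow are hypothetical names, the boolean parameters stand in for values that would be read from the URLs, and the "consumer" protocol string is an assumption made for the sketch.

```java
/** Illustrative predicate for the rethrow-or-retry decision; all names are hypothetical. */
final class RegisterFailurePolicy {
    // Assumed protocol name of consumer URLs.
    static final String CONSUMER_PROTOCOL = "consumer";

    private RegisterFailurePolicy() {}

    /**
     * @param registryCheck check parameter of the registry URL
     * @param urlCheck      check parameter of the URL being registered
     * @param protocol      protocol of the URL being registered
     * @param skipFailback  whether the exception is a SkipFailbackWrapperException
     * @return true if the exception should be rethrown, false if a retry task should be created
     */
    static boolean shouldRethrow(boolean registryCheck, boolean urlCheck,
                                 String protocol, boolean skipFailback) {
        boolean check = registryCheck && urlCheck && !CONSUMER_PROTOCOL.equals(protocol);
        return check || skipFailback;
    }

    public static void main(String[] args) {
        // Provider URL with default check settings: fail fast.
        System.out.println(shouldRethrow(true, true, "dubbo", false));
        // Consumer URL: the failure is retried in the background instead.
        System.out.println(shouldRethrow(true, true, "consumer", false));
    }
}
```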

Now that the core process of the register() method is clear, let’s take a look at the specific implementation code of the register() method:

public void register(URL url) {
    if (!acceptable(url)) {
        logger.info("..."); // Print relevant log messages
        return;
    }
    super.register(url); // Add the URL to the registered collection
    // Clean up the failedRegistered and failedUnregistered collections and cancel the related retry tasks
    removeFailedRegistered(url);
    removeFailedUnregistered(url);
    try {
        doRegister(url); // Interact with the service discovery component, implemented by subclasses
    } catch (Exception e) {
        Throwable t = e;
        // Check the check parameters to decide whether to throw the exception directly
        boolean check = getUrl().getParameter(Constants.CHECK_KEY, true)
                && url.getParameter(Constants.CHECK_KEY, true)
                && !CONSUMER_PROTOCOL.equals(url.getProtocol());
        boolean skipFailback = t instanceof SkipFailbackWrapperException;
        if (check || skipFailback) {
            if (skipFailback) {
                t = t.getCause();
            }
            throw new IllegalStateException("Failed to register", t);
        }
        // The exception was not rethrown: create a retry task and add it to the failedRegistered collection
        addFailedRegistered(url);
    }
}

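The addFailedRegistered() method creates a FailedRegisteredTask for the URL and submits it to the timer wheel. FailedRegisteredTask extends AbstractRetryTask, whose reput() method puts a task back into the timer wheel so that it runs again after the given delay:

protected void reput(Timeout timeout, long tick) {
    if (timeout == null) { // Boundary check
        throw new IllegalArgumentException();
    }
    Timer timer = timeout.timer();
    // Do nothing if the timer has stopped or the task has been cancelled
    if (timer.isStop() || timeout.isCancelled() || isCancel()) {
        return;
    }
    times++; // Increment the retry count
    // Add the task back to the timer
    timer.newTimeout(timeout.task(), tick, TimeUnit.MILLISECONDS);
}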
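The rescheduling idea behind reput() can be tried out without Dubbo. In the sketch below a ScheduledExecutorService stands in for the timer wheel, which is a deliberate substitution and not what Dubbo uses; ReputSketchTask, demo(), and the 10 ms period are hypothetical choices made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a self-rescheduling retry task; a ScheduledExecutorService replaces the timer wheel. */
class ReputSketchTask implements Runnable {
    private final ScheduledExecutorService timer;
    private final Runnable action;      // the operation to retry, e.g. a doRegister() call
    private final long periodMillis;    // plays the role of retryPeriod
    private final CountDownLatch done = new CountDownLatch(1);
    final AtomicInteger times = new AtomicInteger(); // number of times the task was put back
    private volatile boolean cancelled;

    ReputSketchTask(ScheduledExecutorService timer, Runnable action, long periodMillis) {
        this.timer = timer;
        this.action = action;
        this.periodMillis = periodMillis;
    }

    void cancel() {
        cancelled = true;
    }

    @Override
    public void run() {
        if (cancelled || timer.isShutdown()) {
            return;
        }
        try {
            action.run();
            done.countDown(); // success: the task is not rescheduled
        } catch (RuntimeException e) {
            reput();          // failure: try again after the retry period
        }
    }

    private void reput() {
        if (cancelled || timer.isShutdown()) {
            return;
        }
        times.incrementAndGet();
        timer.schedule(this, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs an action that fails the given number of times; returns the reput count, or -1 on timeout. */
    static int demo(int failures) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            AtomicInteger calls = new AtomicInteger();
            ReputSketchTask task = new ReputSketchTask(timer, () -> {
                if (calls.incrementAndGet() <= failures) {
                    throw new IllegalStateException("simulated failure");
                }
            }, 10);
            timer.schedule(task, 10, TimeUnit.MILLISECONDS);
            return task.done.await(5, TimeUnit.SECONDS) ? task.times.get() : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("task was put back " + demo(2) + " times before it succeeded");
    }
}
```

As in reput(), the task checks for a stopped timer or a cancelled task before rescheduling itself, so cancelling a task is enough to end its retry loop.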

AbstractRetryTask declares doRetry() as an abstract method and leaves its implementation to subclasses, which is another application of the template method pattern. The doRetry() implementation in the subclass FailedRegisteredTask executes the doRegister() method of the associated Registry again to interact with the service discovery component. If registration succeeds, removeFailedRegisteredTask() is called to remove the URL and the current retry task from the failedRegistered collection. If registration fails, doRegister() throws an exception and the task is rescheduled through the reput() method introduced earlier.

protected void doRetry(URL url, FailbackRegistry registry, Timeout timeout) {
    registry.doRegister(url); // Retry registration
    registry.removeFailedRegisteredTask(url); // Remove the retry task
}

public void removeFailedRegisteredTask(URL url) {
    failedRegistered.remove(url);
}
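How the failedRegistered map keeps at most one retry task per URL can be sketched as below. This is an assumption-laden illustration rather than the Dubbo implementation: FailedTaskTable and its nested Task class are hypothetical, String replaces URL, and a counter replaces the actual submission to the timer wheel.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the bookkeeping around failed registrations; all names are hypothetical. */
class FailedTaskTable {
    /** Minimal stand-in for a retry task that can be cancelled. */
    static final class Task {
        final String url;
        private volatile boolean cancelled;

        Task(String url) {
            this.url = url;
        }

        void cancel() {
            cancelled = true;
        }

        boolean isCancelled() {
            return cancelled;
        }
    }

    private final ConcurrentMap<String, Task> failedRegistered = new ConcurrentHashMap<>();
    // Counts how many tasks would have been handed to the timer wheel.
    final AtomicInteger scheduled = new AtomicInteger();

    /** Creates at most one retry task per URL and returns the task tracked for that URL. */
    Task addFailedRegistered(String url) {
        Task newTask = new Task(url);
        Task existing = failedRegistered.putIfAbsent(url, newTask);
        if (existing != null) {
            return existing;         // a retry is already pending, do not schedule another one
        }
        scheduled.incrementAndGet(); // real code would submit newTask to the timer wheel here
        return newTask;
    }

    /** Called when register() is invoked again: drop and cancel the pending retry. */
    void removeFailedRegistered(String url) {
        Task task = failedRegistered.remove(url);
        if (task != null) {
            task.cancel();
        }
    }

    /** Called by the retry task itself after a successful retry. */
    void removeFailedRegisteredTask(String url) {
        failedRegistered.remove(url);
    }

    int pending() {
        return failedRegistered.size();
    }
}
```

Using putIfAbsent() means that two threads failing on the same URL at the same time still end up with a single scheduled task.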

Additionally, in the entry point of the register() method, the removeFailedRegistered() method and the removeFailedUnregistered() method are called to clean up the specified URL-associated timer tasks:

public void register(URL url) {
    super.register(url);
    removeFailedRegistered(url); // Clean up FailedRegisteredTask timer tasks
    removeFailedUnregistered(url); // Clean up FailedUnregisteredTask timer tasks
    try {
        doRegister(url);
    } catch (Exception e) {
        addFailedRegistered(url);
    }
}

Other core methods #

The implementations of the unregister() and unsubscribe() methods are similar to that of register(). They differ only in the do*() abstract method they call and in the AbstractRetryTask subclass they rely on, so we will not go into further detail here.

Do you remember the fault-tolerance mechanism that AbstractRegistry implements through local file caching, introduced in the previous lesson? When the FailbackRegistry.subscribe() method handles an exception, it first retrieves the cached subscription data and calls the notify() method. Only if there is no cached subscription data does it check the check parameter to decide whether to throw the exception.

From the earlier introduction to the notify() method in AbstractRegistry, we know that one of its core responsibilities is to call back the NotifyListener. Now let’s take a look at how FailbackRegistry overrides the notify() method:

protected void notify(URL url, NotifyListener listener,
        List<URL> urls) {
    ... // Check that url and listener are not null (omitted)
    try {
        // FailbackRegistry.doNotify() simply calls the parent class
        // AbstractRegistry.notify() method without any additional logic
        doNotify(url, listener, urls);
    } catch (Exception t) {
        // If an exception occurs in the doNotify() method, a timer task will be added
        addFailedNotified(url, listener, urls);
    }
}

The addFailedNotified() method will create the corresponding FailedNotifiedTask and add it to the failedNotified collection, and it will also be added to the timer wheel for execution. If the corresponding FailedNotifiedTask retry task already exists, the task will be updated with the URL set that needs to be processed.

The FailedNotifiedTask maintains a URL set, which is used to record the URLs that need to be notified during each task execution. After each task execution, the set will be cleared. The specific implementation is as follows:

protected void doRetry(URL url, FailbackRegistry registry,
        Timeout timeout) {
    // If the urls collection is empty, there is nothing to notify and this run does nothing
    if (CollectionUtils.isNotEmpty(urls)) {
        listener.notify(urls);
        urls.clear();
    }
    reput(timeout, retryPeriod); // Add the task back to the timer wheel for execution
}
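The accumulate-then-flush behaviour of the URL set can be sketched in isolation. The class below is a hypothetical simplification: NotifyRetrySketch, addUrlToRetry(), and runOnce() are illustrative names, a Consumer stands in for NotifyListener, String replaces URL, and rescheduling is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Sketch of a notification retry task that collects URLs between runs; names are hypothetical. */
class NotifyRetrySketch {
    private final Consumer<List<String>> listener;                  // stands in for NotifyListener
    private final List<String> urls = new CopyOnWriteArrayList<>(); // URLs waiting to be notified

    NotifyRetrySketch(Consumer<List<String>> listener) {
        this.listener = listener;
    }

    /** Called when a notification fails, also while a task for the same key already exists. */
    void addUrlToRetry(Collection<String> failed) {
        if (failed != null && !failed.isEmpty()) {
            urls.addAll(failed);
        }
    }

    /** One execution of the task; the URLs are cleared only after the listener returns normally. */
    void runOnce() {
        if (!urls.isEmpty()) {
            listener.accept(new ArrayList<>(urls)); // hand the listener a snapshot
            urls.clear();
        }
    }

    int pending() {
        return urls.size();
    }

    public static void main(String[] args) {
        NotifyRetrySketch task = new NotifyRetrySketch(batch -> System.out.println("notify " + batch));
        task.runOnce();                                    // nothing pending, nothing printed
        task.addUrlToRetry(Arrays.asList("dubbo://a", "dubbo://b"));
        task.runOnce();                                    // notifies both URLs and clears the set
        System.out.println("pending: " + task.pending());
    }
}
```

The sketch ignores the window between taking the snapshot and clearing the list, so URLs added in that window could be lost; a production implementation would need to handle that.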

From the above code, it looks as if a FailedNotifiedTask retry task, once added, keeps running indefinitely. But is that really the case? The subscribe() and unsubscribe() methods of FailbackRegistry lead to a call to the removeFailedNotified() method, which is where FailedNotifiedTask tasks are cleaned up. Let’s take FailbackRegistry.subscribe() as an example to illustrate:

public void subscribe(URL url, NotifyListener listener) {
    super.subscribe(url, listener);
    removeFailedSubscribed(url, listener); // Pay attention to this method
    try {
        doSubscribe(url, listener);
    } catch (Exception e) {
        addFailedSubscribed(url, listener);
    }
}

// removeFailedSubscribed() cleans up the FailedSubscribedTask, FailedUnsubscribedTask, and FailedNotifiedTask timer tasks
private void removeFailedSubscribed(URL url, NotifyListener listener) {
    Holder h = new Holder(url, listener);
    FailedSubscribedTask f = failedSubscribed.remove(h); // Clean up FailedSubscribedTask
    if (f != null) {
        f.cancel();
    }
    removeFailedUnsubscribed(url, listener); // Clean up FailedUnsubscribedTask
    removeFailedNotified(url, listener); // Clean up FailedNotifiedTask
}
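For a freshly created Holder to find the entry that an earlier Holder stored, the URL + Listener combination has to define value-based equality. The sketch below shows one way such a composite key could look; HolderSketch is a hypothetical class, String replaces URL, Object replaces NotifyListener, and it should not be read as the actual Holder implementation.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a composite map key built from a URL and a listener; names are hypothetical. */
final class HolderSketch {
    private final String url;
    private final Object listener;

    HolderSketch(String url, Object listener) {
        this.url = Objects.requireNonNull(url, "url");
        this.listener = Objects.requireNonNull(listener, "listener");
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof HolderSketch)) {
            return false;
        }
        HolderSketch other = (HolderSketch) o;
        return url.equals(other.url) && listener.equals(other.listener);
    }

    @Override
    public int hashCode() {
        return Objects.hash(url, listener);
    }

    public static void main(String[] args) {
        Map<HolderSketch, String> failedSubscribed = new ConcurrentHashMap<>();
        Object listener = new Object();
        failedSubscribed.put(new HolderSketch("consumer://demo", listener), "retry task");
        // A new key object with the same URL and listener removes the stored entry.
        System.out.println(failedSubscribed.remove(new HolderSketch("consumer://demo", listener)));
    }
}
```

Without equals() and hashCode(), the remove() call would compare object identity and the pending retry task would never be found.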

Having covered the core implementation of registration and subscription in FailbackRegistry, let’s now look at its recovery feature, namely the recover() method. This method uses FailedRegisteredTask tasks to process all URLs in the registered collection, and FailedSubscribedTask tasks to process the URLs in the subscribed collection together with their associated NotifyListeners.
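The recovery idea can be pictured with a rough sketch, under the assumption that recovering simply means resubmitting every known registration and every known URL and listener pair as a retry task. RecoverSketch and its fields are hypothetical, strings replace URLs and listeners, and a plain list replaces the timer wheel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of recovery as "replay everything through the retry path"; all names are hypothetical. */
class RecoverSketch {
    final Set<String> registered = new LinkedHashSet<>();              // URLs registered so far
    final Map<String, Set<String>> subscribed = new LinkedHashMap<>(); // URL -> listener names
    final List<String> retryQueue = new ArrayList<>();                 // stands in for the timer wheel

    void subscribe(String url, String listener) {
        subscribed.computeIfAbsent(url, k -> new LinkedHashSet<>()).add(listener);
    }

    /** Resubmits all registrations and subscriptions, e.g. after the registry connection is restored. */
    void recover() {
        // Iterate over copies so that later changes to the collections cannot disturb the loops.
        for (String url : new ArrayList<>(registered)) {
            retryQueue.add("register " + url);
        }
        for (Map.Entry<String, Set<String>> entry : new LinkedHashMap<>(subscribed).entrySet()) {
            for (String listener : new ArrayList<>(entry.getValue())) {
                retryQueue.add("subscribe " + entry.getKey() + " -> " + listener);
            }
        }
    }

    public static void main(String[] args) {
        RecoverSketch registry = new RecoverSketch();
        registry.registered.add("dubbo://10.0.0.1:20880/DemoService");
        registry.subscribe("consumer://10.0.0.2/DemoService", "listenerA");
        registry.recover();
        registry.retryQueue.forEach(System.out::println);
    }
}
```

Routing recovery through the same retry tasks means a recovery attempt that fails is retried like any other failed operation.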

When the lifecycle of FailbackRegistry ends, it calls its own destroy() method. In addition to calling the destroy() method of the parent class, it also calls the stop() method of the timer wheel (retryTimer field) to release resources related to the timer wheel.

Summary #

In this lesson, we introduced the core implementation of FailbackRegistry, a subclass of AbstractRegistry that adds a retry mechanism on top of it. When a core method such as register()/unregister() or subscribe()/unsubscribe() fails, FailbackRegistry adds a retry timer task, and it cleans up those timer tasks when they are no longer needed.