05 HTTP Invocation: Did You Consider Timeout, Retry, and Concurrency? #

Today, let’s discuss the issues of timeouts, retries, and concurrency that need to be considered when making HTTP calls.

Unlike calling a local method, making an HTTP call is essentially a network request over the HTTP protocol. Network requests can always time out, so we must take these three points into consideration:

First, whether the default timeout set by the framework is reasonable;

Second, considering the instability of the network, retrying the request after a timeout is a good option, but we need to consider whether the idempotent design of the server interface allows us to retry;

Third, we need to consider whether the framework, like a browser, limits the number of concurrent connections, so that HTTP calls do not become a throughput bottleneck when the service itself handles high concurrency.

Spring Cloud is a representative framework for Java microservices architecture. If you develop microservices with Spring Cloud, you will most likely use Feign for declarative service calls. If you develop microservices with plain Spring Boot instead, you may use Apache HttpClient, the most commonly used HTTP client in Java, directly for service calls.

Next, let’s take a look at the pitfalls that may be encountered when using Feign and Apache HttpClient for making HTTP interface calls in terms of timeouts, retries, and concurrency.

The Subtleties of Configuring Connection Timeout and Read Timeout Parameters #

For HTTP calls, even though the application layer uses the HTTP protocol, the underlying network layer is always the TCP/IP protocol. TCP/IP is a connection-oriented protocol that requires establishing a connection before transmitting data. Almost all network frameworks provide two timeout parameters:

  1. Connection timeout parameter ConnectTimeout: allows users to configure the maximum waiting time during the connection establishment phase.

  2. Read timeout parameter ReadTimeout: controls the maximum waiting time for reading data from the socket.

These two may seem like low-level, network-layer configuration parameters that rarely draw developers’ attention. However, understanding and configuring them correctly is crucial for business applications. After all, a timeout is not a one-sided issue: the client and server need a consistent view of timeouts and must work together to balance throughput and error rate.

There are two common misconceptions regarding connection timeouts:

  1. Setting an excessively long connection timeout, such as 60 seconds. The TCP three-way handshake normally completes very quickly, anywhere from milliseconds to at most a few seconds; it is very unlikely to take tens of seconds. If a connection cannot be established quickly, it is usually a network or firewall configuration problem, and if it cannot be established within a few seconds, it probably never will be. An overly long connection timeout is therefore meaningless; configure it shorter instead, say 1 to 5 seconds (see the sketch after this list). For calls to services on an internal network, it can be even shorter, so that the caller fails fast when a downstream service is offline and unreachable.

  2. Troubleshooting connection timeouts without first clarifying what the client actually connects to. Normally, our services have multiple nodes. If the client uses a client-side load-balancing technique, it establishes connections directly with the server, and a connection timeout is most likely a server-side problem. But if the server sits behind a reverse proxy such as Nginx for load balancing, the client actually connects to Nginx rather than to the server, and a connection timeout should be investigated on the Nginx side.
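To make the first point concrete, here is a minimal sketch of setting a short, client-wide connection timeout with Apache HttpClient 4.x’s RequestConfig API (the 2-second and 5-second values are illustrative only; pick them based on your own network and SLA):

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class TimeoutDefaults {

    // Fail fast if the TCP connection cannot be established within 2 seconds,
    // and wait at most 5 seconds for the server to return data
    static final CloseableHttpClient CLIENT = HttpClients.custom()
            .setDefaultRequestConfig(RequestConfig.custom()
                    .setConnectTimeout(2000)
                    .setSocketTimeout(5000)
                    .build())
            .build();
}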

There are more misconceptions about read timeouts. I summarize them into the following three:

  1. Misconception: Assuming that when a read timeout occurs, the server’s execution will be interrupted.

Let’s do a simple test. Define a client interface that internally calls the server interface using HttpClient. The client has a read timeout of 2 seconds, and the server interface takes 5 seconds to execute.

@RestController
@RequestMapping("clientreadtimeout")
@Slf4j
public class ClientReadTimeoutController {

    private String getResponse(String url, int connectTimeout, int readTimeout) throws IOException {
        return Request.Get("http://localhost:45678/clientreadtimeout" + url)
                .connectTimeout(connectTimeout)
                .socketTimeout(readTimeout)
                .execute()
                .returnContent()
                .asString();
    }


    @GetMapping("client")
    public String client() throws IOException {
        log.info("client1 called");
        // The server endpoint takes 5 seconds to execute, while the client read timeout is 2 seconds
        return getResponse("/server?timeout=5000", 1000, 2000);
    }

    @GetMapping("server")
    public void server(@RequestParam("timeout") int timeout) throws InterruptedException {
        log.info("server called");
        TimeUnit.MILLISECONDS.sleep(timeout);
        log.info("Done");
    }

}

After calling the client interface, we can see from the logs that the client hits a SocketTimeoutException (read timed out) after 2 seconds, while the server keeps executing and finishes 3 seconds later, 5 seconds after it started.

[11:35:11.943] [http-nio-45678-exec-1] [INFO ] [.t.c.c.d.ClientReadTimeoutController:29  ] - client1 called
[11:35:12.032] [http-nio-45678-exec-2] [INFO ] [.t.c.c.d.ClientReadTimeoutController:36  ] - server called
[11:35:14.042] [http-nio-45678-exec-1] [ERROR] [.a.c.c.C.[.[.[/].[dispatcherServlet]:175 ] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception
java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  ...

[11:35:17.036] [http-nio-45678-exec-2] [INFO ] [.t.c.c.d.ClientReadTimeoutController:38  ] - Done

We know that web servers like Tomcat hand incoming requests to a thread pool for processing. Once the server has received the request, a client-side timeout or disconnection does not affect the server’s execution. Therefore, when you hit a read timeout, you cannot assume anything about how far the server’s processing got; you have to decide how to proceed based on the business context.

  2. Misconception: Assuming that a read timeout is purely a socket-level network concept that only bounds the time data spends in transit, and therefore setting it to a very short duration, such as 100 milliseconds.

In fact, when a read timeout occurs, the network layer cannot distinguish whether the server has not returned the data to the client or the data is delayed or lost in the network.

However, because TCP only transmits data after the connection is established, as long as network conditions are not particularly bad, a connection timeout usually indicates a network problem or an offline service, while a read timeout usually indicates that the service is slow to process the request. Specifically, the read timeout bounds how long the client waits for data to come back on the socket after the request has been written, and that time is dominated by the server’s business-logic processing.

  3. Misconception: Assuming that the longer the timeout, the higher the success rate of the API call, and therefore setting the read timeout too long.

An HTTP request usually needs its result and is therefore a synchronous call. If the read timeout is long, the client thread (typically a Tomcat worker thread) stays blocked waiting for the server to return data. When many downstream calls time out, the application ends up holding a large number of blocked threads and may eventually exhaust its resources and crash.

For scheduled or asynchronous tasks, setting a longer read timeout is not a problem. But for requests a user is waiting on, or for high-concurrency synchronous API calls between microservices, a relatively short read timeout should be set so that you are not dragged down by slow downstream services. Generally, a read timeout longer than 30 seconds is not recommended.

You may argue that if the read timeout is set to 2 seconds and the server interface takes 3 seconds, you can never obtain the result. That is true, so the read timeout must be set according to the actual situation: too long and you are dragged down by jitter in the downstream service, too short and you hurt the success rate. Sometimes we even need different client read timeouts for different server interfaces, based on the SLA (Service Level Agreement) of each downstream service.
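For example, one way to manage per-service read timeouts (a sketch only, with made-up service names and values, and requiring Java 9+ for Map.of) is to keep the SLA-derived timeouts in one place and reuse the fluent-API helper style shown earlier:

import java.util.Map;
import org.apache.http.client.fluent.Request;

public class SlaAwareTimeouts {

    // Hypothetical read timeouts (ms) derived from each downstream service's SLA
    private static final Map<String, Integer> READ_TIMEOUTS = Map.of(
            "user-service", 1000,      // fast internal lookup
            "report-service", 10000);  // known to be slow

    public static String call(String service, String url) throws Exception {
        int readTimeout = READ_TIMEOUTS.getOrDefault(service, 3000);
        return Request.Get(url)
                .connectTimeout(1000)        // connect timeout stays short everywhere
                .socketTimeout(readTimeout)  // read timeout follows the downstream SLA
                .execute().returnContent().asString();
    }
}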

How to configure timeouts with Feign and Ribbon? #

Earlier, I emphasized the importance of configuring connection and read timeouts according to your needs. Have you ever tried configuring timeout parameters for Spring Cloud’s Feign and got confused by various online resources?

In my opinion, the complexity of configuring timeout parameters for Feign lies in the fact that Feign itself has two timeout parameters, and the load balancing component Ribbon also has related configurations. So, what is the priority of these configurations and what are the pitfalls? Let’s do some experiments.

To test the timeout behavior, let’s define a server endpoint that does nothing but sleep for 10 minutes:

@PostMapping("/server")
public void server() throws InterruptedException {
    TimeUnit.MINUTES.sleep(10);
}

Firstly, define a Feign client to call this endpoint:

@FeignClient(name = "clientsdk")
public interface Client {
    @PostMapping("/feignandribbon/server")
    void server();
}

Then, make the API call using the Feign client:

@GetMapping("client")
public void timeout() {
    long begin = System.currentTimeMillis();
    try {
        client.server();
    } catch (Exception ex) {
        log.warn("Execution time: {}ms, Error: {}", System.currentTimeMillis() - begin, ex.getMessage());
    }
}

With the server address specified in the configuration file:

clientsdk.ribbon.listOfServers=localhost:45678

You get the following output:

[15:40:16.094] [http-nio-45678-exec-3] [WARN ] [o.g.t.c.h.f.FeignAndRibbonController    :26  ] - Execution time: 1007ms, Error: Read timed out executing POST http://clientsdk/feignandribbon/server

From this output we get the first conclusion: the default read timeout for Feign is 1 second, which is quite short and is itself a pitfall.

Let’s analyze the source code. When we open the RibbonClientConfiguration class, we can see that DefaultClientConfigImpl is created, and ReadTimeout and ConnectTimeout are set to 1 second:

public static final int DEFAULT_CONNECT_TIMEOUT = 1000;
public static final int DEFAULT_READ_TIMEOUT = 1000;

@Bean
@ConditionalOnMissingBean
public IClientConfig ribbonClientConfig() {
    DefaultClientConfigImpl config = new DefaultClientConfigImpl();
    config.loadProperties(this.name);
    config.set(CommonClientConfigKey.ConnectTimeout, DEFAULT_CONNECT_TIMEOUT);
    config.set(CommonClientConfigKey.ReadTimeout, DEFAULT_READ_TIMEOUT);
    config.set(CommonClientConfigKey.GZipPayload, DEFAULT_GZIP_PAYLOAD);
    return config;
}

If you want to modify the default global timeout for the Feign client, you can set the feign.client.config.default.readTimeout and feign.client.config.default.connectTimeout parameters:

feign.client.config.default.readTimeout=3000
feign.client.config.default.connectTimeout=3000

After modifying the configuration and retrying, you get the following log:

[15:43:39.955] [http-nio-45678-exec-3] [WARN ] [o.g.t.c.h.f.FeignAndRibbonController    :26  ] - Execution time: 3006ms, Error: Read timed out executing POST http://clientsdk/feignandribbon/server

As you can see, the 3-second read timeout takes effect. Note: here’s a big pitfall - if you only intend to modify the read timeout, you might try configuring only this line:

feign.client.config.default.readTimeout=3000

But if you test it, you will find that this configuration does not take effect!

The second conclusion, which is also a pitfall, is that in order to configure Feign’s read timeout, you must also configure the connect timeout for it to take effect.

By examining the FeignClientFactoryBean class, we can see that Request.Options will only be overridden if both ConnectTimeout and ReadTimeout are set:

if (config.getConnectTimeout() != null && config.getReadTimeout() != null) {
   builder.options(new Request.Options(config.getConnectTimeout(), config.getReadTimeout()));
}
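Because of this check, another way to make sure both values are always set together is to declare a feign.Request.Options bean in a Feign configuration class. This is only a sketch, assuming a Spring Cloud OpenFeign version from the Ribbon era covered in this article, where such a bean overrides the 1-second defaults:

@Configuration
public class FeignTimeoutConfiguration {

    @Bean
    public Request.Options feignOptions() {
        // connect timeout = 3000 ms, read timeout = 3000 ms, always set as a pair
        return new Request.Options(3000, 3000);
    }
}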

Furthermore, if you want to set timeouts specific to individual Feign clients, you can replace default with the client’s name:

feign.client.config.default.readTimeout=3000
feign.client.config.default.connectTimeout=3000
feign.client.config.clientsdk.readTimeout=2000
feign.client.config.clientsdk.connectTimeout=2000

The third conclusion is that individual timeouts can override global timeouts, which is expected and not considered a pitfall:

[15:45:51.708] [http-nio-45678-exec-3] [WARN ] [o.g.t.c.h.f.FeignAndRibbonController    :26  ] - Execution time: 2006ms, Error: Read timed out executing POST http://clientsdk/feignandribbon/server

The fourth conclusion is that besides configuring Feign, you can also configure the timeout parameters of the Ribbon component to modify the timeouts. The pitfall here is that the parameter names must start with uppercase letters, unlike Feign’s configuration.

ribbon.ReadTimeout=4000
ribbon.ConnectTimeout=4000

You can confirm that the parameters take effect through the log:

[15:55:18.019] [http-nio-45678-exec-3] [WARN ] [o.g.t.c.h.f.FeignAndRibbonController    :26  ] - Execution time: 4003ms, Error: Read timed out executing POST http://clientsdk/feignandribbon/server

Finally, let’s see which parameters take effect when configuring both Feign and Ribbon. Consider the following parameter configuration:

clientsdk.ribbon.listOfServers=localhost:45678
feign.client.config.default.readTimeout=3000
feign.client.config.default.connectTimeout=3000
ribbon.ReadTimeout=4000
ribbon.ConnectTimeout=4000

The log output confirms that the timeout specified by Feign takes effect:

[16:01:19.972] [http-nio-45678-exec-3] [WARN ] [o.g.t.c.h.f.FeignAndRibbonController    :26  ] - Execution time: 3006ms, Error: Read timed out executing POST http://clientsdk/feignandribbon/server

The fifth conclusion is that when configuring both Feign and Ribbon timeouts, Feign takes precedence. This goes against intuition because Ribbon is a lower-level component and you might expect its configuration to take effect, but that’s not the case.

In the LoadBalancerFeignClient source code, we can see that if Request.Options is not the default value, a FeignOptionsClientConfig is created to replace Ribbon’s DefaultClientConfigImpl, leading to Ribbon’s configuration getting overridden by Feign:

IClientConfig getClientConfig(Request.Options options, String clientName) {
   IClientConfig requestConfig;

   if (options == DEFAULT_OPTIONS) {
      requestConfig = this.clientFactory.getClientConfig(clientName);
   } else {
      requestConfig = new FeignOptionsClientConfig(options);
   }
   return requestConfig;
}

However, if you configure it this way, the timeout specified by Ribbon will take effect (4 seconds). This may make you think that Ribbon overrides Feign, but it is actually due to the second pitfall - configuring only Feign’s read timeout will not take effect:

clientsdk.ribbon.listOfServers=localhost:45678
feign.client.config.default.readTimeout=3000
feign.client.config.clientsdk.readTimeout=2000
ribbon.ReadTimeout=4000

Are you aware that Ribbon automatically retries requests? #

Some HTTP clients have built-in retry strategies, which sounds good in theory: packet loss from transient network problems is common, and a single retry often succeeds. However, we must verify whether this behavior actually matches our expectations.

I once encountered an issue where SMS messages were sent in duplicate. The caller of the SMS service insisted, after repeated checks, that there was no retry logic in its code. So where did the problem lie? Let’s recreate the scenario.

First, define an endpoint for sending SMS messages via a GET request. This endpoint contains no logic and simulates latency by sleeping for 2 seconds:

@RestController
@RequestMapping("ribbonretryissueserver")
@Slf4j
public class RibbonRetryIssueServerController {

    @GetMapping("sms")
    public void sendSmsWrong(@RequestParam("mobile") String mobile, @RequestParam("message") String message, HttpServletRequest request) throws InterruptedException {

        // Output the request parameters and sleep for 2 seconds
        log.info("{} is called, {}=>{}", request.getRequestURL().toString(), mobile, message);
        TimeUnit.SECONDS.sleep(2);
    }
}

Configure a Feign client for calling the server:

@FeignClient(name = "SmsClient")
public interface SmsClient {

    @GetMapping("/ribbonretryissueserver/sms")
    void sendSmsWrong(@RequestParam("mobile") String mobile, @RequestParam("message") String message);
}

Feign internally uses the Ribbon component for client-side load balancing. We can set the server list for Ribbon in the configuration file to include two nodes:

SmsClient.ribbon.listOfServers=localhost:45679,localhost:45678

Create a client interface that calls the server using Feign:

@RestController
@RequestMapping("ribbonretryissueclient")
@Slf4j
public class RibbonRetryIssueClientController {

    @Autowired
    private SmsClient smsClient;

    @GetMapping("wrong")
    public String wrong() {
        log.info("client is called");
        try {
            // Call the sendSmsWrong endpoint using Feign
            smsClient.sendSmsWrong("13600000000", UUID.randomUUID().toString());
        } catch (Exception ex) {
            // Catch any network errors that may occur
            log.error("send sms failed : {}", ex.getMessage());
        }
        return "done";
    }
}

Start two instances of the application, on ports 45678 and 45679. Then access the client endpoint on port 45678 to reproduce the scenario. Since the client and server controllers live in the same application, the instance on port 45678 plays the role of both client and server.

In the logs of the instance on port 45678, at second 29 the client receives the request and starts calling the server’s sendSmsWrong endpoint; at the same moment, the server receives the request. Two seconds later (compare the first and third log lines), the client reports a read timeout error:

[12:49:29.020] [http-nio-45678-exec-4] [INFO ] [c.d.RibbonRetryIssueClientController:23  ] - client is called
[12:49:29.026] [http-nio-45678-exec-5] [INFO ] [c.d.RibbonRetryIssueServerController:16  ] - http://localhost:45678/ribbonretryissueserver/sms is called, 13600000000=>a2aa1b32-a044-40e9-8950-7f0189582418
[12:49:31.029] [http-nio-45678-exec-4] [ERROR] [c.d.RibbonRetryIssueClientController:27  ] - send sms failed : Read timed out executing GET http://SmsClient/ribbonretryissueserver/sms?mobile=13600000000&message=a2aa1b32-a044-40e9-8950-7f0189582418

In the logs of the other instance on port 45679, one request arrives at second 30, one second after the client endpoint was called:

[12:49:30.029] [http-nio-45679-exec-2] [INFO ] [c.d.RibbonRetryIssueServerController:16  ] - http://localhost:45679/ribbonretryissueserver/sms is called, 13600000000=>a2aa1b32-a044-40e9-8950-7f0189582418

The client endpoint is logged only once, while the server endpoint is logged twice. Although the default read timeout for Feign is 1 second, the client reports the timeout error after 2 seconds: the first call timed out after 1 second, Ribbon then retried it on the other node, and that call timed out after another second. This clearly shows that the client retried on its own, which is why the SMS was sent twice.

By examining the Ribbon source code, we can see that the MaxAutoRetriesNextServer parameter defaults to 1. In other words, when a GET request runs into a problem on one server node (such as a read timeout), Ribbon automatically retries it once on the next node:

// DefaultClientConfigImpl
public static final int DEFAULT_MAX_AUTO_RETRIES_NEXT_SERVER = 1;
public static final int DEFAULT_MAX_AUTO_RETRIES = 0;

// RibbonLoadBalancedRetryPolicy
public boolean canRetry(LoadBalancedRetryContext context) {

   HttpMethod method = context.getRequest().getMethod();
   return HttpMethod.GET == method || lbContext.isOkToRetryOnAllOperations();
}

@Override
public boolean canRetrySameServer(LoadBalancedRetryContext context) {

   return sameServerCount < lbContext.getRetryHandler().getMaxRetriesOnSameServer()
         && canRetry(context);

}

@Override
public boolean canRetryNextServer(LoadBalancedRetryContext context) {

   // this will be called after a failure occurs and we increment the counter
   // so we check that the count is less than or equals to too make sure
   // we try the next server the right number of times
   return nextServerCount <= lbContext.getRetryHandler().getMaxRetriesOnNextServer()
         && canRetry(context);
}

There are two solutions:

Firstly, change the send-SMS endpoint from GET to POST. There is really an API-design issue here: an endpoint with side effects should not be exposed as a GET. According to the HTTP specification, GET is meant for data retrieval, while POST is for submitting data to the server for creation or modification. The choice between GET and POST should be based on the behavior of the API, not on parameter size. A common misconception is that because GET parameters live in the URL query string and are subject to browser length limits, developers use POST with a JSON body for large parameters and GET for small ones.
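As a sketch of the first solution (the SmsRequest body class and the sendSmsRight method name are hypothetical, not taken from the original project), the endpoint can be redefined as a POST that carries its parameters in the request body, which Ribbon does not retry by default:

// A hypothetical request body carrying the same fields as the GET version
public static class SmsRequest {
    public String mobile;
    public String message;
}

// Server side: POST instead of GET, so Ribbon will not retry it automatically
@PostMapping("sms")
public void sendSmsRight(@RequestBody SmsRequest request) throws InterruptedException {
    log.info("sms requested, {}=>{}", request.mobile, request.message);
    TimeUnit.SECONDS.sleep(2);
}

// Feign client side: the method signature changes accordingly
@PostMapping("/ribbonretryissueserver/sms")
void sendSmsRight(@RequestBody SmsRequest request);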

Secondly, configure the MaxAutoRetriesNextServer parameter to 0 to disable automatic retrying on the next server node when a service call fails. Simply add the following line to the configuration file:

ribbon.MaxAutoRetriesNextServer=0

After understanding all of this, do you think the problem lies with the user service or the SMS service?

In my opinion, both sides are at fault. As mentioned earlier, a GET request is expected to be safe and idempotent, and the SMS endpoint could also have been designed to support idempotent calls. And if the developers of the user service had been familiar with Ribbon’s retry mechanism, they might have located the problem much faster.
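If the SMS endpoint itself needs to tolerate retries, a common approach is to make it idempotent. The sketch below uses a hypothetical client-generated requestId as the deduplication key and a simple in-memory set; a production version would usually keep the keys in a shared store such as Redis with an expiry:

// In-memory deduplication set (java.util.concurrent.ConcurrentHashMap);
// a real system would use a shared store with a TTL instead
private static final Set<String> PROCESSED = ConcurrentHashMap.newKeySet();

@PostMapping("idempotentsms")
public void sendSmsIdempotent(@RequestParam("requestId") String requestId,
                              @RequestParam("mobile") String mobile,
                              @RequestParam("message") String message) {
    // add() returns false if this requestId was already handled, so duplicates are skipped
    if (!PROCESSED.add(requestId)) {
        log.info("duplicate request {} ignored", requestId);
        return;
    }
    // ... actually send the SMS here
}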

Concurrency Limits the Crawling Ability of Spiders #

In addition to the pitfalls of timeouts and retries, there is another common problem when making HTTP requests, which is that the limitation of concurrency hinders the processing capability of the program.

I once worked on a web-crawling project whose overall fetching efficiency was very low. Increasing the number of threads in the thread pool did not help, and the only way out at the time was to add machines and run the crawler in a distributed fashion. Let’s simulate the scenario to see where the problem lies.

Assuming that the server to be crawled is a simple implementation that sleeps for 1 second and returns the number 1:

@GetMapping("server")
public int server() throws InterruptedException {

    TimeUnit.SECONDS.sleep(1);
    return 1;
}

The spider needs to call this interface many times to fetch data. To make sure the thread pool is not the concurrency bottleneck, we use a newCachedThreadPool, which has no upper limit on the number of threads, as the pool for the crawling tasks (in general, avoid thread pools without a thread-count limit unless you are very clear about your needs). We then use HttpClient to issue the HTTP requests, submit the request tasks to the pool in a loop, wait for all of them to complete, and log the elapsed time:

private int sendRequest(int count, Supplier<CloseableHttpClient> client) throws InterruptedException {

    // Counter for the number of requests sent
    AtomicInteger atomicInteger = new AtomicInteger();

    // Use HttpClient to submit tasks that query data from the server interface to the thread pool for parallel processing
    ExecutorService threadPool = Executors.newCachedThreadPool();

    long begin = System.currentTimeMillis();
    IntStream.rangeClosed(1, count).forEach(i -> {
        threadPool.execute(() -> {
            try (CloseableHttpResponse response = client.get().execute(new HttpGet("http://127.0.0.1:45678/routelimit/server"))) {
                atomicInteger.addAndGet(Integer.parseInt(EntityUtils.toString(response.getEntity())));
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        });
    });

    // Wait until count tasks are completed
    threadPool.shutdown();
    threadPool.awaitTermination(1, TimeUnit.HOURS);
    log.info("Sent {} requests, took {} ms", atomicInteger.get(), System.currentTimeMillis() - begin);
    return atomicInteger.get();
}

First, using CloseableHttpClient constructed with the default PoolingHttpClientConnectionManager implementation, let’s test the time it takes to fetch data 10 times:

static CloseableHttpClient httpClient1;

static {
    httpClient1 = HttpClients.custom().setConnectionManager(new PoolingHttpClientConnectionManager()).build();

}

@GetMapping("wrong")
public int wrong(@RequestParam(value = "count", defaultValue = "10") int count) throws InterruptedException {
    return sendRequest(count, () -> httpClient1);
}

Each request takes 1 second to complete, and our thread pool can expand to any number of threads, so in theory 10 concurrent requests should finish in roughly the same time as a single request, about 1 second. However, the log shows that they actually take 5 seconds:

[12:48:48.122] [http-nio-45678-exec-1] [INFO ] [o.g.t.c.h.r.RouteLimitController        :54  ] - Sent 10 requests, took 5265 ms

Looking at the source code of the PoolingHttpClientConnectionManager, we can see two important parameters:

defaultMaxPerRoute=2, which means the maximum concurrent number of requests to the same host/domain is 2. Our spider needs 10 concurrent requests, so obviously the default value restricts the efficiency of the spider.

maxTotal=20, which means the overall maximum concurrency across all hosts is 20; this is the overall concurrency limit of the HttpClient instance. Here we issue 10 requests against a total limit of 20, so maxTotal is not the bottleneck. But if, for example, you used the same HttpClient to access 10 domains and set defaultMaxPerRoute to 10 so that each domain could reach 10 concurrent requests, you would need to set maxTotal to 100.

public PoolingHttpClientConnectionManager(
        final HttpClientConnectionOperator httpClientConnectionOperator,
        final HttpConnectionFactory<HttpRoute, ManagedHttpClientConnection> connFactory,
        final long timeToLive, final TimeUnit timeUnit) {
    ...
    this.pool = new CPool(new InternalConnectionFactory(
            this.configData, connFactory), 2, 20, timeToLive, timeUnit);
    ...
}

public CPool(
        final ConnFactory<HttpRoute, ManagedHttpClientConnection> connFactory,
        final int defaultMaxPerRoute, final int maxTotal,
        final long timeToLive, final TimeUnit timeUnit) {
    ...
}

HttpClient is a very commonly used HTTP client in Java, and this problem often occurs. You might wonder why the default values are so small.

In fact, this cannot be blamed entirely on HttpClient. Many early browsers also limited concurrent requests to the same domain to two. This per-domain connection limit is actually a recommendation in the HTTP 1.1 specification, which says:

“Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.”

The HTTP 1.1 specification is now about 20 years old, and HTTP servers have become far more capable, so newer browsers no longer stick to the limit of 2 concurrent connections and have raised it to 8 or even more. If you need to issue a large number of concurrent requests through an HTTP client, whichever client you use, you must confirm whether its default concurrency limits meet your requirements.
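If you are not sure what limits your HttpClient instance actually has, you can construct the connection manager yourself and inspect them before handing it to the client; a quick sketch:

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PoolLimitCheck {

    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // With HttpClient 4.x defaults this prints defaultMaxPerRoute=2, maxTotal=20
        System.out.println("defaultMaxPerRoute=" + cm.getDefaultMaxPerRoute()
                + ", maxTotal=" + cm.getMaxTotal());
    }
}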

Now that we know where the problem lies, let’s create a new HttpClient with relaxed limits, setting maxConnPerRoute to 10 and maxConnTotal to 20 (enough for our 10 concurrent requests), and rerun the test with this new client:

httpClient2 = HttpClients.custom().setMaxConnPerRoute(10).setMaxConnTotal(20).build();

The output shows that the 10 requests now complete in about 1 second. Because we relaxed the default limit of 2 concurrent requests per host, the spider’s efficiency improves dramatically:

[12:58:11.333] [http-nio-45678-exec-3] [INFO ] [o.g.t.c.h.r.RouteLimitController        :54  ] - Sent 10 requests, took 1023 ms

Key Takeaways #

Today, I shared with you the most commonly encountered issues with timeouts, retries, and concurrency when making HTTP calls.

A connection timeout represents the time it takes to establish a TCP connection, while a read timeout represents the time it takes to wait for the remote server to return data, including the time it takes for the remote server to process the request. When dealing with connection timeouts, we need to identify who we are connecting to. When faced with read timeouts, we need to consider both the downstream service’s performance requirements and our own service’s performance requirements, and set an appropriate read timeout. Additionally, when using frameworks like Spring Cloud Feign, it is important to verify that the configuration of connection and read timeout parameters is effective.

Regarding retries, since the HTTP protocol considers GET requests as data retrieval operations and assumes they are stateless, and because network packet loss is a common occurrence, some HTTP clients or proxy servers automatically retry GET/HEAD requests. If your API does not support idempotence, you may need to disable automatic retries. However, a better solution is to follow the recommendations of the HTTP protocol and use appropriate HTTP methods.

Finally, we saw that HTTP clients, including HttpClient and web browsers, limit the maximum concurrency of client calls. If your client generates a high volume of concurrent requests, such as in web crawling or acting as a proxy server, or if your program itself has high concurrency, the default limit can easily become a bottleneck for throughput and should be adjusted promptly.

The code used today is available on GitHub. You can click on this link to view it.

Reflection and Discussion #

In the first section, we emphasized the importance of configuring the parameters for connection timeout and read timeout. Most HTTP clients also provide these two parameters. However, why do we rarely see the concept of “write timeout”?

In addition to Ribbon’s MaxAutoRetriesNextServer retry mechanism, Nginx also has similar retry functionality. Are you familiar with the relevant Nginx configuration?

Have you encountered any pitfalls regarding HTTP invocations? I am Zhuyi, and I welcome you to leave a comment in the comment section to share your thoughts. Feel free to share this article with your friends or colleagues for further discussion.