24 Business Code Completion Doesn’t Mean Production Is Ready #

Today, let’s discuss whether completing the business code means being production-ready and ready for deployment.

The term “production-ready” refers to the additional development work an application needs before it can be deployed to a production environment. In my opinion, if an application is deployed the moment its functional code is complete, it is essentially running bare: when problems occur, there is no effective monitoring to help troubleshoot and locate them, and we may not even become aware of some problems until users report them.

So, what work needs to be done to be production-ready? I believe the following three aspects are the most important.

First, provide health check interfaces. The traditional approach of “pinging” an application to detect liveness is not accurate: an application’s key internal or external dependencies may have gone offline, leaving it unable to work properly, even though its web or management port still responds to pings. We should provide a dedicated health check interface and have it probe the application’s key internal components as far as possible.

Second, expose internal application information. Internal components such as thread pools and memory queues often play important roles within an application. If the application or its framework exposes this information externally for monitoring, it becomes possible to spot the signs of major issues such as OOM before they turn into more serious problems.

Third, establish application metric monitoring. Metrics are important pieces of information aggregated periodically in numerical form, from which various trend charts can be drawn. Metric monitoring covers two aspects: first, metrics for important internal components of the application, such as JVM metrics and interface QPS; second, the application’s business data, such as e-commerce order volume or the number of online players in a game.

Today, I will discuss how to quickly implement these three aspects through practical examples.

Preparation: Configuring Spring Boot Actuator #

Spring Boot has an Actuator module that encapsulates production-ready features such as health checks, application information, and metrics. The content in the rest of today’s lesson relies on Actuator, so we need to first introduce and configure Actuator.

We can introduce Actuator by adding the following dependency in the pom.xml file:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

After that, you can directly use Actuator, but there are a few important configurations to note:

If you don’t want the Actuator management port to be the same as the application port, you can use management.server.port to set a separate port.

Actuator comes with many built-in endpoints that provide information out-of-the-box. These endpoints can be exposed through JMX or web. Considering that some information may be sensitive, these endpoints are not fully open by default. You can check the default values on the official website. To facilitate the subsequent demo, we will set all endpoints to be exposed through the web.

By default, the root address for web access to Actuator is /actuator, but you can modify it using the management.endpoints.web.base-path parameter. Let me demonstrate how to change it to /admin:

management.server.port=45679
management.endpoints.web.exposure.include=*
management.endpoints.web.base-path=/admin

Now, you can access http://localhost:45679/admin to view all the functionality URLs provided by Actuator:

img

Most of the endpoints provide read-only information, such as querying Spring beans, ConfigurableEnvironment, scheduled tasks, Spring Boot auto-configuration, Spring MVC mappings, etc. A few endpoints also provide modification functionality, such as graceful shutdown of the application, downloading thread dumps, downloading heap dumps, changing log levels, etc.
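For example, the loggers endpoint supports changing a logger’s level at runtime. A minimal sketch, assuming the management port and base path configured above (the logger name here is just an example):

curl -X POST 'http://localhost:45679/admin/loggers/org.springframework.web' \
     -H 'Content-Type: application/json' \
     -d '{"configuredLevel": "DEBUG"}'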

You can access this link to view the functionality of all these endpoints and learn more about the information they provide and the operations they perform. In addition, I’d like to share a great Spring Boot management tool called Spring Boot Admin, which packages most of the functionality provided by Actuator endpoints into a web UI.

Health checks require access to key components #

Earlier, I mentioned that health checks allow monitoring systems and deployment tools to learn an application’s true health status, which is more reliable than pinging the application’s port. The key to achieving this is making sure the health check interface actually probes the status of key components.

Fortunately, Spring Boot Actuator provides us with pre-implemented health indicators for third-party systems such as databases, InfluxDB, Elasticsearch, Redis, and RabbitMQ.

Through Spring Boot’s auto-configuration, these indicators will be automatically activated. When these components have problems, the HealthIndicator will return a DOWN or OUT_OF_SERVICE status, and the health endpoint’s HTTP response status code will also become 503. We can use this information to configure program health status monitoring and alerts.

For demonstration purposes, we can modify the configuration file and set the management.endpoint.health.show-details parameter to always, allowing all users to directly view the health status of each component (if configured as when-authorized, it can be combined with management.endpoint.health.roles to configure authorized roles):

management.endpoint.health.show-details=always

Accessing the health endpoint will show that the health status of components such as databases, disks, RabbitMQ, and Redis is UP, and the overall application status is also UP:

img

After understanding the basic configuration, let’s consider a scenario where the program depends on a critical third-party service, and we want the application’s health status to be DOWN when this service is inaccessible.

For example, suppose the third-party service exposes a user query endpoint that throws an exception 50% of the time:

@Slf4j
@RestController
@RequestMapping("user")
public class UserServiceController {
    @GetMapping
    public User getUser(@RequestParam("userId") long id) {
        // 50% chance to return a correct response, 50% chance to throw an exception
        if (ThreadLocalRandom.current().nextInt() % 2 == 0) {
            return new User(id, "name" + id);
        } else {
            throw new RuntimeException("error");
        }
    }
}

Associating the correctness of this user endpoint with the application’s overall health status is simple: we only need to define a UserServiceHealthIndicator that implements the HealthIndicator interface.

In the health method, we use RestTemplate to access the user interface. If the result is correct, we return Health.up() and add the invocation execution time and result as additional information to the Health object. If an exception occurs during the call to the interface, we return Health.down() and add the exception information as additional information to the Health object:

@Component
@Slf4j
public class UserServiceHealthIndicator implements HealthIndicator {
    @Autowired
    private RestTemplate restTemplate;

    @Override
    public Health health() {
        long begin = System.currentTimeMillis();
        long userId = 1L;
        User user = null;
        
        try {
            // Access remote interface
            user = restTemplate.getForObject("http://localhost:45678/user?userId=" + userId, User.class);
            
            if (user != null && user.getUserId() == userId) {
                // Result is correct, return UP status, and provide additional information such as execution time and user information
                return Health.up()
                        .withDetail("user", user)
                        .withDetail("took", System.currentTimeMillis() - begin)
                        .build();
            } else {
                // Result is incorrect, return DOWN status, and provide additional information such as execution time
                return Health.down().withDetail("took", System.currentTimeMillis() - begin).build();
            }
        } catch (Exception ex) {
            // Exception occurred, log the exception and return DOWN status, providing additional information such as exception information and execution time
            log.warn("health check failed!", ex);
            return Health.down(ex).withDetail("took", System.currentTimeMillis() - begin).build();
        }
    }
}
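One note: the indicator autowires a RestTemplate. If the project does not already define such a bean, a minimal sketch like the following (the configuration class name is purely illustrative) is needed for the example to start:

@Configuration
public class RestTemplateConfig {
    @Bean
    public RestTemplate restTemplate() {
        // a plain RestTemplate is enough for the health probe above
        return new RestTemplate();
    }
}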

Now let’s look at an example of aggregating multiple HealthIndicators. We define a CompositeHealthContributor to aggregate multiple HealthContributors and monitor a group of thread pools.

First, in the ThreadPoolProvider class, we define two thread pools: demoThreadPool, with a single worker thread and an ArrayBlockingQueue of length 10; and ioThreadPool, which simulates an IO thread pool with 10 core threads and a maximum of 50 threads:

public class ThreadPoolProvider {
    // Thread pool with one worker thread and a queue length of 10
    private static ThreadPoolExecutor demoThreadPool = new ThreadPoolExecutor(
            1, 1,
            2, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(10),
            new ThreadFactoryBuilder().setNameFormat("demo-threadpool-%d").get());

    // Thread pool with 10 core threads, a maximum of 50 threads, and a queue length of 100
    private static ThreadPoolExecutor ioThreadPool = new ThreadPoolExecutor(
            10, 50,
            2, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadFactoryBuilder().setNameFormat("io-threadpool-%d").get());

    public static ThreadPoolExecutor getDemoThreadPool() {
        return demoThreadPool;
    }

    public static ThreadPoolExecutor getIOThreadPool() {
        return ioThreadPool;
    }
}

Next, we define an interface to submit time-consuming tasks to the demoThreadPool thread pool to simulate a situation where the thread pool queue is full:

@GetMapping("slowTask")

public void slowTask() {

    ThreadPoolProvider.getDemoThreadPool().execute(() -> {

        try {

            TimeUnit.HOURS.sleep(1);

        } catch (InterruptedException e) {

        }

    });

}
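To actually see the DOWN state later, the queue needs to fill up. A rough way to do that is to fire a dozen requests, since the pool has one worker plus a 10-slot queue; note that the path below is only an assumption and depends on where slowTask is mapped in your project:

for i in $(seq 1 12); do curl 'http://localhost:45678/slowTask'; done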

With these preparations done, let’s implement the custom HealthIndicator class for the status of a single thread pool.

We pass in a ThreadPoolExecutor and determine the health of this component from the remaining capacity of its queue: if there is remaining capacity, return UP; otherwise, return DOWN. We also attach two important pieces of data about the queue, the current number of elements and the remaining capacity, as supplementary information to the Health object:

public class ThreadPoolHealthIndicator implements HealthIndicator {
    private ThreadPoolExecutor threadPool;

    public ThreadPoolHealthIndicator(ThreadPoolExecutor threadPool) {
        this.threadPool = threadPool;
    }

    @Override
    public Health health() {
        // Supplementary information
        Map<String, Integer> detail = new HashMap<>();
        // Current number of elements in the queue
        detail.put("queue_size", threadPool.getQueue().size());
        // Remaining capacity of the queue
        detail.put("queue_remaining", threadPool.getQueue().remainingCapacity());
        // If there is remaining capacity, return UP; otherwise, return DOWN
        if (threadPool.getQueue().remainingCapacity() > 0) {
            return Health.up().withDetails(detail).build();
        } else {
            return Health.down().withDetails(detail).build();
        }
    }
}

Next, let’s define a CompositeHealthContributor to aggregate two instances of ThreadPoolHealthIndicator, corresponding to the two thread pools defined in ThreadPoolProvider:

@Component
public class ThreadPoolsHealthContributor implements CompositeHealthContributor {
    // Holds all child HealthContributors
    private Map<String, HealthContributor> contributors = new HashMap<>();

    ThreadPoolsHealthContributor() {
        // Corresponding to the two thread pools defined in ThreadPoolProvider
        this.contributors.put("demoThreadPool", new ThreadPoolHealthIndicator(ThreadPoolProvider.getDemoThreadPool()));
        this.contributors.put("ioThreadPool", new ThreadPoolHealthIndicator(ThreadPoolProvider.getIOThreadPool()));
    }

    @Override
    public HealthContributor getContributor(String name) {
        // Look up a specific HealthContributor by name
        return contributors.get(name);
    }

    @Override
    public Iterator<NamedContributor<HealthContributor>> iterator() {
        // Return an iterator of NamedContributor; a NamedContributor is a Contributor instance with a name
        return contributors.entrySet().stream()
                .map((entry) -> NamedContributor.of(entry.getKey(), entry.getValue())).iterator();
    }
}

After the program is started, we can see that the health endpoint displays the health status of the thread pools and the external service userService, as well as some specific information:

img

We can see that when demoThreadPool is DOWN, the threadPools contributor goes DOWN, which in turn makes the entire application’s status DOWN:

img

That’s it! By customizing HealthContributor and CompositeHealthContributor, we can monitor and detect key components such as third-party services and thread pools within the application. It’s quite convenient, isn’t it?

As an additional note, starting from Spring Boot 2.3.0, health checks have been enhanced, and the Liveness and Readiness endpoints have been refined, making it easier to integrate Spring Boot applications with Kubernetes.
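If you want to try those probes, a minimal configuration sketch might look like the following (property names as documented for Spring Boot 2.3+; adding db to the readiness group is just an example):

management.endpoint.health.probes.enabled=true
management.endpoint.health.group.readiness.include=readinessState,db

The probes are then served under the health endpoint as health/liveness and health/readiness (that is, under /admin/health with the base-path configured earlier).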

Exposing the Internal State of Important Components #

In addition to using the state of the thread pool as an indicator of the overall health of the application, we can also expose the status data of important components within the application through the InfoContributor feature of Actuator. Here, I will demonstrate how to view the status data using the info HTTP endpoint and JMX MBean with an example.

Let’s take a specific case where we implement a ThreadPoolInfoContributor to display information about the thread pool.

@Component
public class ThreadPoolInfoContributor implements InfoContributor {

    private static Map<String, Object> threadPoolInfo(ThreadPoolExecutor threadPool) {
        Map<String, Object> info = new HashMap<>();
        info.put("poolSize", threadPool.getPoolSize());
        info.put("corePoolSize", threadPool.getCorePoolSize());
        info.put("largestPoolSize", threadPool.getLargestPoolSize());
        info.put("maximumPoolSize", threadPool.getMaximumPoolSize());
        info.put("completedTaskCount", threadPool.getCompletedTaskCount());
        return info;
    }

    @Override
    public void contribute(Info.Builder builder) {
        builder.withDetail("demoThreadPool", threadPoolInfo(ThreadPoolProvider.getDemoThreadPool()));
        builder.withDetail("ioThreadPool", threadPoolInfo(ThreadPoolProvider.getIOThreadPool()));
    }
}

By accessing the /admin/info endpoint, you can see this data:

img

Additionally, if JMX is enabled:

spring.jmx.enabled=true

You can then use the jconsole tool to locate the Info MBean under org.springframework.boot -> Endpoint. Executing its info operation shows the information about the two thread pools we just customized:

img

One more thing: besides viewing and operating MBeans with jconsole, you can use Jolokia to expose JMX over HTTP. To do this, add the following dependency:

<dependency>
    <groupId>org.jolokia</groupId>
    <artifactId>jolokia-core</artifactId>
</dependency>

With Jolokia, you can execute the info operation of the org.springframework.boot:type=Endpoint,name=Info MBean:

img
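For reference, a Jolokia exec request for that operation would look roughly like this, assuming the management port and base path configured earlier:

curl 'http://localhost:45679/admin/jolokia/exec/org.springframework.boot:type=Endpoint,name=Info/info'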

Metrics are the “golden key” for quickly identifying problems #

Metrics are sets of time-associated quantitative values that measure some dimension of a system. By collecting metrics and displaying them as line charts, pie charts, and other visualizations, we can quickly locate and analyze problems.

Let’s take an actual case to see how we can quickly identify problems through charts.

Consider the ordering and delivery flow of a takeout order, as shown in the diagram below. OrderController handles order placement: it first validates the parameters, and if they are correct, calls another service to check the merchant’s status. If the merchant is open for business, the order proceeds. After the order succeeds, a message is sent to RabbitMQ to trigger the asynchronous delivery process, and a DeliverOrderHandler listens for this message and performs the delivery.

img

For a business process involving synchronous calls and asynchronous calls, how can we quickly determine which link has a problem if the user reports a failed order?

At this time, the metrics system can be useful. We can establish some metrics to monitor the two important operations of order placement and delivery.

For the order placement operation, we can establish 4 metrics:

  • Total Order Quantity, which monitors the current total number of orders in the system;
  • Order Request, which increments by 1 for each order request received before processing;
  • Successful Orders, which increments by 1 for each successful order completion;
  • Failed Orders, which increments by 1 for each order operation that encounters an exception, and attaches the exception reason to the metric.

Similarly, we can establish 4 metrics for the delivery operation. We can use the Micrometer framework to collect metrics. Micrometer is the metrics facade used by Spring Boot Actuator. It abstracts over various metric types; the most commonly used are:

  • Gauge (red), which reflects a metric’s current value. The total order quantity metric in this example, the number of online players in a game, or the current number of JVM threads are all gauges;
  • Counter (green), which is incremented on each call and accumulates, such as the order request metric in this example. If we call a method 10 times within 5 seconds and Micrometer ships metrics to the backend store every 5 seconds, the reported value will be 10;
  • Timer (blue), similar to a counter, but in addition to recording the count it also records the elapsed time, such as the successful-order and failed-order metrics in this example.

All metrics can be accompanied by tags as supplemental data. For example, when an operation fails, a reason tag will be attached to the metric.
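Before looking at the full business code, here is a minimal sketch of how these three metric types are created with Micrometer’s static Metrics API (metric names reuse the ones from this example):

public class MetricsSketch {
    public void demo() {
        // Gauge: bind the metric to a number reference; Micrometer samples its current value periodically
        AtomicLong totalSuccess = Metrics.gauge("createOrder.totalSuccess", new AtomicLong());
        totalSuccess.incrementAndGet();

        // Counter: increment on every call; the backend receives the accumulated count per reporting step
        Metrics.counter("createOrder.received").increment();

        // Timer: records both the number of events and how long each one took; tags add extra dimensions
        Metrics.timer("createOrder.failed", "reason", "invalid user")
                .record(Duration.ofMillis(200));
    }
}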

In addition to abstracting metric types, Micrometer also abstracts storage. You can think of Micrometer as the SLF4J of metrics: SLF4J abstracts logging, while Micrometer abstracts metrics. Micrometer provides various registries that connect seamlessly to different monitoring systems and time-series databases.

In this case, we add the micrometer-registry-influx dependency, which pulls in the Micrometer core and binds it to InfluxDB (a time-series database specialized in storing metric data), so that our metrics are written to InfluxDB:

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-influx</artifactId>
</dependency>

Then, modify the configuration file to enable exporting metrics to InfluxDB, configure the InfluxDB address, and set the client-side aggregation step so metrics are sent to InfluxDB every second:

management.metrics.export.influx.enabled=true
management.metrics.export.influx.uri=http://localhost:8086
management.metrics.export.influx.step=1S

Next, we add relevant code to record metrics in the business logic.

Below is the implementation of the OrderController. The code contains detailed comments, which I will not explain one by one. Pay attention to how the Micrometer framework is used to implement the four metrics — total order quantity, order requests, successful orders, and failed orders — via the gauge registration in init() and the counter/timer calls in createOrder():

@Slf4j
@RestController
@RequestMapping("order")
public class OrderController {

    // Total number of successfully created orders
    private AtomicLong createOrderCounter = new AtomicLong();

    @Autowired
    private RabbitTemplate rabbitTemplate;

    @Autowired
    private RestTemplate restTemplate;

    @PostConstruct
    public void init() {
        // Register the createOrder.totalSuccess gauge metric; to initialize a gauge, simply associate it with an AtomicLong reference
        Metrics.gauge("createOrder.totalSuccess", createOrderCounter);
    }

    // Order interface, with user ID and merchant ID as input parameters
    @GetMapping("createOrder")
    public void createOrder(@RequestParam("userId") long userId, @RequestParam("merchantId") long merchantId) {
        // Record the createOrder.received counter metric, indicating the receipt of an order request
        Metrics.counter("createOrder.received").increment();
        Instant begin = Instant.now();
        try {
            TimeUnit.MILLISECONDS.sleep(200);

            // Simulate an invalid user: an ID less than 10 is considered invalid
            if (userId < 10) {
                throw new RuntimeException("invalid user");
            }

            // Query the merchant service
            Boolean merchantStatus = restTemplate.getForObject("http://localhost:45678/order/getMerchantStatus?merchantId=" + merchantId, Boolean.class);
            if (merchantStatus == null || !merchantStatus)
                throw new RuntimeException("closed merchant");

            Order order = new Order();
            order.setId(createOrderCounter.incrementAndGet()); // The gauge metric picks up the new value automatically
            order.setUserId(userId);
            order.setMerchantId(merchantId);

            // Send the MQ message
            rabbitTemplate.convertAndSend(Consts.EXCHANGE, Consts.ROUTING_KEY, order);

            // Record the createOrder.success timer metric, representing a successful order along with its duration
            Metrics.timer("createOrder.success").record(Duration.between(begin, Instant.now()));
        } catch (Exception ex) {
            log.error("createOrder userId {} failed", userId, ex);
            // Record the createOrder.failed timer metric, representing a failed order along with its duration, with the failure reason as a tag
            Metrics.timer("createOrder.failed", "reason", ex.getMessage()).record(Duration.between(begin, Instant.now()));
        }
    }

    // Merchant query interface
    @GetMapping("getMerchantStatus")
    public boolean getMerchantStatus(@RequestParam("merchantId") long merchantId) throws InterruptedException {
        // Only merchant ID 2 is open for business
        TimeUnit.MILLISECONDS.sleep(200);
        return merchantId == 2;
    }
}

When the user ID is less than 10, we simulate the case where the user data is invalid, and when the merchant ID is not 2, we simulate the case where the merchant is not open.

Next is the implementation of the delivery service in the DeliverOrderHandler.

The deliverOrder method listens for the MQ messages sent by OrderController and simulates delivery. As shown in the code below, the gauge registered in init() and the counter/timer calls in deliverOrder() record the four delivery-related metrics:

// Delivery service message handler
@RestController
@Slf4j
@RequestMapping("deliver")
public class DeliverOrderHandler {

    // Delivery service running status
    private volatile boolean deliverStatus = true;

    private AtomicLong deliverCounter = new AtomicLong();

    // Change the delivery status through an external interface to simulate a delivery service outage
    @PostMapping("status")
    public void status(@RequestParam("status") boolean status) {
        deliverStatus = status;
    }

    @PostConstruct
    public void init() {
        // Register the deliverOrder.totalSuccess gauge metric, representing the total number of deliveries; it only needs to be registered once
        Metrics.gauge("deliverOrder.totalSuccess", deliverCounter);
    }

    // Listen for MQ messages
    @RabbitListener(queues = Consts.QUEUE_NAME)
    public void deliverOrder(Order order) {
        Instant begin = Instant.now();
        // Increment the deliverOrder.received counter metric, indicating that an order message has been received
        Metrics.counter("deliverOrder.received").increment();
        try {
            if (!deliverStatus)
                throw new RuntimeException("deliver outofservice");
            TimeUnit.MILLISECONDS.sleep(500);
            deliverCounter.incrementAndGet();
            // Record the deliverOrder.success timer metric, representing a successful delivery
            Metrics.timer("deliverOrder.success").record(Duration.between(begin, Instant.now()));
        } catch (Exception ex) {
            log.error("deliver Order {} failed", order, ex);
            // Record the deliverOrder.failed timer metric, attaching the failure reason as a tag
            Metrics.timer("deliverOrder.failed", "reason", ex.getMessage()).record(Duration.between(begin, Instant.now()));
        }
    }
}
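The Consts class and the RabbitMQ declarations it refers to are not shown here; a hypothetical sketch of the exchange, queue, and binding it implies could look like this:

@Configuration
public class RabbitConfiguration {
    @Bean
    public Queue queue() {
        return new Queue(Consts.QUEUE_NAME);
    }

    @Bean
    public DirectExchange exchange() {
        return new DirectExchange(Consts.EXCHANGE);
    }

    @Bean
    public Binding binding(Queue queue, DirectExchange exchange) {
        // route messages published with ROUTING_KEY on EXCHANGE into QUEUE_NAME
        return BindingBuilder.bind(queue).to(exchange).with(Consts.ROUTING_KEY);
    }
}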

Meanwhile, we simulated a switch for the overall status of the delivery service, and the status interface can be called to modify its status. So far, we have completed the scene preparation and will now start configuring metric monitoring.

First, let’s install Grafana. Then enter Grafana to configure an InfluxDB data source:

img

After configuring the data source, you can add a monitoring panel and then add various charts to it. For example, in a chart of the number of orders placed, we add three metrics: received orders, successful orders, and failed orders.

img

About the configuration in this image:

  • Red box: data source configuration; choose the data source we configured earlier.
  • Blue box: FROM configuration; choose the metric (measurement) name.
  • Green box: SELECT configuration; select the metric fields to query, optionally applying aggregate functions. Here we take the count field and sum it.
  • Purple box: GROUP BY configuration; we group by a 1-minute time granularity and by the reason field, so the Y-axis represents QPM (queries per minute) and each failure reason is plotted as a separate curve.
  • Yellow box: ALIAS BY configuration; sets an alias for each series, referencing the reason tag in the alias.

For more detailed instructions on configuring InfluxDB metrics using Grafana, you can refer here. The meanings of FROM, SELECT, and GROUP BY are similar to SQL and should be easy to understand.
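For the failed-orders series, the panel settings described above translate into roughly the following InfluxQL ($timeFilter is the Grafana time-range macro; the exact field names depend on how Micrometer writes timers to InfluxDB):

SELECT sum("count") FROM "createOrder_failed" WHERE $timeFilter GROUP BY time(1m), "reason"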

Similarly, we configure a complete business monitoring panel containing the 8 metrics we previously implemented:

  • 2 Gauge charts presenting the total number of completed orders and the total number of completed deliveries.
  • 4 Graph charts presenting the count and performance of order placement, and the count and performance of delivery.

Now we move on to the practical part. We will use wrk to test four scenarios and analyze and locate the problems through the curves.

The first scenario uses a valid user ID and an open merchant ID, running for a period of time:

wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=20\&merchantId\=2

The monitoring panel shows the operation of the entire system at a glance. The system is running well: both order placement and delivery succeed, with an average processing time of about 400ms for order placement and about 500ms for delivery, as expected (note that the green and yellow lines in the order placement chart actually overlap, indicating that all orders succeeded):

img

The second scenario is to simulate using an invalid user ID for a period of time:

wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=2\&merchantId\=2

Placing orders with an invalid user will obviously cause all orders to fail. Let’s check whether the monitoring charts reflect this.

In the green box, the order placement chart now has a blue “invalid user” curve that overlaps the received-orders curve, showing that all orders failed because of the invalid-user error and that the request volume at the source is unchanged.

In the red box, although all orders failed, the order placement time dropped from 400ms to 200ms, showing that 200ms is spent before the order fails (consistent with the code). And because the response time of the failing operation halved, the throughput doubled.

Looking at the two delivery monitoring charts, the delivery curves drop to 0 because failed orders never send MQ messages. Also note the blue line: the delivery curve drops to 0 only after the successful-order curve drops to 0, showing that delivery is asynchronous; even after all orders start failing, there are still unprocessed messages left in the MQ queue.

img

The third scenario simulates order placement failing because the merchant is closed:

wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=20\&merchantId\=1

I circled the parts that changed. You can try to analyze it yourself:

img

The fourth scenario is to stop delivery. We use curl to call the interface to set the delivery stop switch:

curl -X POST 'http://localhost:45678/deliver/status?status=false'

From the monitoring, we can see that after the switch was turned off, all delivery messages fail with the reason “deliver outofservice”. Delivery time drops from around 500ms to 0ms, indicating that the failure is a fast local failure rather than, say, a service timeout. Although delivery fails, order placement remains normal:

img

Finally, I want to mention that in addition to the business metrics we added manually, the Micrometer framework automatically generates many metrics about the JVM and other components. Opening the InfluxDB command-line client, you can see the following measurements (metric tables). The first 8 are the business metrics we created ourselves; the rest are JVM and component-status metrics generated automatically by the framework:

> USE mydb
Using database mydb
> SHOW MEASUREMENTS
name: measurements
name
----
createOrder_failed
createOrder_received
createOrder_success
createOrder_totalSuccess
deliverOrder_failed
deliverOrder_received
deliverOrder_success
deliverOrder_totalSuccess
hikaricp_connections
hikaricp_connections_acquire
hikaricp_connections_active
hikaricp_connections_creation
hikaricp_connections_idle
hikaricp_connections_max
hikaricp_connections_min
hikaricp_connections_pending
hikaricp_connections_timeout
hikaricp_connections_usage
http_server_requests
jdbc_connections_max
jdbc_connections_min
jvm_buffer_count
jvm_buffer_memory_used
jvm_buffer_total_capacity
jvm_classes_loaded
jvm_classes_unloaded
jvm_gc_live_data_size
jvm_gc_max_data_size
jvm_gc_memory_allocated
jvm_gc_memory_promoted
jvm_gc_pause
jvm_memory_committed
jvm_memory_max
jvm_memory_used
jvm_threads_daemon
jvm_threads_live
jvm_threads_peak
jvm_threads_states
logback_events
process_cpu_usage
process_files_max
process_files_open
process_start_time
process_uptime
rabbitmq_acknowledged
rabbitmq_acknowledged_published
rabbitmq_channels
rabbitmq_connections
rabbitmq_consumed
rabbitmq_failed_to_publish
rabbitmq_not_acknowledged_published
rabbitmq_published
rabbitmq_rejected
rabbitmq_unrouted_published
spring_rabbitmq_listener
system_cpu_count
system_cpu_usage
system_load_average_1m
tomcat_sessions_active_current
tomcat_sessions_active_max
tomcat_sessions_alive_max
tomcat_sessions_created
tomcat_sessions_expired
tomcat_sessions_rejected
We can select some of these metrics according to our own needs and configure application monitoring panels in Grafana:

img

Locating issues with monitoring charts is much more convenient than digging through logs, isn’t it?

Key Review #

Today, I introduced how to use Spring Boot Actuator to implement the key aspects of production readiness: health checks, exposing application information, and metric monitoring.

As the saying goes, “sharpen your knife before you chop wood.” Health checks let load balancers and deployment tools react to the application’s real availability; the application information and various endpoints provided by Actuator help us inspect internal application details and even adjust some application parameters; and metric monitoring helps us observe overall application health and quickly discover and locate issues.

In fact, a complete application monitoring system generally consists of three parts: logging, metrics, and tracing. I believe you now have a good understanding of logging and metrics. Tracing, which usually requires no extra development work, hasn’t been covered yet, so let me give you a brief introduction.

Tracing, also known as distributed tracing, is represented by open-source systems such as SkyWalking and Pinpoint. Typically, integrating with these systems requires no additional development: starting the Java program with the javaagent they provide lets them dynamically instrument various components with tracing code through bytecode manipulation (similar to AOP).
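For example, attaching the SkyWalking agent at startup looks roughly like this (the paths and the service name are placeholders):

java -javaagent:/path/to/skywalking-agent/skywalking-agent.jar \
     -Dskywalking.agent.service_name=order-service \
     -jar your-app.jar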

The principle of distributed tracing is as follows:

  • When a request enters the first component, a TraceID is generated as a unique identifier for the entire call chain (Trace).
  • For each operation, the time consumption and related information are recorded, forming a Span attached to the call chain. Spans can also be associated in a tree-like structure. When there is a remote call or cross-system call, the TraceID is transmitted (e.g., the TraceID can be passed through the request in an HTTP call or through messages in MQ).
  • These data are aggregated and submitted to the database, and the entire tree-like call chain can be queried through a UI interface.

At the same time, we usually record the TraceID in the logs to facilitate the association between logs and tracing.

I have compared the differences and characteristics of logging, metrics, and tracing with a diagram:

img

In my opinion, a complete monitoring system requires all three components. They can also work together, for example, using metrics to identify performance issues, using tracing to locate the problematic application and operations causing the performance issues, and finally using logs to identify the specific details of the requests.

I have put the code used today on GitHub, and you can click on this link to view it.

Reflection and Discussion #

Spring Boot Actuator provides a large number of built-in endpoints. What is the difference between an endpoint and a custom @RestController? Can you develop a custom endpoint based on the official documentation?

In the introduction of Metrics, we saw that InfluxDB stores some application metrics automatically collected by the Micrometer framework for us. Can you refer to the two JSON files for Grafana configurations in the source code and configure a complete application monitoring dashboard in Grafana with these metrics?

Before an application goes into production, what other production-readiness work would you do?

I am Zhu Ye. Feel free to leave a comment in the comment section and share your thoughts. You are also welcome to share today’s content with your friends or colleagues for further discussion.