24 Business Code Completion Doesn’t Mean Production Is Ready #
Today, let’s discuss whether completing the business code means being production-ready and ready for deployment.
The term “production-ready” refers to the additional development-level work that needs to be done before an application is ready to be deployed to a production environment. In my opinion, if an application has only its functional code completed and is deployed immediately, it is essentially running naked: when problems occur, the lack of effective monitoring makes it impossible to troubleshoot and locate them, and we may well hit problems we are not even aware of, only finding out through user feedback.
So, what work needs to be done to be production-ready? I believe the following three aspects are the most important.
First, provide health check interfaces. The traditional approach of “pinging” an application to detect liveness is not accurate: key internal or external dependencies may have gone offline, leaving the application unable to function properly, even though its web or management ports still respond to pings. We should provide a dedicated health check interface and have it probe key internal components wherever possible.
Second, expose internal application information. Internal components such as thread pools and memory queues often play important roles within an application. If the application or its framework exposes this important information for monitoring, it becomes possible to detect the early signs of major issues, such as an impending OOM, before they turn into serious problems.
Third, establish application metric monitoring. Metrics are important pieces of information periodically aggregated into numerical form and plotted as trend charts. Metric monitoring covers two aspects: first, metrics for important internal components of the application, such as JVM metrics and interface QPS; second, business metrics, such as order volume for an e-commerce site or the number of online players for a game.
Today, I will discuss how to quickly implement these three aspects through practical examples.
Preparation: Configuring Spring Boot Actuator #
Spring Boot has an Actuator module that encapsulates production-ready features such as health checks, application information, and metrics. The content in the rest of today’s lesson relies on Actuator, so we need to first introduce and configure Actuator.
We can introduce Actuator by adding the following dependency to the pom.xml file:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
After that, you can directly use Actuator, but there are a few important configurations to note:
If you don’t want the Actuator management port to be the same as the application port, you can use management.server.port to set a separate port.
Actuator comes with many built-in endpoints that provide information out-of-the-box. These endpoints can be exposed through JMX or web. Considering that some information may be sensitive, these endpoints are not fully open by default. You can check the default values on the official website. To facilitate the subsequent demo, we will set all endpoints to be exposed through the web.
By default, the root address for web access to Actuator is /actuator, but you can modify it using the management.endpoints.web.base-path parameter. Let me demonstrate how to change it to /admin:
management.server.port=45679
management.endpoints.web.exposure.include=*
management.endpoints.web.base-path=/admin
Now, you can access http://localhost:45679/admin to view all the functionality URLs provided by Actuator:
Most of the endpoints provide read-only information, such as querying Spring beans, the ConfigurableEnvironment, scheduled tasks, Spring Boot auto-configuration, Spring MVC mappings, and so on. A few endpoints also provide modification functionality, such as gracefully shutting down the application, downloading thread dumps, downloading heap dumps, and changing log levels.
You can access this link to view the functionality of all these endpoints and learn more about the information they provide and the operations they perform. In addition, I’d like to share a great Spring Boot management tool called Spring Boot Admin, which packages most of the functionality provided by Actuator endpoints into a web UI.
Health checks require access to key components #
Earlier, I mentioned that health checks allow monitoring systems or deployment tools to understand the true health status of an application, which is more reliable than pinging application ports. However, the most crucial aspect of achieving this is ensuring that the health check interface actually probes the status of key components.
Fortunately, Spring Boot Actuator provides us with pre-implemented health indicators for third-party systems such as databases, InfluxDB, Elasticsearch, Redis, and RabbitMQ.
Through Spring Boot’s auto-configuration, these indicators will be automatically activated. When these components have problems, the HealthIndicator will return a DOWN or OUT_OF_SERVICE status, and the health endpoint’s HTTP response status code will also become 503. We can use this information to configure program health status monitoring and alerts.
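If one of these auto-configured indicators covers a dependency that is not actually critical for your application, it can be switched off individually. A minimal sketch, using documented Spring Boot 2.x properties:

```properties
# Keep the app's health independent of Redis, for example
management.health.redis.enabled=false
# Or turn off all auto-configured indicators and opt in selectively
management.health.defaults.enabled=false
```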
For demonstration purposes, we can modify the configuration file and set the management.endpoint.health.show-details parameter to always, allowing all users to directly view the health status of each component (if configured as when-authorized, it can be combined with management.endpoint.health.roles to configure authorized roles):
management.endpoint.health.show-details=always
Accessing the health endpoint will show that the health status of components such as databases, disks, RabbitMQ, and Redis is UP, and the overall application status is also UP:
After understanding the basic configuration, let’s consider a scenario where the program depends on a critical third-party service, and we want the application’s health status to be DOWN when this service is inaccessible.
For example, the third-party service has a user interface, and the probability of it throwing an exception is 50%:
@Slf4j
@RestController
@RequestMapping("user")
public class UserServiceController {
@GetMapping
public User getUser(@RequestParam("userId") long id) {
// 50% chance to return a correct response, 50% chance to throw an exception
if (ThreadLocalRandom.current().nextInt() % 2 == 0) {
return new User(id, "name" + id);
} else {
throw new RuntimeException("error");
}
}
}
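As a side note, the 50/50 split works because ThreadLocalRandom.current().nextInt() is uniform over all ints, half of which are even (for negative odd values, % 2 yields -1, which still fails the == 0 check). A quick stdlib-only sketch to confirm the ratio; the class and method names are mine:

```java
import java.util.concurrent.ThreadLocalRandom;

public class FailureRateDemo {
    // Fraction of samples for which nextInt() % 2 == 0 (the "success" branch above)
    static double evenFraction(int samples) {
        int even = 0;
        for (int i = 0; i < samples; i++) {
            if (ThreadLocalRandom.current().nextInt() % 2 == 0) {
                even++;
            }
        }
        return (double) even / samples;
    }

    public static void main(String[] args) {
        // Prints a value very close to 0.5
        System.out.println(evenFraction(1_000_000));
    }
}
```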
Associating this user interface’s correct response with the overall health status of the program is simple: we only need to define a UserServiceHealthIndicator that implements the HealthIndicator interface.
In the health method, we use RestTemplate to access the user interface. If the result is correct, we return Health.up() and attach the invocation time and result as additional details on the Health object. If an exception occurs during the call, we return Health.down() and attach the exception information as additional details instead:
@Component
@Slf4j
public class UserServiceHealthIndicator implements HealthIndicator {
@Autowired
private RestTemplate restTemplate;
@Override
public Health health() {
long begin = System.currentTimeMillis();
long userId = 1L;
User user = null;
try {
// Access remote interface
user = restTemplate.getForObject("http://localhost:45678/user?userId=" + userId, User.class);
if (user != null && user.getUserId() == userId) {
// Result is correct, return UP status, and provide additional information such as execution time and user information
return Health.up()
.withDetail("user", user)
.withDetail("took", System.currentTimeMillis() - begin)
.build();
} else {
// Result is incorrect, return DOWN status, and provide additional information such as execution time
return Health.down().withDetail("took", System.currentTimeMillis() - begin).build();
}
} catch (Exception ex) {
// Exception occurred, log the exception and return DOWN status, providing additional information such as exception information and execution time
log.warn("health check failed!", ex);
return Health.down(ex).withDetail("took", System.currentTimeMillis() - begin).build();
}
}
}
Now let’s look at an example of aggregating multiple HealthIndicators. We define a CompositeHealthContributor to aggregate multiple HealthContributors and monitor a group of thread pools.
First, in the ThreadPoolProvider class, we define two thread pools: demoThreadPool, which has a single worker thread and an ArrayBlockingQueue of length 10; and ioThreadPool, which simulates an IO thread pool with 10 core threads and a maximum of 50 threads:
public class ThreadPoolProvider {
// ThreadPool with one working thread and a queue length of 10
private static ThreadPoolExecutor demoThreadPool = new ThreadPoolExecutor(
1, 1,
2, TimeUnit.SECONDS,
new ArrayBlockingQueue<>(10),
new ThreadFactoryBuilder().setNameFormat("demo-threadpool-%d").get());
// ThreadPool with a core thread number of 10 and a maximum thread number of 50, with a queue length of 100
private static ThreadPoolExecutor ioThreadPool = new ThreadPoolExecutor(
10, 50,
2, TimeUnit.SECONDS,
new ArrayBlockingQueue<>(100),
new ThreadFactoryBuilder().setNameFormat("io-threadpool-%d").get());
public static ThreadPoolExecutor getDemoThreadPool() {
return demoThreadPool;
}
public static ThreadPoolExecutor getIOThreadPool() {
return ioThreadPool;
}
}
Next, we define an interface that submits time-consuming tasks to the demoThreadPool thread pool, to simulate the thread pool queue filling up:
@GetMapping("slowTask")
public void slowTask() {
ThreadPoolProvider.getDemoThreadPool().execute(() -> {
try {
TimeUnit.HOURS.sleep(1);
} catch (InterruptedException e) {
}
});
}
After these preparations, let’s implement the custom HealthIndicator class for the status of a single thread pool.
We pass in a ThreadPoolExecutor and determine the health of the component from the remaining capacity of its queue: if there is remaining capacity, we return UP; otherwise, DOWN. We also attach two important pieces of thread pool queue data, the current number of elements in the queue and its remaining capacity, as supplementary information on the Health object:
public class ThreadPoolHealthIndicator implements HealthIndicator {
private ThreadPoolExecutor threadPool;
public ThreadPoolHealthIndicator(ThreadPoolExecutor threadPool) {
this.threadPool = threadPool;
}
@Override
public Health health() {
// Supplementary information
Map<String, Integer> detail = new HashMap<>();
// Current number of elements in the queue
detail.put("queue_size", threadPool.getQueue().size());
// Remaining capacity of the queue
detail.put("queue_remaining", threadPool.getQueue().remainingCapacity());
// If there is remaining capacity, return UP; otherwise, return DOWN
if (threadPool.getQueue().remainingCapacity() > 0) {
return Health.up().withDetails(detail).build();
} else {
return Health.down().withDetails(detail).build();
}
}
}
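The DOWN condition above boils down to remainingCapacity() == 0 on the pool’s work queue. Here is a stdlib-only sketch (class and method names are mine) showing how a saturated pool reaches that state: the first task is handed straight to the single core thread, so the next two submissions fill the queue of capacity 2:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueCapacityDemo {
    // Saturate a 1-thread pool whose queue holds 2 tasks, then report remaining capacity
    static int remainingAfterSaturation() {
        CountDownLatch block = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 2, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        // Goes directly to the core thread, which then blocks on the latch
        pool.execute(() -> {
            try {
                block.await();
            } catch (InterruptedException ignored) {
            }
        });
        // These two tasks can only wait in the queue, exhausting its capacity
        pool.execute(() -> { });
        pool.execute(() -> { });
        int remaining = pool.getQueue().remainingCapacity();
        block.countDown();
        pool.shutdownNow();
        return remaining;
    }

    public static void main(String[] args) {
        // 0 is exactly the condition the indicator maps to DOWN
        System.out.println("remaining capacity: " + remainingAfterSaturation());
    }
}
```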
Next, let’s define a CompositeHealthContributor to aggregate two ThreadPoolHealthIndicator instances, corresponding to the two thread pools defined in ThreadPoolProvider:
@Component
public class ThreadPoolsHealthContributor implements CompositeHealthContributor {
// Save all child HealthContributors
private Map<String, HealthContributor> contributors = new HashMap<>();
ThreadPoolsHealthContributor() {
// Corresponding to the two thread pools defined in ThreadPoolProvider
this.contributors.put("demoThreadPool", new ThreadPoolHealthIndicator(ThreadPoolProvider.getDemoThreadPool()));
this.contributors.put("ioThreadPool", new ThreadPoolHealthIndicator(ThreadPoolProvider.getIOThreadPool()));
}
@Override
public HealthContributor getContributor(String name) {
// Find a specific HealthContributor based on the name
return contributors.get(name);
}
@Override
public Iterator<NamedContributor<HealthContributor>> iterator() {
// Return an iterator of NamedContributor, where NamedContributor is an instance of Contributor with a name
return contributors.entrySet().stream()
.map((entry) -> NamedContributor.of(entry.getKey(), entry.getValue())).iterator();
}
}
After the program starts, we can see that the health endpoint displays the health status of the thread pools and of the external service userService, along with some specific details:
We can see that when demoThreadPool is DOWN, it causes threadPools to be DOWN, which in turn causes the entire application’s status to be DOWN:
That’s it! By customizing HealthContributor and CompositeHealthContributor, we can monitor and probe key components such as third-party services and thread pools within the application. Quite convenient, isn’t it?
As an additional note, starting from Spring Boot 2.3.0, health checks have been enhanced, and the Liveness and Readiness endpoints have been refined, making it easier to integrate Spring Boot applications with Kubernetes.
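If you are on Spring Boot 2.3+ and want to try these probes outside Kubernetes (where they are enabled automatically), the documented switch is:

```properties
# Expose health/liveness and health/readiness under the management base path
management.endpoint.health.probes.enabled=true
```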
Exposing the Internal State of Important Components #
In addition to using the state of the thread pool as an indicator of the overall health of the application, we can also expose the status data of important components within the application through the InfoContributor feature of Actuator. Here, I will demonstrate how to view the status data using the info HTTP endpoint and JMX MBean with an example.
Let’s take a specific case where we implement a ThreadPoolInfoContributor to expose information about the thread pools:
@Component
public class ThreadPoolInfoContributor implements InfoContributor {
private static Map<String, Object> threadPoolInfo(ThreadPoolExecutor threadPool) {
Map<String, Object> info = new HashMap<>();
info.put("poolSize", threadPool.getPoolSize());
info.put("corePoolSize", threadPool.getCorePoolSize());
info.put("largestPoolSize", threadPool.getLargestPoolSize());
info.put("maximumPoolSize", threadPool.getMaximumPoolSize());
info.put("completedTaskCount", threadPool.getCompletedTaskCount());
return info;
}
@Override
public void contribute(Info.Builder builder) {
builder.withDetail("demoThreadPool", threadPoolInfo(ThreadPoolProvider.getDemoThreadPool()));
builder.withDetail("ioThreadPool", threadPoolInfo(ThreadPoolProvider.getIOThreadPool()));
}
}
By accessing the /admin/info endpoint, you can see this data:
Additionally, if JMX is enabled:
spring.jmx.enabled=true
You can use the jconsole tool and locate the Info MBean in org.springframework.boot.Endpoint. By executing the info operation, you can see the information about the two thread pools we just customized:
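Incidentally, the JMX plumbing that jconsole talks to is also available in-process through the java.lang.management API; the framework-provided JVM metrics are ultimately based on the same platform MBeans. A minimal stdlib sketch (class and method names are mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JmxPeek {
    // Read the live thread count from the platform ThreadMXBean
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    // Read current heap usage in bytes from the platform MemoryMXBean
    static long heapUsedBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreads());
        System.out.println("heap used: " + heapUsedBytes() + " bytes");
    }
}
```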
Furthermore, I would like to add one more thing. In addition to using jconsole to view and operate MBeans, you can use Jolokia to transform JMX into the HTTP protocol. To do this, you need to add the following dependency:
<dependency>
<groupId>org.jolokia</groupId>
<artifactId>jolokia-core</artifactId>
</dependency>
With Jolokia, you can execute the info operation of the org.springframework.boot:type=Endpoint,name=Info MBean:
Metrics are the “golden key” for quickly identifying problems #
Metrics are quantitative values, associated with time, that measure the system along some dimension. By collecting metrics and displaying them as line charts, pie charts, and other graphs, we can quickly locate and analyze problems.
Let’s take an actual case to see how we can quickly identify problems through charts.
There is an order and delivery process for takeout orders, as shown in the diagram below. The OrderController performs the ordering operation: it first validates the parameters; if they are correct, it calls another service to check the merchant’s status; if the merchant is open, the order proceeds. After the order succeeds, a message is sent to RabbitMQ to trigger the asynchronous delivery process, and a DeliverOrderHandler listens for this message and performs the delivery.
For a business process involving synchronous calls and asynchronous calls, how can we quickly determine which link has a problem if the user reports a failed order?
At this time, the metrics system can be useful. We can establish some metrics to monitor the two important operations of order placement and delivery.
For the order placement operation, we can establish 4 metrics:
- Total Order Quantity, which monitors the current total number of orders in the system;
- Order Request, which increments by 1 for each order request received before processing;
- Successful Orders, which increments by 1 for each successful order completion;
- Failed Orders, which increments by 1 for each order operation that encounters an exception, and attaches the exception reason to the metric.
Similarly, we can establish 4 metrics for the delivery operation. To collect metrics, we can use the Micrometer framework, which is also the metrics framework chosen by Spring Boot Actuator. It abstracts over various metric types; the commonly used ones include:
- Gauge (red), which reflects the current value of a metric. The total order quantity metric in this example, the number of online players in a game, or the current number of JVM threads can all be considered gauges;
- Counter (green), which can only be incremented, one per event, and accumulates. The order request metric in this example is a counter: if we call a method 10 times within a 5-second reporting step, Micrometer sends the metric to the backend storage once per step with a value of 10;
- Timer (blue), which is similar to a counter but records time as well as count, such as the successful order and failed order metrics in this example.
All metrics can be accompanied by tags as supplemental data. For example, when an operation fails, a reason tag will be attached to the metric.
In addition to abstracting metrics, Micrometer also abstracts storage. You can think of Micrometer as a framework similar to SLF4J, but the latter abstracts logs, while Micrometer abstracts metrics. Micrometer introduces various registries, which can seamlessly connect to various monitoring systems or time series databases.
In this case, we introduce the micrometer-registry-influx dependency, which brings in the Micrometer core and binds it to InfluxDB (a time series database specializing in storing metric data), so that our metrics are stored in InfluxDB:
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-influx</artifactId>
</dependency>
Then, modify the configuration file: enable the switch for exporting metrics to InfluxDB, configure the InfluxDB address, and set the step so that metrics are aggregated client-side and sent to InfluxDB once per second:
management.metrics.export.influx.enabled=true
management.metrics.export.influx.uri=http://localhost:8086
management.metrics.export.influx.step=1S
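One related configuration worth knowing: Micrometer supports common tags attached to every metric, which makes it easy to tell apart instances of the same service in InfluxDB. management.metrics.tags.* is the documented Spring Boot property; the tag values below are my own examples:

```properties
# Attach these tags to every metric reported by this instance
management.metrics.tags.application=order-service
management.metrics.tags.instance=host-1
```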
Next, we add relevant code to record metrics in the business logic.
Below is the implementation of the OrderController. The code contains detailed comments, so I will not explain it line by line; pay attention to how the Micrometer framework is used to implement the four metrics: total order quantity, order requests, successful orders, and failed orders:
@Slf4j
@RestController
@RequestMapping("order")
public class OrderController {
// Total number of order creations
private AtomicLong createOrderCounter = new AtomicLong();
@Autowired
private RabbitTemplate rabbitTemplate;
@Autowired
private RestTemplate restTemplate;
@PostConstruct
public void init() {
// Register createOrder.totalSuccess metric. To initialize a gauge metric, you only need to associate it with an AtomicLong reference like this.
Metrics.gauge("createOrder.totalSuccess", createOrderCounter);
}
// Order interface, with user ID and merchant ID as input parameters
@GetMapping("createOrder")
public void createOrder(@RequestParam("userId") long userId, @RequestParam("merchantId") long merchantId) {
// Record createOrder.received metric, which is a counter metric indicating the receipt of an order request.
Metrics.counter("createOrder.received").increment();
Instant begin = Instant.now();
try {
TimeUnit.MILLISECONDS.sleep(200);
// Simulate a situation where the user is invalid. An ID less than 10 is considered an invalid user.
if (userId < 10) {
throw new RuntimeException("invalid user");
}
// Query merchant service
Boolean merchantStatus = restTemplate.getForObject("http://localhost:45678/order/getMerchantStatus?merchantId=" + merchantId, Boolean.class);
if (merchantStatus == null || !merchantStatus)
throw new RuntimeException("closed merchant");
Order order = new Order();
order.setId(createOrderCounter.incrementAndGet()); // The gauge indicator can get automatic updates
order.setUserId(userId);
order.setMerchantId(merchantId);
// Send MQ message
rabbitTemplate.convertAndSend(Consts.EXCHANGE, Consts.ROUTING_KEY, order);
// Record the createOrder.success indicator once, which is a timer indicator that represents a successful order and provides the duration
Metrics.timer("createOrder.success").record(Duration.between(begin, Instant.now()));
} catch (Exception ex) {
log.error("createOrder userId {} failed", userId, ex);
// Record the createOrder.failed indicator once, which is a timer indicator that represents a failed order and provides the duration, and record the failure reason as a tag
Metrics.timer("createOrder.failed", "reason", ex.getMessage()).record(Duration.between(begin, Instant.now()));
}
}
// Merchant query interface
@GetMapping("getMerchantStatus")
public boolean getMerchantStatus(@RequestParam("merchantId") long merchantId) throws InterruptedException {
// Only merchant ID 2 is open
TimeUnit.MILLISECONDS.sleep(200);
return merchantId == 2;
}
}
When the user ID is less than 10, we simulate the case where the user data is invalid, and when the merchant ID is not 2, we simulate the case where the merchant is not open.
Next is the implementation of the delivery service, the DeliverOrderHandler. Its deliverOrder method listens for the MQ messages sent by OrderController and simulates delivery. As the following code shows, it records the four delivery-related metrics in the same way:
// Delivery service message handler
@RestController
@Slf4j
@RequestMapping("deliver")
public class DeliverOrderHandler {
// Delivery service running status
private volatile boolean deliverStatus = true;
private AtomicLong deliverCounter = new AtomicLong();
// Change the delivery status through an external interface to simulate delivery service stoppage
@PostMapping("status")
public void status(@RequestParam("status") boolean status) {
deliverStatus = status;
}
@PostConstruct
public void init() {
// Also register a gauge indicator deliverOrder.totalSuccess, representing the total number of deliveries, only need to be registered once
Metrics.gauge("deliverOrder.totalSuccess", deliverCounter);
}
// Listen for MQ messages
@RabbitListener(queues = Consts.QUEUE_NAME)
public void deliverOrder(Order order) {
Instant begin = Instant.now();
// Increment deliverOrder.received, indicating that an order message has been received, of counter type
Metrics.counter("deliverOrder.received").increment();
try {
if (!deliverStatus)
throw new RuntimeException("deliver outofservice");
TimeUnit.MILLISECONDS.sleep(500);
deliverCounter.incrementAndGet();
// Indicate successful delivery, of timer type
Metrics.timer("deliverOrder.success").record(Duration.between(begin, Instant.now()));
} catch (Exception ex) {
log.error("deliver Order {} failed", order, ex);
// Failed delivery metric deliverOrder.failed, also attaching the failure reason as tags, timer type
Metrics.timer("deliverOrder.failed", "reason", ex.getMessage()).record(Duration.between(begin, Instant.now()));
}
}
}
Meanwhile, we simulate a switch for the overall status of the delivery service; calling the status interface flips it. With that, the scene preparation is complete, and we can start configuring metric monitoring.
First, let’s install Grafana. Then enter Grafana to configure an InfluxDB data source:
After configuring the data source, you can add a monitoring panel, then add various charts to it. For example, in the chart of order volume, we add three metrics: received orders, successful orders, and failed orders.
About the configuration in this image:
- Red box: data source configuration; choose the data source we configured.
- Blue box: FROM configuration; choose the metric name.
- Green box: SELECT configuration; select the metric fields to query, optionally applying aggregate functions. Here we take the value of the count field and sum it with the sum function.
- Purple box: GROUP BY configuration; we group by a 1-minute time granularity and by the reason field, so the Y-axis represents QPM (queries per minute) and each failure reason is plotted as a separate curve.
- Yellow box: ALIAS BY configuration; sets the alias for each series, referencing the reason tag in the alias.
For more detailed instructions on configuring InfluxDB metrics using Grafana, you can refer here. The meanings of FROM, SELECT, and GROUP BY are similar to SQL and should be easy to understand.
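Putting the boxes together, the query Grafana builds for the failed-orders series is roughly the following InfluxQL (the measurement name follows Micrometer’s InfluxDB naming convention of dots becoming underscores; treat this as an illustrative sketch rather than the exact generated query):

```sql
SELECT sum("count")
FROM "createOrder_failed"
WHERE $timeFilter
GROUP BY time(1m), "reason"
```

Here $timeFilter is a Grafana macro that expands to the dashboard’s currently selected time range.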
Similarly, we configure a complete business monitoring panel containing the 8 metrics we previously implemented:
- 2 Gauge charts, presenting the total number of completed orders and the total number of completed deliveries.
- 4 Graph charts, presenting the number and performance of order placements, as well as the number and performance of deliveries.
Now we move on to the practical part. We will use wrk to test four scenarios and analyze and locate the problems through the curves.
The first scenario is to use a valid user ID and a business merchant ID to run for a period of time:
wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=20\&merchantId\=2
The operation of the entire system can be seen at a glance on the monitoring panel: the system is running well, both order placement and delivery succeed, with an average processing time of around 400ms for orders and around 500ms for deliveries, as expected (note that the green and yellow lines in the order placement chart overlap, indicating that all orders succeeded):
The second scenario is to simulate using an invalid user ID for a period of time:
wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=2\&merchantId\=2
Using an invalid user for order placement will obviously cause all orders to fail. Next, let’s see if we can see this phenomenon from the monitoring charts.
In the green box, we can see that the order placement chart now has a blue curve for the “invalid user” failure, and it matches the received-orders curve, indicating that all orders failed because of the invalid user error, right at the start of the flow.
In the red box, although all orders fail, the order placement time has dropped from 400ms to 200ms, showing that about 200ms is spent before the order fails (consistent with the code). And because the response time of the failing operation has halved, the throughput has doubled.
Looking at the two delivery charts, the delivery curves drop to 0 because failed orders never send MQ messages. Also note the blue line: the delivery curve drops to 0 slightly after the order success curve does, showing that delivery is asynchronous; even after all orders started failing, messages already sitting in the MQ queue still had to be processed.
The third scenario is to try order placement failure due to the merchant being closed:
wrk -t 1 -c 1 -d 3600s http://localhost:45678/order/createOrder\?userId\=20\&merchantId\=1
I circled the parts that changed. You can try to analyze it yourself:
The fourth scenario is to stop delivery. We use curl to call the interface to set the delivery stop switch:
curl -X POST 'http://localhost:45678/deliver/status?status=false'
From the monitoring, we can see that after the switch is turned off, all delivery messages fail with the reason “deliver outofservice”. The latency of the delivery operation drops from around 500ms to 0ms, showing that the failure is a fast local failure rather than, say, a service timeout. Although delivery fails, order placement remains normal:
Finally, I want to mention that in addition to the business metrics we add manually, the Micrometer framework automatically generates many metrics about the JVM’s internal state. Entering the InfluxDB command-line client, you can see the following measurements. The first 8 are the business metrics we created ourselves, and the rest are the JVM and component-status metrics generated by the framework:
> USE mydb
Using database mydb
> SHOW MEASUREMENTS
name: measurements
name
----
createOrder_failed
createOrder_received
createOrder_success
createOrder_totalSuccess
deliverOrder_failed
deliverOrder_received
deliverOrder_success
deliverOrder_totalSuccess
hikaricp_connections
hikaricp_connections_acquire
hikaricp_connections_active
hikaricp_connections_creation
hikaricp_connections_idle
hikaricp_connections_max
hikaricp_connections_min
hikaricp_connections_pending
hikaricp_connections_timeout
hikaricp_connections_usage
http_server_requests
jdbc_connections_max
jdbc_connections_min
jvm_buffer_count
jvm_buffer_memory_used
jvm_buffer_total_capacity
jvm_classes_loaded
jvm_classes_unloaded
jvm_gc_live_data_size
jvm_gc_max_data_size
jvm_gc_memory_allocated
jvm_gc_memory_promoted
jvm_gc_pause
jvm_memory_committed
jvm_memory_max
jvm_memory_used
jvm_threads_daemon
jvm_threads_live
jvm_threads_peak
jvm_threads_states
logback_events
process_cpu_usage
process_files_max
process_files_open
process_start_time
process_uptime
rabbitmq_acknowledged
rabbitmq_acknowledged_published
rabbitmq_channels
rabbitmq_connections
rabbitmq_consumed
rabbitmq_failed_to_publish
rabbitmq_not_acknowledged_published
rabbitmq_published
rabbitmq_rejected
rabbitmq_unrouted_published
spring_rabbitmq_listener
system_cpu_count
system_cpu_usage
system_load_average_1m
tomcat_sessions_active_current
tomcat_sessions_active_max
tomcat_sessions_alive_max
tomcat_sessions_created
tomcat_sessions_expired
tomcat_sessions_rejected
We can select some of these metrics according to our own needs and configure application monitoring panels in Grafana:
By using the monitoring charts to locate issues, it is much more convenient than using logs, right?
Key Review #
Today, I introduced to you several key points on how to implement production readiness using Spring Boot Actuator, including health checks, exposing application information, and monitoring metrics.
As the saying goes, “sharpen your knife before you chop wood.” Health checks help load balancers and deployment tools route traffic only to healthy instances. The application information and various endpoints provided by Actuator let us inspect application internals and even adjust some parameters at runtime. Metrics monitoring helps us observe overall application health and quickly discover and locate issues.
In fact, a complete application monitoring system generally consists of three pillars: logging, metrics, and tracing. I believe you already have a good understanding of logging and metrics. Tracing generally does not require development work, so I have not covered it; let me give you a brief introduction.
Tracing, also known as distributed tracing, is represented by open-source systems such as SkyWalking and Pinpoint. Typically, integrating with these systems requires no additional development: starting the Java program with the javaagent they provide dynamically instruments various components via bytecode manipulation, weaving in tracing code (similar to AOP).
The principle of distributed tracing is as follows:
- When a request enters the first component, a TraceID is generated as a unique identifier for the entire call chain (Trace).
- For each operation, the time consumption and related information are recorded, forming a Span attached to the call chain. Spans can also be associated in a tree-like structure. When there is a remote call or cross-system call, the TraceID is transmitted (e.g., the TraceID can be passed through the request in an HTTP call or through messages in MQ).
- These data are aggregated and submitted to the database, and the entire tree-like call chain can be queried through a UI interface.
At the same time, we usually record the TraceID in the logs to facilitate the association between logs and tracing.
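The TraceID propagation described above can be sketched in plain Java. The X-Trace-Id header name and the class below are my own illustration; real systems use headers defined by their tracing protocol, such as the W3C traceparent header:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TracePropagationDemo {
    // At the first component, mint a TraceID; on downstream hops, reuse the incoming one
    static String ensureTraceId(Map<String, String> headers) {
        return headers.computeIfAbsent("X-Trace-Id", k -> UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        Map<String, String> request = new HashMap<>();
        String minted = ensureTraceId(request);  // edge: generated
        String reused = ensureTraceId(request);  // downstream: propagated unchanged
        System.out.println(minted.equals(reused)); // the whole chain shares one TraceID
    }
}
```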
I have compared the differences and characteristics of logging, metrics, and tracing with a diagram:
In my opinion, a complete monitoring system requires all three components. They can also work together, for example, using metrics to identify performance issues, using tracing to locate the problematic application and operations causing the performance issues, and finally using logs to identify the specific details of the requests.
I have put the code used today on GitHub, and you can click on this link to view it.
Reflection and Discussion #
Spring Boot Actuator provides a large number of built-in endpoints. What is the difference between an endpoint and a custom @RestController? Can you develop a custom endpoint based on the official documentation?
In the introduction of Metrics, we saw that InfluxDB stores some application metrics automatically collected by the Micrometer framework for us. Can you refer to the two JSON files for Grafana configurations in the source code and configure a complete application monitoring dashboard in Grafana with these metrics?
Before the application goes into production, what other production-ready work would you do?
I am Zhu Ye. Feel free to leave a comment in the comment section and share your thoughts. You are also welcome to share today’s content with your friends or colleagues for further discussion.