27 Stability Scenario, Part 2: How to Deal with Bottleneck Issues Caused by Insufficient Disk Space #

Hello, I’m Gao Lou.

In the previous lesson, we explained the two key points of stability scenarios: runtime and stress level, and walked through how to run a stability scenario on the course’s example system.

In the first phase of targeted analysis, we diagnosed an operating-system OOM caused by over-allocating memory to the virtual machines. After we reduced the virtual machines’ memory, the stability scenario ran for more than 12 hours with an accumulated business volume of over 72 million, which already met our goal.

However, out of greed I did not stop the scenario, and new problems appeared while it kept running… So today we enter the second phase of targeted analysis to see what other problems await us.

Targeted Monitoring and Analysis #

Targeted Analysis Phase 2 #

As the scenario continued to run, I observed the following data:

From the graph, it is evident that TPS has dropped and response time has increased as the scenario continues to run.

Let’s take a look at the total accumulated business volume at this time:

This means that there has been an increase of over 200,000 units in the total accumulated business volume.

Seeing a problem without analyzing it can be uncomfortable, so let’s analyze it.

Following the performance analysis decision tree, let’s go through the counters one by one. When I checked the MySQL Pod logs, I found that it was constantly being deleted and rebuilt:

Please note that this is just an example system and for the purpose of easy rebuilding, I placed MySQL in a Pod. If this were a real environment, I would advise you to configure the database based on the actual production setup.

At this point, let me respond to a view in the industry: that databases should not be placed in Kubernetes Pods. I’m not sure who holds this view, or whether it is just a feeling of insecurity or based on problems actually encountered. In my experience, many systems do not have high performance requirements for the database and their business volume is not large; managing the database in containers makes migration and rebuilding easier, and there is no performance problem either. So in this scenario, using Pods is perfectly fine.

Of course, some systems are more demanding in terms of database performance and choose to deploy directly on physical machines to ensure higher performance. If the system you are facing really requires creating a database on a physical machine, then go ahead with that option. If Pods meet the requirements, I don’t think there is any need to worry, just use Pods.
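
Back to the rebuild problem itself: if you want to reproduce this kind of observation, watching the Pod and reading its events with kubectl is usually enough. A minimal sketch for reference, assuming the MySQL Pod runs in the default namespace and is named mysql-0 (both are assumptions):

```
# Watch Pod status changes in real time; a Pod that keeps being deleted and
# rebuilt shows up as new Pod names (or a climbing RESTARTS count).
kubectl get pods -n default -o wide -w | grep mysql

# The events recorded for the current MySQL Pod usually state why it was
# killed or evicted (the Pod name here is an assumption).
kubectl describe pod mysql-0 -n default | tail -n 20

# If the container was restarted in place, the previous container's log
# often holds the real reason it died.
kubectl logs mysql-0 -n default --previous
```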

Because the MySQL Pod was constantly being deleted and rebuilt, and MySQL is located in worker-1, let’s take a look at the operating system logs for worker-1:

Feb 22 00:43:16 k8s-worker-1 kubelet: I0222 00:43:16.085214    1082 image_gc_manager.go:304] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 7213867827 bytes down to the low threshold (80%).

It turns out that the disk on worker-1 is nearly full, which explains why the MySQL Pod keeps being deleted and rebuilt.
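
The 85% and 80% in that kubelet log line are the image garbage-collection thresholds. If you want to confirm what your kubelet is actually configured with, here is a minimal sketch, assuming a kubeadm-style setup where the kubelet configuration lives at /var/lib/kubelet/config.yaml (the path is an assumption; the field names are the standard KubeletConfiguration ones):

```
# Image GC thresholds: above the high threshold the kubelet starts deleting
# unused images; it stops once usage falls back below the low threshold.
grep -iE "imageGC(High|Low)ThresholdPercent" /var/lib/kubelet/config.yaml

# Hard eviction thresholds decide when Pods (not just images) are evicted
# under DiskPressure; if unset, the kubelet falls back to built-in defaults.
grep -iA 5 "evictionHard" /var/lib/kubelet/config.yaml
```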

Let’s check the disk usage:

```
[root@k8s-worker-1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                 6.3G     0  6.3G    0% /dev
tmpfs                    6.3G   24M  6.3G    1% /dev/shm
tmpfs                    6.3G   67M  6.3G    2% /run
tmpfs                    6.3G     0  6.3G    0% /sys/fs/cgroup
/dev/mapper/centos-root   47G   45G  2.7G   95% /
/dev/vda1               1014M  304M  711M   30% /boot
tmpfs                    1.3G  4.0K  1.3G    1% /run/user/42
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/9962f8d2-f6bb-4981-a073-dd16bfa9a171/volumes/kubernetes.io~secret/kube-proxy-token-vnxh9
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/f5872331-14b1-402b-99e0-063834d834fa/volumes/kubernetes.io~secret/calico-node-token-hvs7q
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/e61e5b2232592ef9883861d8536f37153617d46735026b49b285c016a47179cf/merged
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/4c057d86c1eabb84eddda86f991ca3852042da0647fd5b8c349568e2a0565591/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/f1e8c983be46895acc576c1d51b631bd2767aabe908035cff229af0cd6c47ffb/mounts/shm
shm                       64M     0   64M    0% /var/lib/docker/containers/c7e44cdfc5faa7f8ad9a08f8b8ce44928a5116ccf912fbc2d8d8871ab00648a5/mounts/shm
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/a685a1652586aca165f7f159347bf466dd63f497766762d8738b511c7eca1df3/merged
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/b7da3fde04f716a7385c47fe558416b35e471797a97b28dddd79f500c62542f2/merged
tmpfs                    1.3G   36K  1.3G    1% /run/user/0
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/d01f8686-e066-4ebf-951e-e5fe9d39067d/volumes/kubernetes.io~secret/node-exporter-token-wnzc6
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/5215bd987a62316b3ebb7d6b103e991f26fffea4fe3c05aac51feeb44ab099ab/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/d0cf9df15ac269475bb9f2aec20c048c8a61b98a993c16f5d6ef4aba2027326a/mounts/shm
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/aa5125b01d60b19c75f3f5d018f7bb51e902264580a7f4033e5d2abaaf7cc3f6/merged
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/3a7d3d4cddc51410103731e7e8f3fbcddae4d74a116c88582557a79252124c5d/merged
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/34f60184-07e5-40da-b2cb-c0295d560d54/volumes/kubernetes.io~secret/default-token-7s6hb
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/940009fca9f57e4b6f6d38bab584d69a2f3ff84153e3f0dfd3c9b9db91fa2b30/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/12c6a27bb53a4b0de5556a960d7c394272d11ceb46ac8172bd91f58f762cde14/mounts/shm
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/366007a9f82dfb9bd5de4e4cadf184cba122ef2070c096f393b7b9e24ae06a98/merged
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/251e9c86-4f25-42bd-82a0-282d057fe238/volumes/kubernetes.io~secret/nginx-ingress-token-cbpz9
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/1defd5a0004201a0f116f48dd2a21cba16647a3c8fdfde802fb5ea1d3e5591ff/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/459bf58b1cafcc9ab673d30b92ae815a093e85593ab01921b9ba6e677e36fe45/mounts/shm
overlay                   47G   45G  2.7G   95% /var/lib/docker/overlay2/49197bcd5b63e30abc94315b0083761a4fd25ebf4341d2574697b84e49350d53/merged
```

As you can see, /dev/mapper/centos-root is 95% used, which exceeds the preset threshold, so the Pod gets deleted. This is exactly why the MySQL Pod keeps being deleted and rebuilt.

The solution is to add more disk space, so we expand the storage by another 100G:

```
[root@k8s-worker-1 ~]# df -h
Filesystem              Size  Used Avail Use% Mounted on
devtmpfs                6.3G     0  6.3G   0% /dev
tmpfs                   6.3G     0  6.3G   0% /dev/shm
tmpfs                   6.3G   19M  6.3G   1% /run
tmpfs                   6.3G     0  6.3G   0% /sys/fs/cgroup
/dev/mapper/centos-root  147G   43G  105G  30% /
/dev/vda1               1014M  304M  711M  30% /boot
tmpfs                   1.3G  4.0K  1.3G   1% /run/user/42
tmpfs                   6.3G   12K  6.3G   1% /var/lib/kubelet/pods/d01f8686-e066-4ebf-951e-e5fe9d39067d/volumes/kubernetes.io~secret/node-exporter-token-wnzc6
tmpfs                   6.3G   12K  6.3G   1% /var/lib/kubelet/pods/9962f8d2-f6bb-4981-a073-dd16bfa9a171/volumes/kubernetes.io~secret/kube-proxy-token-vnxh9
tmpfs                   6.3G   12K  6.3G   1% /var/lib/kubelet/pods/251e9c86-4f25-42bd-82a0-282d057fe238/volumes/kubernetes.io~secret/nginx-ingress-token-cbpz9
tmpfs                   6.3G   12K  6.3G   1% /var/lib/kubelet/pods/f5872331-14b1-402b-99e0-063834d834fa/volumes/kubernetes.io~secret/calico-node-token-hvs7q
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/7380ac7d8f83ba37ddae785e5b4cd65ef7f9aa138bfb04f86e3c7f186f54211a/merged
shm                      64M     0   64M   0% /var/lib/docker/containers/3c90444a51820f83954c4f32a5bc2d1630762cdf6d3be2c2f897a3f26ee54760/mounts/shm
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/c841e85e88fdcfe9852dcde33849b3e9c5a229e63ee5daea374ddbc572432235/merged
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/147a81e8a50401ec90d55d3d4df3607eb5409ffe10e2c4c876c826aa5d47caf0/merged
shm                      64M     0   64M   0% /var/lib/docker/containers/9e2c04b858025523e7b586fe679a429ac49df3711881261cda40b158ad05aebf/mounts/shm
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/83e01c8cda50233088dc70395a14c861ac09ce5e36621f1d8fdd8d3d3e0a7271/merged
shm                      64M     0   64M   0% /var/lib/docker/containers/f23362117532f08ff89f937369c3e4d2039d55a9ba51f61e41e62d725b24e3a1/mounts/shm
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/0cfe0dbd0c633e13a42bd3d69bd09ea51ab4354d77a0e6dcf93cabf4c76c3942/merged
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/7b83010457d86cecf3c80ebc34d9db5d26400c624cba33a23f0e9983f7791aef/merged
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/0f31c96b1961d5df194a3710fdc896063a864f4282d7a287b41da27e4d58a456/merged
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/f67a6de6a1b18d4748581230ed7c34c8f16d8f0dd877a168eb12eacf6bf42f05/merged
shm                      64M     0   64M   0% /var/lib/docker/containers/e3eb1ea1785e35045213518dd6814edcd361b501748b8e6bdede20c8961062d2/mounts/shm
overlay                 147G   43G  105G  30% /var/lib/docker/overlay2/5cec0d1a7723dfcb0e5eaf139f4965a220575557795ad2959ce100aa888dc12b/merged
tmpfs                    1.3G   32K  1.3G    1% /run/user/0
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/704657eb-ea28-4fb0-8aee-c49870e692d3/volumes/kubernetes.io~secret/default-token-7s6hb
overlay                  147G   43G  105G   30% /var/lib/docker/overlay2/da4073b560a2ce031fa234624c09757b65eb7b6cfc895186dbf8731e2d279fee/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/76a6814a838778049495e9f8b2b93e131d041c8f90e8dea867d3c99fa6ca918b/mounts/shm
tmpfs                    6.3G   12K  6.3G    1% /var/lib/kubelet/pods/a73b0bc6-76fc-4e2a-9202-380397399b76/volumes/kubernetes.io~secret/default-token-7s6hb
overlay                  147G   43G  105G   30% /var/lib/docker/overlay2/cea9c51e718964cc46824ba51ff631a898402318c19e9603c6d364ac3bed8a27/merged
shm                       64M     0   64M    0% /var/lib/docker/containers/d936e646d12f7b8381a36e8a11094d76a0a72d95f84edf3f30c7e8b3981264e0/mounts/shm
overlay                  147G   43G  105G   30% /var/lib/docker/overlay2/4c210c659428999d000676fde7f1c952f1f43d68b63b308fa766b0ce41568f06/merged
[root@k8s-worker-1 ~]#
```

This perfectly solves the problem of the MySQL Pod being continuously deleted and rebuilt, doesn't it!

However, we celebrated a bit too soon. Just as I was running the stability scenario again, before any results came in, a new problem appeared! So we had to move on to the third phase of targeted analysis.

Targeted Analysis Phase 3 #

In the previous phase, we found that the MySQL Pod was being deleted and rebuilt. So when the scenario ran again, I directly checked the status of all Pods in each namespace, and found the following information:

![](../images/df3c0fd1226e40c682dc4f8c7a0afdfe.jpg)

For some unknown reason, the Order container is constantly being evicted.
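
For reference, a quick way to list evicted Pods and see the reason Kubernetes recorded for them; a minimal sketch (the namespace and Pod name are assumptions):

```
# Evicted Pods stay in Failed phase until they are cleaned up, so they are
# easy to list.
kubectl get pods -n default --field-selector=status.phase=Failed

# The describe output spells out the eviction reason and message.
kubectl describe pod <order-pod-name> -n default | grep -iA 3 "Reason"
```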

By checking the logs, I found the following information:

![](../images/f041136f34f144688f5f2235845d9a1b.jpg)

DiskPressure! This means that the disk on the worker node where the Order service runs is being used up. On the virtual machine hosting the Order application, checking the kubelet logs turned up the following message:

kubelet: W0222 14:31:35.507991    1084 eviction_manager.go:344] eviction manager: attempting to reclaim ephemeral-storage

This means the node’s ephemeral storage has been exhausted and the eviction manager is trying to reclaim it. Checking the Kubernetes events, we also found the hint “The node was low on resource: ephemeral-storage”.
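
For reference, these signals can also be pulled up directly with kubectl; a minimal sketch (the node name is a placeholder):

```
# DiskPressure=True in the node conditions is what triggers the eviction manager.
kubectl describe node <order-node> | grep -A 8 "Conditions:"

# Recent eviction-related events across all namespaces, newest last.
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp \
  | grep -iE "evict|ephemeral-storage" | tail -n 20
```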

But the Order service only writes logs to disk, doesn't it? Its other operations mainly consume CPU. So why would it use up so much disk space? Checking its logs, we found errors like the following:

```
[2021-02-25 20:48:22.022] [org.apache.juli.logging.DirectJDKLog] [http-nio-8086-exec-349] [175] [ERROR] Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.jdbc.UncategorizedSQLException: 
### Error querying database.  Cause: java.sql.SQLException: Index oms_order is corrupted
### The error may exist in URL [jar:file:/app.jar!/BOOT-INF/lib/mall-mbg-1.0-SNAPSHOT.jar!/com/dunshan/mall/mapper/OmsOrderMapper.xml]
### The error may involve com.dunshan.mall.mapper.OmsOrderMapper.selectByExample-Inline
### The error occurred while setting parameters
### SQL: SELECT count(0) FROM oms_order WHERE (delete_status = ? AND member_id = ?)
### Cause: java.sql.SQLException: Index oms_order is corrupted
; uncategorized SQLException; SQL state [HY000]; error code [1712]; Index oms_order is corrupted; nested exception is java.sql.SQLException: Index oms_order is corrupted] with root cause
java.sql.SQLException: Index oms_order is corrupted
```

There are many of these errors, and they are increasing rapidly. Is it possible that the Order table is corrupt?
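
To confirm that it really is the error logs that are eating the node’s ephemeral storage, the simplest check is to watch the log directory grow. A minimal sketch, assuming the Order service writes its logs under /app/logs inside the container (the path and Pod name are assumptions):

```
# Size of the log directory inside the Order container, sampled a minute apart;
# a rapidly growing number points at runaway error logging.
kubectl exec <order-pod-name> -- du -sh /app/logs
sleep 60
kubectl exec <order-pod-name> -- du -sh /app/logs

# On the node itself, container logs and writable layers live under
# /var/lib/docker by default, so the same check works there too.
du -sh /var/lib/docker/containers/* 2>/dev/null | sort -h | tail -n 5
```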

In the Order service logs, it is clear that there is an SQL execution error. However, we have already increased the disk space of the MySQL machine, so it should not be caused by insufficient disk space for MySQL. So let's take a look at the MySQL error logs:

```
2021-02-25T11:26:35.520327Z 0 [Note] InnoDB: Uncompressed page, stored checksum in field1 2147483648, calculated checksums for field1: crc32 1193184986/3495072576, innodb 846701958, none 3735928559, stored checksum in field2 2147483659, calculated checksums for field2: crc32 1193184986/3495072576, innodb 810726412, none 3735928559, page LSN 0 0, low 4 bytes of LSN at page end 3836608512, page number (if stored to page already) 327680, space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be a freshly allocated page
2021-02-25T11:26:35.520373Z 0 [Note] InnoDB: It is also possible that your operating system has corrupted its own file cache and rebooting your computer removes the error. If the corrupt page is an index page. You can also try to fix the corruption by dumping, dropping, and reimporting the corrupt table. You can use CHECK TABLE to scan your table for corruption. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.
2021-02-25T11:26:35.520408Z 0 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=779484, page number=3836674048], should be [page id: space=111, page number=150922]
2021-02-25T11:26:35.520415Z 0 [ERROR] InnoDB: Database page corruption on disk or a failed file read of page [page id: space=111, page number=150922]. You may have to recover from a backup.
2021-02-25T11:26:35.520420Z 0 [Note] InnoDB: Page dump in ascii and hex (16384 bytes):
.............
```

From the above information, it seems that the database file is corrupt, resulting in an error.
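
The log itself points at the two standard remedies: run CHECK TABLE to confirm the corruption, and if only an index page is damaged, dump, drop, and reimport the table. A minimal sketch of those steps, assuming the database is named mall and root credentials are available (both assumptions):

```
# 1. Confirm whether the table or its indexes are corrupted.
mysql -uroot -p -e "CHECK TABLE mall.oms_order;"

# 2. If so, dump the table (reading it may require innodb_force_recovery),
#    then drop and reimport it.
mysqldump -uroot -p mall oms_order > oms_order.sql
mysql -uroot -p -e "DROP TABLE mall.oms_order;"
mysql -uroot -p mall < oms_order.sql
```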

Let’s try performing operations on this table directly in MySQL to see if an error occurs. The result is as follows:

[Err] 1877 - Operation cannot be performed. The table 'mall.oms_order' is missing, corrupt or contains bad data.

The MySQL error log above suggests that the operating system’s file cache may be corrupted and that rebooting the machine could clear the error. Since the log itself makes this recommendation, let’s restart the operating system.

However, nothing is certain in this world, and there are pitfalls everywhere. When we restarted the operating system, we encountered this screen:

Feel like crying, right? Now we can’t even find the disk…

Thinking back to the disk operation we did earlier, we had used the following command to expand the disk directly:

qemu-img resize imagename +100G

After consulting the official documentation, I found that although this command can expand the disk, the virtual machine needs to be shut down first; otherwise the disk image may be corrupted. We had not shut down the virtual machine at the time, so there was nothing left to do but rebuild it.
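
For reference, the safer sequence would have been to shut the guest down first, resize the image, and only then grow the partition, the LVM volume, and the filesystem inside the guest. A minimal sketch, assuming a qcow2 image, an LVM root volume called centos-root whose physical volume sits on /dev/vda2, and an XFS root filesystem (all of these are assumptions based on this environment’s df output):

```
# On the KVM host: stop the guest before touching its disk image.
virsh shutdown k8s-worker-1                   # guest name is an assumption
qemu-img resize /path/to/k8s-worker-1.qcow2 +100G
virsh start k8s-worker-1

# Inside the guest: grow the partition, then the LVM stack, then the filesystem.
growpart /dev/vda 2                           # assumes the LVM PV is partition 2
pvresize /dev/vda2
lvextend -l +100%FREE /dev/mapper/centos-root
xfs_growfs /                                  # XFS is grown online via its mount point
```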

After a series of twists and turns, we started the database again, waited for all the applications to connect normally, and then ran the stability scenario again. We saw the following results:

The accumulated business volume at this time is:

From the scenario data, we can see that TPS has decreased. This is because data keeps accumulating under load: the more data accumulates, the slower the database operations become. We can see the database’s resource usage by looking at worker-1:

As you can see, the system load and I/O throughput are increasingly affected as the stability run goes on and the data volume grows. The follow-up optimization direction is also clear: keep the data volume of the hot tables under control, using methods such as sharding or regular archiving. These are routine operations.
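
As an illustration of the archiving idea, orders older than a cutoff can be copied into an archive table and then deleted from the hot table in bounded batches. A minimal sketch, assuming an oms_order_archive table with the same structure already exists and that archiving by create_time is acceptable (both assumptions):

```
# Copy orders older than 90 days into the archive table, then delete them
# from the hot table in batches (repeat the DELETE until it removes 0 rows).
mysql -uroot -p mall -e "
  INSERT INTO oms_order_archive
  SELECT * FROM oms_order
  WHERE create_time < NOW() - INTERVAL 90 DAY;"

mysql -uroot -p mall -e "
  DELETE FROM oms_order
  WHERE create_time < NOW() - INTERVAL 90 DAY
  LIMIT 10000;"
```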

In the current scenario, since the stability goal has been achieved, we won’t continue further. After all, performance optimization has no end.

Summary #

As we mentioned earlier, in stability scenarios, the two key indicators we need to focus on are “runtime” and “business accumulation”. Once these two indicators are reached, the stability scenario can be concluded.

In this lesson, we analyzed two issues caused by long-running processes:

  • Insufficient disk space

This is a common problem in stability scenarios, so it is best to estimate how much disk space will be used before running the scenario. For log files in particular, make sure log rotation is in place and don’t keep them for too long; if you need to keep them, move them elsewhere (a minimal logrotate sketch follows this list).

  • Database file corruption

This problem was caused by the disk expansion. Although the disk appeared to be expanded successfully at the time, the improper operation (resizing the image while the virtual machine was still running) corrupted the database files. Even though this was an operational mistake, we saw no error messages or warnings during the operation.
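
As promised above, here is a minimal logrotate sketch for keeping application logs from growing without bound, assuming the service writes to /app/logs/*.log (the path and retention values are assumptions):

```
# Rotate the application logs daily, keep 7 compressed copies, and truncate
# the live file in place so the Java process can keep writing to it.
cat > /etc/logrotate.d/mall-order <<'EOF'
/app/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF

# Dry-run to verify the configuration before the nightly cron job picks it up.
logrotate -d /etc/logrotate.d/mall-order
```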

In addition, we also need to fully consider the accumulation of data in the database. I suggest calculating in advance what impact the growth of the database will have on the application: the disk should be able to hold the data the tables will accumulate, and query performance should remain good as the data grows.
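
As a back-of-envelope illustration of that calculation, you can multiply the expected number of rows by an average row size and add some headroom; the numbers below are assumptions for illustration, not measurements from this system:

```
# Rough disk estimate: expected rows over the run times the average row size
# (data + indexes), with headroom for binlogs, undo logs and temp files.
ROWS=72000000        # accumulated business volume of a 12-hour run (assumption)
ROW_BYTES=1024       # average bytes per row including indexes (assumption)
HEADROOM=2           # multiplier for binlogs, undo, temp files (assumption)

echo "$ROWS $ROW_BYTES $HEADROOM" | \
  awk '{printf "Estimated space needed: %.1f GB\n", $1 * $2 * $3 / 1024 / 1024 / 1024}'
```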

I cannot list all the issues that may arise in stability scenarios here. I can only provide you with an analysis approach through actual cases for your reference. In your application scenario, you can analyze specific issues based on the RESAR performance engineering theory.

In summary, stability scenarios are a severe test for systems in performance engineering. “Designing stability scenarios sensibly and ensuring they meet production business scenarios” is the goal we must achieve.

Homework #

That’s all for today. Finally, I have two questions for you to think about:

  1. Have you encountered any problems in stability scenarios caused by continuous accumulation of business?
  2. In stability scenarios, how can we ensure that disk usage does not continue to increase with the scenario, but remains at a certain level of usage?

Remember to discuss and exchange your thoughts with me in the comments section. Every thought will help you make further progress.

If you have gained something from this lesson, feel free to share it with your friends and learn and progress together. See you in the next lesson!