
22 Storage Costs: How to Calculate the Implementation Costs in Log Centers #

Hello, I am Xu Changlong.

Earlier, we compared many technologies, and if you paid close attention, you would have noticed that we often consider the implementation cost when making comparisons. This is because being meticulous in technology selection can help us save a lot of money. Have you ever thought about how to calculate the cost systematically?

In this lesson, I will guide you through the cost calculation using the example of a log center.

There are two main reasons for choosing a log center as the example. First, it is important and universal: as a core component of system monitoring, almost all monitoring and troubleshooting depends on a log center, and most systems make use of one. Second, a log center is costly and its cost calculation is fairly complex. If you can follow the examples in this lesson, it will be much easier to calculate the cost of other components using a similar approach in the future.

Calculating Storage Capacity and Investment Cost Based on Traffic #

In internet services, the biggest variable lies in user traffic. Compared to regular services, high-concurrency systems need to serve a larger number of online users simultaneously. Therefore, when designing the capacity for such systems, we need to calculate how much hardware investment is required based on user request volume and simultaneous online users.

Many systems initially use cloud services to implement their log centers, but once the core API traffic exceeds 100,000 QPS, many companies consider building their own data centers. They may even continue to improve the log center and create customized services.

In fact, these optimizations and in-house implementations are closely tied to cost. This may not be obvious yet, so let's work through the storage capacity and cost of a website's log center with an example.

Typically, the QPS (Queries Per Second) of the core APIs for a high-concurrency website during peak hours is around 300,000. Let’s calculate based on 8 hours per day and assume that each core API request generates a 1KB log. In that case, we can calculate the daily request volume and daily log data volume as follows:

  • Daily request volume = 3600 seconds X 8 hours X 300,000 QPS = 8,640,000,000 requests/day, i.e. 8.6 billion requests/day
  • Daily log data volume = 8,640,000,000 X 1KB ≈ 8.6TB/day

You may wonder why we calculate based on 8 hours per day. This is because the user traffic of most websites tends to follow patterns, with some websites experiencing peak traffic during commute hours and at night, while others have concentrated traffic during working hours. Combining this with the fact that each person only has around 8 hours of focused time per day, it is reasonable to calculate based on 8 hours per day.

Of course, this value is for reference only, and different businesses may have different performance. You can adjust this value based on your own website’s user habits using this line of thinking.

Let’s go back to the topic at hand. Based on the calculation above, if each request generates a 1KB log, then roughly 8.6TB of logs need to be captured, transmitted, organized, processed, and stored every day. To facilitate problem tracing, we also need to set a log retention period, which we’ll assume to be 30 days. The 30-day log volume is therefore 258TB, calculated as follows:

8.6TB/day X 30 days = 258TB
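The volume estimate above can be sketched in a few lines of Python. The figures come from the text; the variable names are mine, and I use decimal units (1TB = 10^9 KB) to match the text's rounding:

```python
# Estimate daily and 30-day log volume from peak QPS.
# Assumptions from the text: 300,000 QPS sustained over 8 busy hours,
# 1 KB of log output per request, 30-day retention.
QPS = 300_000
BUSY_HOURS = 8
LOG_PER_REQ_KB = 1
RETENTION_DAYS = 30

daily_requests = QPS * BUSY_HOURS * 3600                    # 8,640,000,000/day
daily_logs_tb = daily_requests * LOG_PER_REQ_KB / 1000**3   # KB -> TB, ~8.6TB/day
retained_logs_tb = daily_logs_tb * RETENTION_DAYS           # ~258TB over 30 days

print(daily_requests, daily_logs_tb, retained_logs_tb)
```

Adjusting `BUSY_HOURS` or `LOG_PER_REQ_KB` lets you rerun the estimate for your own traffic pattern, as the text suggests.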

Calculating Disk Investment Based on Capacity #

After calculating the log volume, we can further estimate how much money is needed to purchase hardware.

I want to mention in advance that hardware prices are constantly changing, and prices may vary among different vendors. So the specific prices will differ. Here, we will focus on understanding the calculation methodology. Once you understand it, you can estimate based on your own actual situation.

The current price for a common server hard drive (8TB, 7200 RPM, 3.5 inches) is 2300 yuan. The actual usable capacity of an 8TB hard drive is 7.3TB. Considering the monthly log volume mentioned earlier, we can calculate the number of hard drives required. The calculation is as follows:

258TB / 7.3TB = 35.34 drives

Since the number of drives must be an integer, we need 36 drives. Multiplying the quantity by the price gives us the amount spent on purchasing hardware, which is:

2300 yuan X 36 = 82,800 yuan

To ensure data safety and enhance query performance, we often use distributed storage services that keep three copies of the data. With three replicas, we need a minimum of 108 drives (36 X 3). Therefore, we can calculate the investment cost as follows:

82,800 yuan X 3 data replicas = 248,400 yuan

If we want to ensure data availability, we also need RAID 5 for the hard drives. This means grouping several drives together to serve storage as one unit, with most of the capacity holding data and the remainder holding parity. The data-to-parity ratio can vary; for ease of calculation, we will use groups of four disks, where three disks' worth of capacity holds data and one disk's worth holds parity.

The formula for calculating capacity in RAID 5 is as follows:

  • Capacity of a single RAID 5 group = ((n-1)/n) * Total disk capacity, where n is the number of disks

Substituting the number of disks into the formula, we have:

((4-1)/4) X (7.3TB X 4) = 21.9TB = the capacity of three 8TB hard drives
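The RAID 5 capacity formula translates directly into a small helper function. This is a sketch using the text's numbers (four-disk groups, 7.3TB usable per 8TB drive); the function name is mine:

```python
def raid5_capacity(n_disks: int, disk_tb: float) -> float:
    """Usable capacity of one RAID 5 group: (n-1)/n of the raw total."""
    return (n_disks - 1) / n_disks * (disk_tb * n_disks)

# Four 8TB drives (7.3TB usable each) in one group:
print(raid5_capacity(4, 7.3))  # ≈ 21.9TB, the capacity of three drives
```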

This result indicates that in a RAID 5 group of four disks, only three disks' worth of capacity is available for data. To preserve 108 disks' worth of data capacity, we therefore need one extra parity disk for every three data disks:

108 / 3 = 36 parity disks

The total number of hard drives required is 108 data drives + 36 RAID 5 parity drives = 144 drives, with each drive costing 2300 yuan. The total cost is:

144 X 2300 yuan = 331,200 yuan

For convenience, let’s round it up to 330,000 yuan.

In addition to availability, we also need to consider the lifespan of the hard drives. Hard drives are prone to failure: after two to three years of continuous operation, bad sectors start to appear one after another. Because of slow deliveries, stock shortages, and logistics problems, we need to keep about 40 drives in stock as replacements (many companies keep roughly one-third of their total drive count as spares). The approximate maintenance cost for this is 2300 yuan X 40 = 92,000 yuan.

So far, the minimum hardware investment, covering the one-time purchase of the hard drives plus the spare drives, is 33W + 9.2W ≈ 42W yuan (here 1W = 10,000 yuan).
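The whole drive-count and cost rollup above can be reproduced as follows. All figures are from the text; the variable names are mine:

```python
import math

DISK_PRICE_YUAN = 2300   # per 8TB drive
USABLE_TB = 7.3          # real usable capacity of an 8TB drive
RETAINED_TB = 258        # 30-day log volume from the text
REPLICAS = 3
SPARE_DRIVES = 40        # replacement stock kept on hand

data_disks = math.ceil(RETAINED_TB / USABLE_TB)   # 35.34 -> 36 drives
replica_disks = data_disks * REPLICAS             # 108 drives for 3 copies
parity_disks = replica_disks // 3                 # 1 parity per 3 data disks
total_disks = replica_disks + parity_disks        # 144 drives in total

purchase_cost = total_disks * DISK_PRICE_YUAN     # one-time purchase
spare_cost = SPARE_DRIVES * DISK_PRICE_YUAN       # replacement stock
print(total_disks, purchase_cost, spare_cost)
```

Note the `math.ceil`: drive counts must round up, which is why 35.34 becomes 36.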

Calculating Server Investment based on Hard Drives #

Next, we need to calculate the costs for the servers. Since there are different server specifications, and each specification can accommodate a different number of hard drives, the situation is as follows:

  • A regular 1U server can accommodate 4 3.5-inch hard drives and 2 SSD hard drives.
  • A regular 2U server can accommodate 12 3.5-inch hard drives and 6 SSD hard drives.

In the previous section, we calculated the hard drive requirements, and in the case of using 2U servers, we would need 12 servers (144 hard drives / 12 = 12 servers).

Let’s calculate the hardware investment cost for the servers, considering each server costs 30,000 yuan, as follows:

12 servers X 30,000 = 360,000 yuan

On another note, the replicas of the same data should be deployed in different cabinets and switches, with the aim of improving availability.

Calculating Maintenance Costs based on Server Hosting #

Okay, let’s get back to the topic of calculating costs. Besides purchasing the servers, we also need to calculate the maintenance costs.

If we host the 2U servers in a good data center, the average hosting cost per server per year is about 10,000 yuan. Earlier, we calculated that we would need 12 servers, so the annual hosting cost is 120,000 yuan.

Now let’s calculate the initial investment for the first year, which includes the hard drive investment and maintenance costs, server hardware costs, hosting costs, and broadband costs. The calculation formula is as follows:

Initial investment cost = 42W yuan (hard drive purchase and spare drives) + 36W yuan (one-time server investment) + 12W yuan (server hosting cost) + 10W yuan (broadband cost) = 100W yuan

The annual maintenance costs, including hard drive replacement costs (assuming all spare drives are used), server maintenance costs, and broadband costs, are calculated as follows:

9.2W yuan (spare hard drives) + 12W yuan (annual hosting) + 10W yuan (annual broadband) = 31.2W yuan

Based on the initial investment cost and the annual maintenance costs, we can calculate the total cost for running the core service (a 30W QPS website) for three years, as follows:

31.2W X 2 years = 62.4W; adding the initial investment of 100W gives 162.4W yuan
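As a quick check, the three-year total cost of ownership can be computed from the line items above (all amounts in W, i.e. units of 10,000 yuan, as in the text):

```python
# Three-year TCO for the log center, in W (10,000 yuan).
# Line items are taken from the text.
initial = 42 + 36 + 12 + 10   # drives+spares, servers, hosting, broadband
annual = 9.2 + 12 + 10        # spare drives, hosting, broadband per year

# Year 1 is covered by the initial investment; add two more years.
three_year_total = initial + annual * 2
print(three_year_total)  # ≈ 162.4
```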

Of course, these prices do not account for bulk-purchase discounts, capacity redundancy, or other costs such as network equipment, adapter cards, and labor. Even ignoring those factors, after working through the calculations above, and imagining a scenario where ELK runs across 2,000 servers, you should now appreciate how expensive adding even one more log line can be.

Server Purchase Redundancy #

Next, let’s talk about the need for purchasing servers with redundancy. If you haven’t experienced this firsthand, it is easy to overlook this aspect.

If you are hosting in a core data center, you need to consider the server procurement and installation cycle. Due to a shortage of available rack space in many core data centers, many companies purchase additional servers in advance to prepare for business growth in the coming years. Some companies used to prepare four times the estimated number, but the redundancy ratio cannot be universally determined due to different growth rates of different enterprises.

Personally, I prefer to estimate three years' worth of server procurement based on the current traffic growth trend. Looking back at the server cost calculation we did earlier, we only budgeted enough capacity for the current traffic, which is already quite thrifty. When making estimates, you must factor in redundancy.

How to save storage costs? #

Generally speaking, businesses have a growth period. When our business is in a rapid development and iteration stage, it is recommended to invest more in hardware to support the business. Once our business form and market stabilize, we need to start thinking about how to reduce costs while ensuring service quality.

Temporary solutions for handling traffic #

If the server purchase does not leave any redundancy and the service traffic grows, what temporary solutions do we have?

We can start from two directions: saving server storage or reducing log volume. Here are some ways:

  • Reduce the log retention period from 30 days to 7 days, which can save three-quarters of the space.
  • Separate logs between non-core business and core business, storing non-core business logs for only 7 days and core business logs for 30 days.
  • Reduce the log volume by investing manpower in analysis. We can appropriately reduce the output of troubleshooting logs for stable business.
  • If there are more servers or fewer disks, and the server CPU is not under much pressure, data can be compressed to save half of the disk space.
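The last point, compression, is easy to demonstrate. The sketch below uses Python's standard `zlib` on a synthetic, hypothetical log line; highly repetitive synthetic data compresses far better than the roughly 50% the text cites for real logs, so treat this only as an illustration of the mechanism:

```python
import zlib

# Synthetic, repetitive access-log-style line (hypothetical format).
line = '2024-01-01T12:00:00 INFO api=/v1/order status=200 latency_ms=35 uid=1001\n'
raw = (line * 10_000).encode()

compressed = zlib.compress(raw, 6)  # level 6: default speed/ratio trade-off
ratio = len(compressed) / len(raw)
print(f'compressed to {ratio:.1%} of original size')
```

In practice the trade-off is CPU for disk: compression only pays off when, as the text says, the server CPUs are not already under pressure.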

These temporary solutions can indeed solve our urgent needs for the time being. However, when saving costs, it is recommended not to sacrifice business services, especially core business. Next, let’s discuss a special situation.

If traffic during peak periods surges far beyond 300,000 QPS, whether from a sudden spike or a large number of faults, even a log center that normally runs without errors will be affected and start behaving abnormally.

During such peaks, logs can be delayed by half an hour or even a full day. The ultimate consequence is that system alerts are no longer timely, and when we investigate a problem we cannot find real-time fault information, which seriously undermines the value of the log center.

This situation arises because log centers generally adopt a shared, multi-tenant design with poor isolation. When individual systems start frantically reporting errors, their logs can consume all of the log center's resources. To avoid this risk, core services often use a separate log service, independent of peripheral businesses, to ensure timely monitoring of the core.

Hot and cold separation of storage for high-concurrency writing #

In order to save costs, we can also make efforts from a hardware perspective. If our service cycle has peak periods and the traffic is not high during non-peak times, it is a waste to purchase too many servers. At this time, using high-performance hardware to withstand the pressure during peak periods can save more costs.

For example, the sequential write performance of a single disk is about 200MB/s. Under RAID 5, effective per-disk write performance is roughly halved, so one server delivers about 100MB/s X 9 data disks (12 disks minus 3 parity disks) = 900MB/s of write throughput. For a log center that writes in real time and reads rarely, this disk throughput is barely enough.
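The per-server write budget works out as follows. The figures are the text's rough estimates; the halving factor is the simplification stated above, not a precise RAID 5 model:

```python
# Per-server sequential write throughput under RAID 5 (rough estimate).
disk_write_mb = 200        # single-disk sequential write, MB/s
raid5_factor = 0.5         # RAID 5 roughly halves effective per-disk writes
disks_per_server = 12      # 3 RAID 5 groups of 4 drives
data_disks = disks_per_server - 3  # one parity disk per group -> 9 data disks

per_server_mb = disk_write_mb * raid5_factor * data_disks
print(per_server_mb)  # 900.0 MB/s
```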

However, in order for our log center to withstand extreme high peak traffic pressure, it often requires a few more steps. So let’s continue to deduce here, if the real-time write traffic surges and exceeds our estimate, how to respond quickly?

Generally speaking, to deal with this situation we can use hot and cold separation. When write demand increases dramatically, SSDs handle the heavy writes, while regular hard disks store the cold data. If there are about 8TB of new logs per day and one full copy is spread across 4 servers, then each server needs to absorb at least 2TB/day of new data.

A 1TB M.2 SSD with an actual usable capacity of 960GB costs about 1800 yuan, and its sequential write performance is approximately 3-5GB/s (a rough figure).

Each server needs two SSDs, for a total of 24 1TB SSDs (not counting the cost of the adapter cards for now). The initial SSD investment is 43,200 yuan, calculated as follows: 1800 yuan X 12 servers X 2 SSDs = 43,200 yuan

Similarly, SSDs need to be replaced regularly. With a lifespan of about three years, roughly 8 of the 24 drives are replaced each year, for an annual maintenance cost of 1800 yuan X 8 = 14,400 yuan.
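The SSD hot-tier budget, restated as code (figures from the text, names mine):

```python
# Cost of the SSD hot tier for absorbing peak writes.
SSD_PRICE_YUAN = 1800
SERVERS = 12
SSDS_PER_SERVER = 2
LIFESPAN_YEARS = 3

total_ssds = SERVERS * SSDS_PER_SERVER          # 24 drives in the hot tier
initial_cost = total_ssds * SSD_PRICE_YUAN      # one-time purchase
annual_swaps = total_ssds // LIFESPAN_YEARS     # ~8 replacements per year
annual_cost = annual_swaps * SSD_PRICE_YUAN     # yearly replacement budget
print(initial_cost, annual_cost)
```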

I would like to add some additional information here. In addition to improving write performance, SSDs can also improve read performance. Some distributed retrieval systems can provide automatic hot-cold migration functionality.

How many network cards are more cost-effective #

By separating SSDs and hot-cold data, the write pressure of peak logs can be delayed. However, when our server disks can handle the traffic, another bottleneck will gradually emerge, which is the network.

Generally speaking, internal network speeds are not bad, but in some small self-built machine rooms the switches provide 10 Gigabit (Gbps) bandwidth while the servers only have Gigabit (1 Gbps) network cards.

Theoretically, the transmission speed of a Gigabit network card is 1000 Mbps / 8 = 125MB/s. In reality it cannot reach the theoretical speed; measured transfer speeds are about 100MB/s. As a result, when we transfer large data files within the internal network, the bandwidth is often saturated.

In earlier times, in order to improve network throughput, methods such as multiple network cards connected to a switch and server bonding were used to increase network throughput.

After fiber network cards became widespread, 10 Gigabit (10 Gbps) optical network cards came into common use, with a theoretical transmission rate of 10,000 Mbps / 8 = 1250MB/s. In practice the actual speed only reaches around 900MB/s, equivalent to 7200 Mbps.

Let’s go back and see how much data throughput the previously mentioned peak period logs have. It is calculated as follows:

300,000 QPS X 1KB ≈ 292.96MB/s

As mentioned earlier, a Gigabit network card delivers about 100MB/s, so four servers can barely handle this load. If peak traffic reaches a multiple of this, however, they will not be enough. Therefore, we need to upgrade the network equipment, which means switching to 10 Gigabit network cards.

However, a 10 Gigabit network card needs to be used with a better layer 3 switch to achieve its performance. This kind of switch has become popular in recent years, and the cost of the switch is included in the infrastructure construction. Therefore, the investment cost of the switch will not be calculated separately here.

When calculating the hardware cost earlier, we mentioned that the data is stored in three replicas across three groups of servers, so equipping the servers with 10 Gigabit optical network cards is enough. However, for stability, we will not allow the network cards to run at full capacity for external services. The target transmission speed should be around 300-500MB/s, leaving the remaining bandwidth for other services or emergency use. Here, I recommend QoS, a configuration for limiting network traffic; if you are interested, you can learn more about it later.

With 12 servers divided into 3 groups (each group storing one full copy of the data), 4 servers per group, and one 10 Gigabit network card installed in each server, the network throughput of each server during normal operation is:

292.96MB/s (peak log throughput) / 4 servers ≈ 73MB/s
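The network arithmetic above, end to end (figures from the text; the 1024 divisor converts KB/s to MB/s):

```python
# Peak log throughput and the per-server share of the NIC.
qps = 300_000
log_kb = 1
servers_per_group = 4

total_mb_s = qps * log_kb / 1024                   # ~292.97 MB/s overall
per_server_mb_s = total_mb_s / servers_per_group   # ~73 MB/s per server
nic_mb_s = 10_000 / 8                              # 1250 MB/s theoretical 10GbE

print(per_server_mb_s, per_server_mb_s / nic_mb_s)
```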

In other words, a 10 Gigabit network card satisfies the daily log transmission needs using only about one-tenth of its capacity, while a Gigabit network card is not enough. You may wonder: if a Gigabit card delivers 100MB/s and the calculated throughput is only 73MB/s, why is it not enough?

This is because when estimating capacity, we must leave room for flexibility. If a Gigabit network card is used, it will be close to full capacity. Once there is a slight fluctuation, it will cause lag, severely affecting the stability of the system.

On the other hand, in actual usage, the log center is not only used to meet basic business needs but also serves the function of troubleshooting. It is also used for data mining and analysis. Otherwise, it would not be cost-effective to invest so much in building a log center.

We usually use the idle resources of the log center for limited-speed data mining. In this regard, I believe you understand why we save three copies of the logs. The purpose is to improve concurrent computing capabilities through multiple replicas. However, the focus of this lesson is to demonstrate how to calculate costs, so we will end our discussion here. If you are interested, you can explore it on your own after class.

Summary #

In this lesson, we mainly discussed how to estimate the amount of logs based on user requests in order to calculate the number of servers and the associated costs.


Take a moment to think about any deficiencies in the calculation process.

In reality, this calculation only satisfies the existing traffic needs of the business. In practice, estimating the required resources is more rigorous and takes into account additional factors. For example, after obtaining the current traffic calculation results, we also need to consider future growth. This is because data centers have limited space. If we cannot plan server resources six months in advance, and user traffic suddenly increases without sufficient hardware resources, we would have to resort to software optimization to handle unexpected situations.

Of course, calculating the investment in disks and servers based on traffic is just one approach to cost estimation. For big data mining, we also need to consider the investment in CPU, memory, network, and system isolation costs.

Different types of systems have different investment priorities. For example, services with more reads than writes should focus on more memory and network resources; strongly consistent services pay more attention to system isolation and partitioning; systems with more writes than reads emphasize storage performance optimization; systems with a high volume of both reads and writes focus on scheduling and changing system types.

Although there are many factors to consider in technical decision-making and our business and team situations may vary, I hope that through this lesson, you can grasp the thinking process of cost estimation and try to combine calculations to guide our decision-making. When you suggest building our own data center or choosing cloud services, having such a calculation as an aid will likely increase the probability of the proposal being approved.

Thought Questions #

  1. Is it more expensive to use cloud service providers for building a log center or to build it yourself?

  2. How is the cost of big data mining services calculated?

I look forward to interacting with you in the comments section and I also recommend that you share this lesson with more colleagues and friends. See you in the next lesson!