10 Case Analysis: Objectives and Precautions for Reuse of Large Objects #

In this lesson, we will discuss the optimization of “large objects”. “Large object” is a general concept: such objects may live in the JVM, be transferred over the network, or sit in a database. So why do large objects affect the performance of our applications?

  1. First, large objects consume a lot of resources, and the garbage collector needs to spend some effort to reclaim them.
  2. Second, exchanging large objects between different devices consumes network bandwidth and expensive I/O.
  3. Third, parsing and processing large objects is time-consuming, and if the object’s responsibilities are not focused, it will incur additional performance overhead.

By combining caching, object pooling, and saving intermediate results, we can take a first pass at speeding up work with large objects.

But this is not enough. We have only reduced the frequency of object creation without changing the fact that the objects are “large”. In this lesson, we will start from a few JDK internals, look at some object-reuse questions frequently asked in interviews, and then gradually explore strategies for reducing object size and focusing operations along two dimensions: structure and time.

String’s substring method #

We all know that String is immutable in Java. If you modify its content, it will generate a new string.

If we want to use a part of the string, we can use the substring method.

[Figure: substring creates a new String, whose constructor copies the requested range with Arrays.copyOfRange]

As the figure above shows, when we need a substring, substring generates a new string, which is built in the constructor with the Arrays.copyOfRange function.

This implementation is fine in JDK 7 and later versions, but in JDK 6 it carries a memory-leak risk. We can use this case to study the problems that may arise from reusing large objects.

[Figure: the JDK 6 source of String.substring, which shares the original value array]

The figure above is taken from the JDK 6 source code. As you can see, when a substring is created, it does not copy just the characters it needs; it keeps a reference to the entire value array. If the original string is large, the memory cannot be released even once the original is no longer used.
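
Paraphrased from that JDK 6 source (simplified, with range checks omitted), the sharing looks roughly like this:

public String substring(int beginIndex, int endIndex) {
    // JDK 6: no copy is made; the new String points into the same char[]
    return ((beginIndex == 0) && (endIndex == count)) ? this
            : new String(offset + beginIndex, endIndex - beginIndex, value);
}

// package-private constructor: shares value instead of copying it
String(int offset, int count, char[] value) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}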

For example, the content of an article may be several megabytes, and we only need the summary information, but we still have to maintain the entire large object.

String content = dao.getArticle(id);        // the full article, possibly several MB
String summary = content.substring(0, 100); // in JDK 6, still references content's char[]
articles.put(id, summary);                  // so the whole article stays reachable in the cache

Interviewers who have been working for a long time may still remember substring’s behavior in JDK 6, but Java fixed this bug long ago.

The takeaway for us is: if you create a large object and derive other information from it, remember to cut the reference to the large object.
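
For the JDK 6 case above, a minimal sketch of that advice is to force a copy with the String copy constructor, cutting the reference to the original char[] (from JDK 7 onward, substring copies the range itself, so this trick is no longer necessary):

String content = dao.getArticle(id);
// the copy constructor allocates a fresh 100-char array, so content can be collected
String summary = new String(content.substring(0, 100));
articles.put(id, summary);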

Collection’s Capacity Expansion for Large Objects #

Capacity expansion is a common behavior of Java objects such as StringBuilder, StringBuffer, HashMap, and ArrayList. In general, the amount of data held by Java collections, including List, Set, Queue, and Map, is not known in advance. When capacity runs out, they expand, which requires reallocating and reorganizing the data; this is also part of why these classes are not thread-safe.

Let’s first take a look at the expansion code for StringBuilder:

void expandCapacity(int minimumCapacity) {
    // the new capacity is double the old one, plus 2
    int newCapacity = value.length * 2 + 2;
    if (newCapacity - minimumCapacity < 0)
        newCapacity = minimumCapacity;
    if (newCapacity < 0) {
        if (minimumCapacity < 0) // overflow
            throw new OutOfMemoryError();
        newCapacity = Integer.MAX_VALUE;
    }
    // the old contents are copied into the new, larger array
    value = Arrays.copyOf(value, newCapacity);
}

When capacity runs out, the backing array is roughly doubled (length * 2 + 2), and the existing data is copied over with Arrays.copyOf.

Below is the resizing code for HashMap (from JDK 7), where the capacity is likewise doubled on each resize. The process is much more complex: besides the influence of the load factor, the original data must be rehashed, and since the table cannot simply be bulk-copied with the native Arrays.copyOf, resizing is slow.

void addEntry(int hash, K key, V value, int bucketIndex) {
    // expand when size reaches the threshold (capacity * load factor) and the bucket is occupied
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}

void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }
    Entry[] newTable = new Entry[newCapacity];
    // every entry is rehashed into the new table; no bulk array copy is possible
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    table = newTable;
    threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}

You can look up the code for ArrayList on your own. Its resizing strategy is to make the new length 1.5 times the original length.

Since collections are used very frequently in code, if you know the upper bound on the number of items, it is advisable to set a reasonable initial size. For example, growing a HashMap from its default capacity of 16 until it can hold 1024 elements requires 7 resizes, each of which reorganizes the data and hurts application performance. This question appears frequently in interviews, and you need to understand the impact of these resizing operations on performance.

However, note that for a collection with a load factor (0.75 for HashMap), the initial size should be set to expected number of elements / load factor + 1. If you are not very familiar with the underlying data structures, it is safer to keep the default size.
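
A minimal sketch of that rule for the 1024-element example above (the element type is illustrative):

// 1024 / 0.75 + 1 = 1366; HashMap rounds this up to the next power of two, 2048,
// whose threshold (2048 * 0.75 = 1536) holds all 1024 entries without a single resize
Map<String, String> map = new HashMap<>((int) (1024 / 0.75f) + 1);

// the same idea applies to StringBuilder: presizing avoids repeated expandCapacity copies
StringBuilder sb = new StringBuilder(64 * 1024);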

Next, I will explain the optimization at the application level from the perspectives of data structure and time dimension.

Maintain the appropriate object granularity #

Let me share a practical case with you: we have a high-concurrency business system that frequently needs users’ basic data.

As shown in the figure below, the users’ basic information is stored in another service, so every time we need it, a network interaction is required. What is even harder to accept is that even when we need only the user’s gender attribute, we still have to query and pull back the user’s entire record.

[Figure: every request for a user attribute triggers a network call to the user service]

To speed up data queries, as described earlier in [08 | Case Study: How Does Redis Help with Seckill Business], we cached the data in Redis. This greatly improved query performance, but each query still pulls back a lot of redundant data.

The original Redis key is designed as follows:

type: string 
key: user_${userid} 
value: json

This design has two problems:

  • To query the value of a single field, the entire JSON must be fetched and parsed.
  • To update the value of a single field, the entire JSON string must be rewritten, which is costly.

For large-granularity JSON data like this, we can optimize by splitting it up, so that every update and query has a focused target. The data in Redis is redesigned as follows, using a hash structure instead of a JSON string:

type: hash 
key: user_${userid} 
value: {sex:f, id:1223, age:23}

In this way, we can use the hget or hmget commands to retrieve exactly the fields we want, speeding up the flow of information.
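
A minimal sketch of this access pattern using the Jedis client (connection details and values are illustrative):

import java.util.List;
import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    jedis.hset("user_1223", "sex", "f");                        // update a single field
    String sex = jedis.hget("user_1223", "sex");                // read a single field
    List<String> pair = jedis.hmget("user_1223", "sex", "age"); // read a chosen subset
}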

Bitmap Shrinks Objects #

Beyond the operations above, can we optimize further? For example, our system makes frequent use of users’ gender data: to distribute gifts, to recommend opposite-sex friends, to periodically clean up user state, and so on. We also store status flags, such as whether a user is online, has signed in, or has recently sent messages, in order to count active users. For such yes/no values, we can compress the storage with a Bitmap structure.

Here’s another frequently asked interview question: how many bits does a Java Boolean occupy?

The Java Virtual Machine Specification describes boolean values as being operated on as the int values 1 and 0, occupying the same 32 bits as an int. Even on virtual machine implementations that map boolean to the byte type, the space consumed is still far too large when storing huge numbers of ordinary boolean values.

As the code below shows, by using each of its bits, a single int can store 32 boolean values!

int a = 0b0001_0001_1111_1101_1001_0001_1111_1101;
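
A minimal sketch of how reading and writing those bits works (the helper names are illustrative):

static boolean getBit(int bits, int index) {
    return (bits & (1 << index)) != 0;    // test one of the 32 flags
}

static int setBit(int bits, int index, boolean on) {
    return on ? (bits | (1 << index))     // switch the flag on
              : (bits & ~(1 << index));   // or clear it
}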

Bitmap is a data structure that stores data in individual bits, each of which is 0 or 1. Remember our discussion of cache penetration in [08 | Case Study: How Does Redis Help with Seckill Business]? A Bitmap can be used to avoid it. The relevant class in Java is java.util.BitSet, which uses a long array as its underlying storage, so its minimum capacity is 64 bits.

One billion boolean values require only about 128MB of memory. Below is a gender-determination example that occupies 256MB (two BitSets) and can cover user ids up to roughly one billion.

// 1 << 30 bits = 128MB per BitSet (the original octal literal 010_000_000_000 also equals 2^30)
static BitSet missSet = new BitSet(1 << 30); // marks whether a user's sex has been loaded
static BitSet sexSet = new BitSet(1 << 30);  // the value itself: 1 = female, 0 = male

String getSex(int userId) {
    boolean fetched = missSet.get(userId);
    if (!fetched) {
        // lazy fetch: load from the database on first access, then cache the bit
        String lazySex = dao.getSex(userId);
        missSet.set(userId, true);
        sexSet.set(userId, "female".equals(lazySex));
    }
    return sexSet.get(userId) ? "female" : "male";
}

Holding this data in heap memory may still be too much. Fortunately, Redis also supports the Bitmap data structure. If there is memory pressure, we can move this structure into Redis; the logic for determining gender remains much the same.
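
A minimal sketch of the Redis variant, again assuming the Jedis client (the key name and bit mapping are illustrative):

import redis.clients.jedis.Jedis;

void cacheSex(Jedis jedis, long userId, String sex) {
    jedis.setbit("user_sex", userId, "female".equals(sex)); // one bit per user id
}

boolean isFemale(Jedis jedis, long userId) {
    return jedis.getbit("user_sex", userId);
}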

Here is another interview algorithm question: given a machine with 1GB of memory and 6 billion int values, how can we quickly determine which values are duplicated?

Take a moment to think about it. Bitmap is a relatively low-level data structure. Built on top of it is another structure called the Bloom Filter, which can tell you that a value definitely does not exist, or that it may exist.

[Figure: a Bloom Filter hashes each value onto multiple bits of a bitmap]

As shown, unlike a plain Bitmap, a Bloom Filter adds a layer of hash functions. Since hashes can collide, multiple values may land on the same bit. Unlike HashMap, which resolves collisions with linked lists or red-black trees, a Bloom Filter simply shares the hash slot. From this property we can see that a Bloom Filter can state definitively that a value is not in the set, but it cannot guarantee that a value is in the set.

Guava provides a BloomFilter class that makes this functionality easy to use.
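
A minimal sketch with Guava (the expected insertion count and error rate are illustrative):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

BloomFilter<Integer> seen = BloomFilter.create(
        Funnels.integerFunnel(), 1_000_000, 0.01); // 1M expected values, 1% false positives
seen.put(1223);
boolean maybe = seen.mightContain(1223);  // true: may exist
boolean surely = seen.mightContain(9999); // false: definitely does not exist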

The optimization techniques above essentially shrink large objects into smaller ones, which is a common approach in software design. For example, for a newly published article, the frequently used data is its abstract, so there is no need to fetch the entire body; for a user’s feed, only the visible part needs to be fast, while the complete information can sit in slower, larger storage.

Separation of Hot and Cold Data #

Besides the horizontal, structural dimension of data, there is also a vertical dimension: time. The most effective optimization along the time dimension is separating hot and cold data. Hot data is data that is close to the user and frequently accessed, while cold data has very low access frequency and a long history.

Running the same complex SQL over a table with tens of millions of rows performs far worse than over a table with only millions of rows. So even if your system was fast when first launched, it will gradually slow down as data accumulates over time.

Hot-cold separation divides the data into two parts, as shown in the diagram below. Typically, one complete copy of the data is kept for time-consuming statistical operations.

[Figure: hot data served from fast storage, with a complete cold copy kept for statistics]

Cold-hot separation is a topic that is often encountered in work, so interviewers frequently ask about solutions for separating hot and cold data. Here, I will briefly introduce three methods:

1. Data Dual Writing #

All insert, update, and delete operations are applied to both the cold and hot databases within a single transaction. Since the hot database (e.g. MySQL) and the cold database (e.g. HBase) are of different types, this is most likely a distributed transaction. In the early stages of a project this approach is feasible, but retrofitting distributed transactions into legacy systems is difficult, so I usually abandon this option there.

2. Writing to MQ for Distribution #

When performing data operations, instead of writing directly to the database, the data is first sent to a message queue (MQ) using the publish-subscribe mechanism. A separate consumer process then writes the data from the MQ into the cold and hot databases respectively (a sketch of such a consumer follows). This approach yields clear logic and an elegant structure. For entities with a clear structure, such as orders, and low ordering requirements, MQ distribution works well. However, if your database entities are very large, you need to weigh the added program complexity.
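
A minimal sketch of the consumer side, assuming Kafka as the MQ and a hypothetical Store writer interface for the two databases:

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

interface Store { void apply(String change); } // hypothetical writer (MySQL for hot, HBase for cold)

void dispatch(KafkaConsumer<String, String> consumer, Store hot, Store cold) {
    consumer.subscribe(Collections.singletonList("user-changes")); // topic name is illustrative
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> r : records) {
            hot.apply(r.value());  // recent data into the hot store
            cold.apply(r.value()); // full history into the cold store
        }
    }
}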

3. Using Binlog Synchronization #

For MySQL, the Binlog can be used for synchronization. With the Canal component, the latest Binlog data can be continuously obtained and, combined with an MQ, synchronized to other data sources.

Divergent Thinking #

For operations on result sets, we can push this thinking further: transform a simple but redundant result set into a complex, efficient data structure. This structure can act as a proxy for our requests, effectively shifting time-consuming operations elsewhere.

  • For example, the commonly used database indexing is a way to reorganize and accelerate data.

A B+ tree can effectively reduce the number of interactions between the database and the disk. Using a B+-tree-like data structure, the most frequently used data is indexed and stored within a limited amount of storage space.

  • Another example is serialization, which is commonly used in RPC.

Some services still use the SOAP protocol for WebServices, an XML-based protocol that is slow to transmit and inefficient to parse. Today, most web services exchange JSON data instead, which is far more efficient than SOAP.

In addition, many people have heard of Google’s Protobuf. Because it is a binary protocol that also compresses data, its performance is excellent. After encoding, a Protobuf payload is only about 1/10 the size of JSON and 1/20 the size of XML, while performance improves 5 to 100 times.

Protobuf’s design is worth learning from. It processes data compactly using a tag|length|value encoding, which makes parsing and transmission fast.

Summary #

To summarize the key points of this lesson:

First, we looked at the memory-leak issue caused by String.substring reusing the underlying array in older JDK versions. In everyday coding, we must pay attention to how large objects are reclaimed and cut their references in a timely manner.

Next, we looked at the resizing operations of Java collections. If you know the eventual size of a collection, specify an initial capacity to avoid time-consuming resizes.

For large objects, we have two methods of optimizing from a structural dimension and a time dimension:

From a structural perspective, we can split objects into appropriate granularities so that operations focus on small data structures, and we can avoid the storage and transmission costs of large objects by compressing, transforming, or extracting hot data.

From a time perspective, we can use hot-cold separation to keep frequently used data in high-speed storage, reducing the amount of data to process and increasing processing speed.

So far, we have covered buffering, caching, object pooling, result caching, and large object processing as performance optimization methods. However, since they all involve additional intermediate layers, the programming model becomes more complex.

Next, in Lesson 11, “Case Study: How to Optimize Performance Using Design Patterns”, I will introduce several commonly used design patterns and see how they can help us optimize performance, along with what we need to pay attention to.