03 String performance optimization It’s easy to store tens of GB data in memory with only a hundred MB capacity #

Hello, I am Liu Chao.

Starting from the second module, I will guide you through the performance optimization of Java programming. Today, let’s start with the basics of optimizing String.

String objects are among the most frequently used object types, yet their performance issues are often overlooked. As a core data type in the Java language, strings typically account for a large share of heap memory, so using them efficiently can improve overall system performance.

Next, we will look at three aspects of the String object: its implementation, its characteristics, and its optimization in practical use.

Before we start, I’d like to ask you a question, which is often asked during job interviews. Although it’s a common question, the error rate is still high. Of course, there are some interviewees who answer correctly, but very few can explain the principles behind the answer. The question is as follows:

Three String references are created in three different ways and then compared in pairs with ==. Does each comparison evaluate to true? The code is as follows:

    String str1 = "abc";
    String str2 = new String("abc");
    String str3 = str2.intern();
    System.out.println(str1 == str2);
    System.out.println(str2 == str3);
    System.out.println(str1 == str3);

You can take a moment to consider the answer and the reasons for your answer. I hope that through today’s learning, you will be able to get a perfect score.

How is the String object implemented? #

Over successive releases, the Java engineers (first at Sun Microsystems, later at Oracle) have reworked the String object to save memory and improve its performance. Let’s take a look at how it evolved:

1. In Java 6 and earlier versions, the String object is implemented by encapsulating a char array. It has four instance variables: char array, offset, count, and hash.

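The String object uses the offset and count fields to locate its characters within the char[] array. This lets several strings share one array quickly and cheaply, which saves memory, but it can also cause memory leaks: a short substring keeps a reference to the whole original array, so a large array cannot be garbage collected while the substring is still alive.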

2. From Java 7 to Java 8, Java made some changes to the String class. The String class no longer has the offset and count variables. The benefit of this change is that the String object occupies slightly less memory, and the String.substring method no longer shares the char[], thus solving the potential memory leak issue when using this method.

3. Starting from Java 9, engineers changed the char[] field to a byte[] field and added a new property called coder, which is a flag for the encoding format.

Why did engineers make this modification?

We know that a char occupies 16 bits, or 2 bytes. Storing characters that fit in a single-byte encoding this way wastes half of the space. From JDK 9 on, the String class stores its characters in a byte array, whose elements occupy 8 bits (1 byte) each, in order to save memory.

The new coder property tells methods such as length() and indexOf() how to interpret that byte array. It has two possible values: 0 represents Latin-1 (single-byte encoding) and 1 represents UTF-16. If the string contains only Latin-1 characters, the value of coder is 0; otherwise, it is 1.
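To make the role of coder more concrete, here is a minimal sketch of the decision it records. The class and method names (CoderSketch, chooseCoder, payloadBytes) are hypothetical and exist only for illustration; this mimics the idea and is not the JDK’s internal code.

```java
class CoderSketch {
    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    // Picks LATIN1 only if every char fits into a single byte, as the text describes.
    static byte chooseCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Approximate size of the character data only (object headers are ignored).
    static int payloadBytes(String s) {
        return s.length() << chooseCoder(s);
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));        // 5
        System.out.println(payloadBytes("\u4f60\u597d")); // 4 (two non-Latin-1 chars)
    }
}
```

Under this scheme, text that is mostly ASCII or Latin-1 needs roughly half the character storage it needed with a char[].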

The Immutability of String Objects #

Now that we understand how String objects are implemented, have you noticed that the String class is marked with the final keyword, and that the char array field is marked final as well?

We know that a class marked with the final keyword cannot be inherited. Because the char[] field is both private and final, and String exposes no method that modifies it, the contents of a String object cannot be changed after it is created. This property is called the immutability of String objects.

What are the benefits of this approach in Java?

First, it ensures the security of String objects. If String objects were mutable, they could potentially be maliciously modified.

Second, it guarantees that the hash value never changes, so it can be computed once and cached, which makes String a reliable key in containers such as HashMap.

Third, it allows for a string constant pool. In Java, there are usually two ways to create String objects. One is through a string constant, like String str = "abc"; and the other is by using the new keyword, like String str = new String("abc").

When the first method is used to create a String object in the code, the JVM first checks if the object is in the string constant pool. If it is, the reference to that object is returned; otherwise, a new string will be created in the constant pool. This method reduces the creation of duplicate String objects with the same value, conserving memory.

In the case of String str = new String("abc"), the "abc" constant is written into the constant structure of the class file at compile time, and the "abc" string is created in the constant pool during class loading. Then, when new is executed, the JVM invokes the String constructor, which uses the "abc" string in the constant pool to create a new String object in heap memory. Finally, str references that heap object.

Here is a classic counterexample that you might think of.

When programming, it is common to assign the value “hello” to a String variable str, and then assign the value “world” to str, at which point the value of str becomes “world”. The value of str has indeed changed, so why do I still say that String objects are immutable?

First, let me explain the difference between an object and an object reference. Java beginners often confuse the two, especially those who transition from PHP to Java. In Java, == checks whether two references point to the same object, whereas comparing the values of two objects requires the equals method.

This is because str is merely a reference to a String object, not the object itself. The object is a block of memory, and str holds the address of that block. Therefore, in the example we just mentioned, the first assignment creates a “hello” object and makes str reference its address; the second assignment creates a new “world” object and makes str reference it instead, but the “hello” object still exists in memory.

In other words, str is not the object itself, but only an object reference. The actual object still exists in memory and has not been changed.
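The difference between rebinding a reference and changing an object can be seen in a short sketch. The class and method names (ReferenceVsObject, demo) are made up for this illustration.

```java
class ReferenceVsObject {
    // Returns {alias, str, extended} after the reassignment described above.
    static String[] demo() {
        String str = "hello";
        String alias = str;                  // second reference to the same "hello" object

        str = "world";                       // rebinds str; the "hello" object is untouched
        String extended = alias.concat("!"); // returns a new String; alias is unchanged

        return new String[] {alias, str, extended};
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", demo())); // hello, world, hello!
    }
}
```

Every String method that appears to modify a string, such as concat, in fact returns a new object and leaves the original as it was.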

Optimization of String Objects #

After understanding the implementation principles and characteristics of String objects, let’s combine them with practical scenarios to see how to optimize the use of String objects and what to pay attention to during the optimization process.

1. How to build a large string? #

String concatenation is common in programming. As I mentioned earlier, String objects are immutable. If we use the + operator to concatenate String objects, will it create multiple objects? For example, consider the following code:

    String str = "ab" + "cd" + "ef";

By analyzing the code, we can see that it first generates the object ab, then generates the object abcd, and finally generates the object abcdef. Theoretically, this code is inefficient.

However, in practice, we find that only one object is generated. Why is that? Did our theoretical judgment go wrong? Let’s take a look at the compiled code. You will find that the compiler automatically optimizes this line of code as follows:

    String str = "abcdef";
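One way to observe this folding without reading bytecode is to compare references, as in the sketch below. The class and method names are hypothetical. Comparing with == is used here only to reveal object identity; it is not how string values should normally be compared.

```java
class ConstantFolding {
    // All operands are literals, so javac folds them into the single constant "abcdef".
    static boolean literalsOnly() {
        String str = "ab" + "cd" + "ef";
        return str == "abcdef"; // same pooled object
    }

    // One operand is a non-final variable, so the concatenation happens at run time.
    static boolean withVariable() {
        String part = "ab";
        String str = part + "cd" + "ef";
        return str == "abcdef"; // a newly created object, not the pooled one
    }

    public static void main(String[] args) {
        System.out.println(literalsOnly());  // true
        System.out.println(withVariable());  // false
    }
}
```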

I have just introduced the concatenation of string constants. Now let’s take a look at the concatenation of string variables. Consider the following code:

    String str = "abcdef";

    for (int i = 0; i < 1000; i++) {
        str = str + i;
    }

If you decompile the compiled class, you can see that the compiler optimizes this code as well: it uses StringBuilder to concatenate the strings, which improves the efficiency of the program. (This is what javac produces up to Java 8; newer compilers use a different mechanism, but still create a new String on every iteration.)

    String str = "abcdef";

    for (int i = 0; i < 1000; i++) {
        str = (new StringBuilder(String.valueOf(str))).append(i).toString();
    }

In conclusion, even if the + operator is used for string concatenation, it can be optimized by the compiler into the StringBuilder approach. However, if you look more closely, you will find that in the optimized code generated by the compiler, a new StringBuilder instance is created for each iteration, which also reduces the performance of the system.

Therefore, when concatenating strings, I recommend that you explicitly use StringBuilder to improve system performance.
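As a minimal sketch of that recommendation, the loop above can be written with a single StringBuilder that is reused across all iterations. The class name ConcatLoop and the method build are illustrative only.

```java
class ConcatLoop {
    // One StringBuilder for the whole loop instead of a new one per iteration.
    static String build(int n) {
        StringBuilder sb = new StringBuilder("abcdef");
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString(); // a single String is created at the end
    }

    public static void main(String[] args) {
        System.out.println(build(5)); // abcdef01234
    }
}
```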

If the string being built is shared between threads and the concatenation must be thread-safe, you can use StringBuffer. Note, however, that StringBuffer achieves thread safety through synchronization, so under lock contention it performs slightly worse than StringBuilder.
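The following sketch shows the multi-threaded case under simple assumptions: several threads append to one shared StringBuffer, and because append is synchronized, no append is lost. The names SharedBufferDemo and appendFromThreads are hypothetical.

```java
class SharedBufferDemo {
    // Starts the given number of threads, each appending to the same StringBuffer,
    // and returns the final length of the buffer.
    static int appendFromThreads(int threads, int appendsPerThread) {
        StringBuffer shared = new StringBuffer();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < appendsPerThread; i++) {
                    shared.append('x'); // synchronized, so concurrent appends are safe
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return shared.length();
    }

    public static void main(String[] args) {
        System.out.println(appendFromThreads(4, 10_000)); // 40000
    }
}
```

With StringBuilder in place of StringBuffer, the same code would no longer be safe, because StringBuilder makes no guarantees under concurrent modification.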

2. How to use String.intern to save memory? #

After discussing string concatenation, let’s talk about the storage of String objects. Let’s take a look at a case.

Every time a status message is published on Twitter, location information is generated along with it. At Twitter’s user scale at the time, the servers needed 32GB of memory to store this location information.

    public class Location {
        private String city;
        private String region;
        private String countryCode;
        private double longitude;
        private double latitude;
    }

Considering that many users have overlapping information in their address, such as country, province, city, etc., this part of the information can be separated into a separate class to reduce duplication. Here is the code:

    public class SharedLocation {

        private String city;
        private String region;
        private String countryCode;
    }

    public class Location {

        private SharedLocation sharedLocation;
        private double longitude;
        private double latitude;
    }

After this optimization, the data storage size drops to around 20GB, which is still too much to hold in memory. What else can we do?

This case comes from a presentation by a Twitter engineer at the QCon global software development conference. The solution they came up with is to use String.intern to save memory space and optimize the storage of String objects.

The specific approach is to use the intern method of String when assigning values. If there is a string with the same value in the constant pool, the same object will be reused, and the object reference will be returned, so that the initial object can be garbage collected. This method can reduce the storage size of address information with high redundancy from 20GB to a few hundred megabytes.

    SharedLocation sharedLocation = new SharedLocation();

    // intern() each value so that equal strings share one pooled object
    sharedLocation.setCity(messageInfo.getCity().intern());
    sharedLocation.setRegion(messageInfo.getRegion().intern());
    sharedLocation.setCountryCode(messageInfo.getCountryCode().intern());

    // setter names below are assumed; the original snippet only showed "set"
    Location location = new Location();
    location.setSharedLocation(sharedLocation);
    location.setLongitude(messageInfo.getLongitude());
    location.setLatitude(messageInfo.getLatitude());
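Since messageInfo and the setters are not defined in the snippet above, here is a self-contained sketch of the same idea. InternDemo, the constructor-based SharedLocation, and the fresh helper (which simulates a string parsed from an incoming message) are assumptions made for illustration, not the code from the Twitter talk.

```java
class InternDemo {
    static final class SharedLocation {
        private final String city;
        private final String region;
        private final String countryCode;

        SharedLocation(String city, String region, String countryCode) {
            // intern() returns the canonical pooled instance for each value
            this.city = city.intern();
            this.region = region.intern();
            this.countryCode = countryCode.intern();
        }

        String getCity() {
            return city;
        }
    }

    // Simulates a value read from a message: a distinct heap object every time.
    static String fresh(String value) {
        return new String(value.toCharArray());
    }

    // True if two locations built from separate input strings share one city object.
    static boolean cityIsShared() {
        SharedLocation a = new SharedLocation(fresh("Shanghai"), fresh("Shanghai"), fresh("CN"));
        SharedLocation b = new SharedLocation(fresh("Shanghai"), fresh("Shanghai"), fresh("CN"));
        return a.getCity() == b.getCity();
    }

    public static void main(String[] args) {
        System.out.println(fresh("Shanghai") == fresh("Shanghai")); // false: two heap objects
        System.out.println(cityIsShared());                         // true: one pooled object
    }
}
```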

To better understand, let’s review the principle with a simple example:

    String a = new String("abc").intern();
    String b = new String("abc").intern();

    if (a == b) {
        System.out.print("a == b");
    }

Output:

    a == b

A string created from a string constant is placed in the constant pool automatically. A string created with new String("abc") involves two objects: the "abc" literal is created in the constant pool, and a separate String object with the same content is created in heap memory; it is the reference to the heap object that is returned.

When the intern method is called, it checks whether there is a string in the constant pool that is equal to the object. If there isn’t, the object is added to the constant pool and its reference is returned. If there is, the reference to the string in the constant pool is returned. The original object in the heap memory, because it is not referenced, will be garbage collected.

Now let’s look at the example we mentioned earlier.

When a is created, new String("abc") allocates an object in heap memory; the "abc" literal itself was already placed in the constant pool when the class was loaded. The intern call then checks whether the constant pool contains a string equal to this one. Had there been none, the string would have been added to the pool and its reference returned; here the pool already holds "abc", so intern returns the reference to the pooled object.

When creating the string variable b, an object is also created in the heap memory. At this time, the constant pool already has the string object, so it is not created again. When calling the intern method, it checks whether there is a string object in the constant pool that is equal to the string. It finds that there is already a string object equal to “abc”, so it directly returns its reference. The object in the heap memory, because it is not referenced, will be garbage collected. Therefore, a and b reference the same object.

Now let’s summarize the situation of string creation and memory allocation with a diagram:

Please note that you must weigh the actual scenario before using the intern method. The constant pool is implemented much like a HashTable, so the more strings it stores, the longer each lookup takes as its bucket chains grow. Interning too much data puts a heavy burden on the entire string constant pool.

3. How to use string splitting methods? #

Finally, I want to talk to you about string splitting, another commonly used operation in coding. The split() method uses regular expressions to implement its powerful splitting function, and the performance of regular expressions can be very unpredictable. Improper use may cause backtracking and high CPU usage.

Therefore, we should use the split() method with caution. In many cases, we can use the indexOf() method of String in place of split() to split a string. If split() is really necessary, pay attention to the backtracking problem when writing the pattern.
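As a rough sketch of the indexOf() approach, the helper below splits on a single delimiter character without involving regular expressions. The class IndexOfSplit and its split method are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    // Splits input on a single delimiter character using indexOf and substring only.
    static List<String> split(String input, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = input.indexOf(delimiter, start)) != -1) {
            parts.add(input.substring(start, end));
            start = end + 1; // continue after the delimiter
        }
        parts.add(input.substring(start)); // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,,c", ',')); // [a, b, , c]
    }
}
```

Unlike String.split with its default limit, this sketch keeps trailing empty fields, so check which behavior your input format expects.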

Summary #

In this lecture, we saw that optimizing how strings are used can improve the overall performance of a system. For the same reason, successive Java versions have kept changing the member variables of the String class to save memory space.

We also specifically mentioned the immutability of String objects, which allows the implementation of a string constant pool. By reducing the duplicate creation of string objects with the same value, further memory savings are achieved.

However, precisely because of this immutability, we should use StringBuilder explicitly when concatenating long strings. Finally, in terms of optimization, we can also use the intern method so that strings with the same value created at run time reuse a single object in the constant pool, thus saving memory.

Finally, I would like to share a personal observation. A dike can collapse from an ant hole. In daily programming, we often may not have a deep understanding of a small string and may not use it appropriately, which can lead to incidents in production.

For example, in my previous work experience, I once caused a concurrency bottleneck by using regular expressions to match strings, which can also be categorized as a performance issue with strings. In the upcoming lecture 04, I will provide a detailed analysis on this specific case.

Reflection questions #

After today’s lesson, do you know the answer to the interview question at the beginning of the article? What is the underlying principle behind it?

Interactive Moment #

Today, besides the thought-provoking questions, I also want to have a brief communication with you.

In the previous two lessons, I received a lot of comments, and I am very grateful for your support. Since the first two lessons were an overview, mainly to help you establish a general understanding of performance optimization, they were relatively theoretical and foundational. However, I found that many students have this urgent desire to quickly learn how to use troubleshooting tools, monitor and analyze performance, and solve current problems.

Here, I would like to share something special. Performance optimization is not only about learning to use troubleshooting and monitoring tools, but also about understanding the principles behind optimization. This way, you will not only be able to independently solve performance issues of the same kind, but also be able to write high-performance code. Therefore, I hope to provide you with the following learning path: Solidify the foundation - Combine theory with practice - Achieve advancement.