05 What’s the difference between String, StringBuffer and StringBuilder #

Today, I’m going to talk about strings, which we use constantly in daily programming. Although they seem simple, strings are a special entity in almost every programming language, because they make up a large share of the data in most applications, both in number and in size.

Today’s question is: What’s the difference between String, StringBuffer, and StringBuilder in Java?

Typical Answer #

String is a fundamental class in the Java language that provides the basic logic for constructing and managing strings. It is a typical immutable class: it is declared final, and the field holding its character data is final as well. Because of this immutability, operations such as concatenating and truncating strings produce new String objects. Since string operations are ubiquitous, their efficiency often has a significant impact on application performance.
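
As a small illustration of this immutability, here is a minimal sketch (the class and method names are hypothetical, chosen only for this example). Both substring and concat return new String objects and leave the original untouched:

```java
final class ImmutableStringDemo {
    // Returns a shortened copy; the argument itself is never modified.
    static String truncateAndAppend(String s) {
        String head = s.substring(0, 3); // new String with the first three chars
        return head.concat("...");       // another new String
    }

    public static void main(String[] args) {
        String original = "hello";
        String result = truncateAndAppend(original);
        System.out.println(original); // hello
        System.out.println(result);   // hel...
    }
}
```

Each call here creates an intermediate object, which is exactly the cost that grows when such operations are repeated many times.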

StringBuffer was introduced to solve the problem mentioned above, namely the excessive creation of intermediate objects during concatenation. Its append and insert methods add strings to the end of an existing sequence or at a specified position. StringBuffer is essentially a thread-safe, mutable character sequence. The thread safety comes with additional performance overhead, so unless thread safety is actually needed, its successor, StringBuilder, is recommended.

StringBuilder was added in Java 1.5. It has essentially the same capabilities as StringBuffer but drops the synchronization, which reduces the overhead. It is the preferred choice for string concatenation in the vast majority of cases.
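
The following sketch shows the append and insert calls described above on both classes (the class and method names are made up for illustration). The calls are identical; only the synchronization differs:

```java
final class AppendInsertDemo {
    // Single-threaded use: StringBuilder, no synchronization cost.
    static String withBuilder() {
        StringBuilder sb = new StringBuilder();
        sb.append("world");     // world
        sb.insert(0, "hello "); // hello world
        sb.append('!');         // hello world!
        return sb.toString();
    }

    // The same calls on StringBuffer, whose methods are synchronized.
    static String withBuffer() {
        StringBuffer sb = new StringBuffer();
        sb.append("world");
        sb.insert(0, "hello ");
        sb.append('!');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBuilder());
        System.out.println(withBuffer());
    }
}
```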

Analysis of the Key Points #

Almost all application development relies on string manipulation. Understanding the design and implementation of strings, as well as related tools such as concatenation, is very helpful in writing high-quality code. The answer above is only a general overview. At the very least, you need to know that String is immutable, that improper string manipulation may create a large number of temporary strings, and that the three classes differ in thread safety.

If we go deeper, the interviewer can examine various aspects, such as:

Examining basic thread-safety design and implementation, as well as general programming practices, through String and related classes.
Examining your understanding of the JVM object cache mechanism and how to use it effectively.
Examining some of the techniques the JVM uses to optimize Java code.
Examining the evolution of String and related classes, such as the significant changes made in Java 9.
…

For each of the above aspects, I will discuss them in detail in the “Knowledge Expansion” section.

Knowledge Expansion #

Considerations for String Design and Implementation

I mentioned earlier that String is a typical implementation of an immutable class. By design it is inherently thread-safe, because its internal data cannot be modified. This convenience even shows up in the copy constructor: an immutable object does not need to copy its data when it is copied.

Let’s take a look at some details of the StringBuffer implementation. Its thread safety is achieved simply by adding the synchronized keyword to the methods that modify data. This simple and direct approach is well suited to ordinary thread-safe classes, and there is usually no need to worry about the performance cost of synchronized. As some people say, “Premature optimization is the root of all evil.” Reliability, correctness, and code readability are the most important factors in most application development.
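
To make the idea concrete, here is a minimal sketch of the same approach applied to a small class of our own. SynchronizedLog is a hypothetical name, not a JDK class; it simply wraps a StringBuilder and marks every method that touches the shared data as synchronized:

```java
final class SynchronizedLog {
    private final StringBuilder buffer = new StringBuilder();

    // Every method that reads or modifies the shared buffer is synchronized.
    public synchronized SynchronizedLog append(String s) {
        buffer.append(s);
        return this;
    }

    public synchronized int length() {
        return buffer.length();
    }

    // Helper for the demo: several threads append concurrently.
    static int appendFromThreads(int threads, int perThread) {
        SynchronizedLog log = new SynchronizedLog();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    log.append("x");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return log.length();
    }

    public static void main(String[] args) {
        System.out.println(appendFromThreads(4, 1000)); // 4000
    }
}
```

Without the synchronized keyword on append, concurrent calls could lose updates or corrupt the builder's internal state.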

To make character sequences modifiable, both StringBuffer and StringBuilder use a modifiable array (char, and from JDK 9 onwards, byte) as their underlying data structure. Both inherit from the AbstractStringBuilder class, which contains the basic operations; the only difference is whether the resulting methods are synchronized.

Furthermore, how large should this internal array be? If it is too small, a larger array has to be allocated during concatenation; if it is too large, space is wasted. The current implementation sets the initial size of the array to the length of the initial string plus 16 (so if no initial string is provided when constructing the object, the initial size is 16). If we know that many concatenations will occur and the final size is roughly predictable, we can specify an appropriate capacity up front to avoid the cost of resizing the array many times. Resizing is costly in several ways: the original array is discarded, a new array is created (roughly a multiple of the original size), and an arraycopy operation is performed.
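
The sizing rule can be observed directly through capacity(). The sketch below also shows one way to presize a builder when the total length is known in advance (CapacityDemo and its method names are illustrative only):

```java
final class CapacityDemo {
    // Capacity of a builder created without an initial string.
    static int defaultCapacity() {
        return new StringBuilder().capacity();
    }

    // Capacity of a builder created from an initial string.
    static int capacityFor(String initial) {
        return new StringBuilder(initial).capacity();
    }

    // Presize the builder so that no resize is needed while appending.
    static String join(String... parts) {
        int total = 0;
        for (String p : parts) {
            total += p.length();
        }
        StringBuilder sb = new StringBuilder(total);
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(defaultCapacity());      // 16
        System.out.println(capacityFor("hello"));   // 21
        System.out.println(join("aa", "bb", "cc")); // aabbcc
    }
}
```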

Given all this, how should we choose when writing actual code?

If there is no thread safety issue, should we use StringBuilder for all concatenation operations? After all, code written this way takes more typing and is less readable. The following comparison makes the difference obvious:

String strByBuilder = new StringBuilder()
        .append("aa").append("bb").append("cc").append("dd")
        .toString();

String strByConcat = "aa" + "bb" + "cc" + "dd";

In fact, in most cases, there is no need to worry too much. We should trust that Java is quite intelligent.

Let’s conduct an experiment with the following piece of code:

public class StringConcat {
    public static String concat(String str) {
        return str + "aa" + "bb";
    }
}

Compile it with different JDK versions, and then disassemble the class file with javap:

${JAVA_HOME}/bin/javac StringConcat.java
${JAVA_HOME}/bin/javap -v StringConcat.class

The output fragment for JDK 8 is:

         0: new           #2                  // class java/lang/StringBuilder
         3: dup
         4: invokespecial #3                  // Method java/lang/StringBuilder."<init>":()V
         7: aload_0
         8: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        11: ldc           #5                  // String aa
        13: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        16: ldc           #6                  // String bb
        18: invokevirtual #4                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        21: invokevirtual #7                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;

In JDK 9, however, the disassembled output looks quite different:

         // concat method
         1: invokedynamic #2,  0              // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;)Ljava/lang/String;
     
         // ...
         // Actually, it utilizes MethodHandle to unify the entry point
         0: #15 REF_invokeStatic java/lang/invoke/StringConcatFactory.makeConcatWithConstants:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/invoke/CallSite;

You can see that in JDK 8, javac automatically converts non-constant string concatenation into StringBuilder operations. JDK 9 reflects a change in approach: it uses invokedynamic to decouple the string concatenation optimization from the bytecode generated by javac. If the runtime implementation is improved in the future, no changes to javac will be required.

In everyday programming, ensuring program readability and maintainability is often more important than so-called optimal performance. You can choose the specific coding method based on actual needs.

String Caching

By a rough estimate based on heap dumps of common applications, about 25% of objects on average are strings, and about half of those are duplicates. Avoiding the creation of duplicate strings can effectively reduce memory consumption and object creation overhead.

String provides the intern() method, which asks the JVM to cache the corresponding string for reuse. When we call intern() on a string object, the cached instance is returned if an equal string is already in the cache; otherwise, the string is added to the cache. Generally speaking, the JVM caches all string literals such as “abc” and other string constants.
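
A small sketch of this behavior (InternDemo is a hypothetical class name): a string built at runtime is a separate object from the cached literal, while intern() returns the cached instance.

```java
final class InternDemo {
    // Compares a literal with an equal string created at runtime.
    static boolean[] compare() {
        String literal = "abc";                                // cached literal
        String built = new String(new char[] {'a', 'b', 'c'}); // distinct object
        return new boolean[] {
            literal == built,         // false: different objects
            literal.equals(built),    // true: same contents
            literal == built.intern() // true: intern() returns the cached instance
        };
    }

    public static void main(String[] args) {
        for (boolean b : compare()) {
            System.out.println(b);
        }
    }
}
```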

It looks pretty good, right? But the actual situation may surprise you. It is generally not recommended to use intern extensively in older versions of Java like Java 6. Why? The devil lies in the details. The cached strings are stored in the so-called PermGen, the notorious “permanent generation”. This space is very limited and is basically not taken care of by garbage collection other than FullGC. Therefore, if used improperly, OOM may occur.

In subsequent versions, this cache was moved to the heap, which largely avoids the problem of PermGen exhaustion, and in JDK 8 PermGen itself was replaced by Metaspace. The default cache size has also grown over time: it was initially 1009 and was changed to 60013 after 7u40. You can print the actual statistics with the following parameter; try it with your own JDK.

-XX:+PrintStringTableStatistics

You can also use the following JVM parameter to adjust the size manually, but in most cases, it is not necessary to adjust unless you are sure that its size has affected the operational efficiency.

-XX:StringTableSize=N

Intern is an explicit deduplication mechanism, but it has drawbacks. Developers must call it explicitly in their code, and doing so for every instance is cumbersome. It is also hard to guarantee that it pays off, because the degree of string duplication is difficult to predict during development. Some people consider it a practice that pollutes the code.

Fortunately, starting from Oracle JDK 8u20, a new feature was introduced: string deduplication under G1 GC. It makes strings with identical contents point to the same underlying data. This is a low-level change in the JVM and requires no modifications to the Java class library.

Note that this feature is currently disabled by default. You need to enable it using the following parameter, and remember to specify using G1 GC:

-XX:+UseStringDeduplication

The aspects mentioned above are only some of the optimizations Java applies to strings at the lower levels. At runtime, some basic string operations directly use the intrinsic mechanism inside the JVM: they often run specially optimized native code rather than code generated from the Java bytecode. An intrinsic can be understood as logic hard-coded in native form, and many of these optimizations rely on specific CPU instructions. You can find the relevant intrinsic definitions by searching for “string” in the related source code. You can also start your application with the following parameters to see the state of intrinsics:

-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
    // Sample output fragment  
        180    3       3       java.lang.String::charAt (25 bytes)  
                                  @ 1   java.lang.String::isLatin1 (19 bytes)   
                                  ...  
                                  @ 7 java.lang.StringUTF16::getChar (60 bytes) intrinsic

As you can see, even just the implementation of strings requires a great deal of effort and dedication from Java platform engineers and scientists. Many of the conveniences we enjoy are derived from these optimizations.

In the later topics on JVM and performance, I will provide a detailed introduction to some of the methods for internal optimization in the JVM. If you are interested, you can further delve into these topics. Even if you are not involved in JVM development or have not yet needed to perform specific performance optimizations, this knowledge can help you deepen your technical understanding.

Evolution of String

If you have looked closely at Java strings across versions, you will have noticed that earlier versions stored them in char arrays, which is very straightforward. However, a Java char occupies two bytes, which is unnecessary for languages that primarily use Latin characters, so this uniform implementation wastes space. Density is an eternal topic for programming language platforms, because ultimately most tasks involve working with data.

In fact, as early as Java 6, Oracle JDK introduced the feature of compressing strings. However, this feature was not open source, and in practice, it exposed some problems. As a result, it has been removed in the latest JDK versions.

In Java 9, we introduced the design of Compact Strings, which is a major improvement to strings. The data storage format was changed from a char array to a byte array along with a coder to signify the encoding. Additionally, all related string manipulation classes were modified. Furthermore, all related Intrinsics were rewritten to ensure that there is no performance loss.
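
The idea of a byte array plus a coder can be sketched with a toy model. CompactText and all of its members are hypothetical names used only for illustration; this is not the JDK's actual String code, just a sketch of the storage scheme under the assumption that text fitting in Latin-1 takes one byte per character and everything else takes two:

```java
import java.nio.charset.StandardCharsets;

// Toy model only: not the JDK's String implementation.
final class CompactText {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    final byte[] value;
    final byte coder;

    CompactText(String s) {
        if (fitsInLatin1(s)) {
            value = s.getBytes(StandardCharsets.ISO_8859_1);
            coder = LATIN1;
        } else {
            value = s.getBytes(StandardCharsets.UTF_16BE);
            coder = UTF16;
        }
    }

    private static boolean fitsInLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;     // big-endian: high byte first
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

In this model, "hello" occupies 5 bytes instead of the 10 a char array would need, while text outside Latin-1 still takes two bytes per char.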

Although there has been a significant change in the underlying implementation, the behavior of Java strings has not changed significantly. Therefore, this feature is transparent to most applications, and there is usually no need to modify existing code.

Of course, in extreme cases, there have been some limitations to the capabilities of strings, such as the maximum size of a string. You can think about how the original implementation with char arrays limited the maximum length of strings to the length of the array itself. However, with the replacement of char arrays with byte arrays, the storage capacity has halved for the same array length! Fortunately, this is a theoretical limit that has not been observed to affect real-world applications.

In general performance testing and product experiments, we can clearly see the advantages of compact strings, such as smaller memory usage and faster operations.

Today, I started from the differences between String, StringBuffer, and StringBuilder and analyzed their main design and implementation characteristics. I discussed the string cache mechanism known as intern, the non-intrusive deduplication at the virtual machine level, and the Compact Strings improvement in Java 9, and I briefly touched on the intrinsic optimization mechanism at the JVM’s lower levels. From a practical perspective, both Compact Strings and the low-level intrinsic optimizations demonstrate the advantage of using Java’s basic class libraries: these libraries tend to receive the most thorough, highest-quality optimization, and simply by upgrading the JDK version you can enjoy these benefits at zero cost.

Practice Exercise #

Have you grasped the topic we discussed today? Due to limited space, we didn’t have time to discuss many character-related issues, such as encoding. You can consider this: many string operations, such as getBytes() or [String(byte[] bytes)](https://docs.oracle.com/javase/9/docs/api/java/lang/String.html#String-byte:A-), implicitly use the platform’s default encoding. Is this a good practice? Does it help avoid garbled characters?
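
As a starting point for experimenting, here is a hedged sketch that names the charset explicitly instead of relying on the platform default. CharsetDemo and its methods are illustrative names, and the mismatched decode only simulates what can happen when the encoding and decoding sides disagree:

```java
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    // Encode and decode with the same explicit charset.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as ISO-8859-1: the result is garbled.
    static String decodeWithWrongCharset(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d"; // two Chinese characters
        System.out.println(roundTrip(text).equals(text));              // true
        System.out.println(decodeWithWrongCharset(text).equals(text)); // false
    }
}
```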
