13 Storage Optimization Part 2: How to Optimize Data Storage #

In the previous issue, I defined storage as “transforming data of a specific structure into another format that can be recorded and restored.”

Let’s review the six key elements of data storage: correctness, time overhead, space overhead, security, development cost, and compatibility. No solution can be best on all of them at once; so-called data storage optimization means achieving the best results on the one or several elements that matter most for your use case.

More broadly, I believe data storage is not necessarily about writing data to disk. Keeping data in memory or transmitting it over a network can also be considered a form of storage, and we can refer to this process as object or data serialization.

Most of us developers do not have the energy to “create” a data serialization format of our own. So today I will mainly talk about how to choose among the serialization methods commonly used in Android.

Object Serialization #

Objects in an application are stored in memory. If we want to store an object or transmit it over a network, we need to use object serialization and deserialization.

Object serialization converts all of an object’s information into a byte sequence, including its class information, inheritance relationships, access modifiers, variable types, and field values.

1. Serializable

Serializable is the native serialization mechanism in Java and is widely used in Android. We can use Serializable to persistently store objects or pass Serializable serialized data through a Bundle.
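To make the round trip concrete, here is a minimal, self-contained sketch; the User class is hypothetical, and the in-memory byte stream stands in for a file or a Bundle:

```java
import java.io.*;

// A hypothetical serializable class; Serializable itself is just a marker interface.
class User implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    User(String name) { this.name = name; }
}

public class SerializableDemo {
    public static void main(String[] args) throws Exception {
        // Serialize into an in-memory byte sequence (a file stream works the same way).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new User("alice"));
        }
        // Deserialize the byte sequence back into an equivalent object.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            User restored = (User) in.readObject();
            System.out.println(restored.name); // prints "alice"
        }
    }
}
```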

Principle of Serializable

Serializable is implemented through ObjectInputStream and ObjectOutputStream. Taking Android 6.0 source code as an example, you can see part of the implementation of ObjectOutputStream:

private void writeFieldValues(Object obj, ObjectStreamClass classDesc) {
    for (ObjectStreamField fieldDesc : classDesc.fields()) {
        ...
        // Each field is looked up via reflection before its value is written.
        Field field = classDesc.checkAndGetReflectionField(fieldDesc);
        ...
    }
}

The entire serialization process relies heavily on reflection and creates many temporary variables. And serializing an object means serializing not just the object itself but, recursively, every other object it references.

The whole process is therefore computationally expensive; the heavy use of reflection and the garbage it produces make serialization slow. In addition, because the serialized output must carry so much metadata, it is much larger than the Class file itself, which in turn causes I/O overhead when reading and writing.

Advanced Usage of Serializable

Since Serializable performs poorly, what advantages does it have? Many students may not know that it has some advanced uses. You can refer to the article “5 Things You Didn’t Know About Java Object Serialization” for more information.

  • writeObject and readObject methods. Serializable supports replacing the default serialization flow: it first checks via reflection whether we have implemented a writeObject serialization method or a readObject deserialization method. Through these two methods we can handle certain fields specially, for example to encrypt them during serialization.

  • writeReplace and readResolve methods. These two methods act as proxies for the serialized object, letting us customize the instance that is written or returned. What are they good for? We can use them to achieve version compatibility in object serialization: for example, the readResolve method can convert a deserialized old-version object into a new-version one.

The calling process of serialization and deserialization with Serializable is as follows:

// Serialization
E/test:SerializableTestData writeReplace
E/test:SerializableTestData writeObject

// Deserialization
E/test:SerializableTestData readObject
E/test:SerializableTestData readResolve
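This order can be reproduced with a small self-contained sketch that wires up all four hooks; the SerializableTestData class and its logging list are purely illustrative:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// A hypothetical class that logs the order in which the four hooks are invoked.
class SerializableTestData implements Serializable {
    private static final long serialVersionUID = 1L;
    static final List<String> calls = new ArrayList<>(); // call-order log
    int value = 42;

    // Called first during serialization; may substitute another instance.
    private Object writeReplace() { calls.add("writeReplace"); return this; }

    // Replaces the default write flow; fields could be encrypted here.
    private void writeObject(ObjectOutputStream out) throws IOException {
        calls.add("writeObject");
        out.defaultWriteObject();
    }

    // Replaces the default read flow.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        calls.add("readObject");
        in.defaultReadObject();
    }

    // Called last; may return a different (e.g. version-upgraded) instance.
    private Object readResolve() { calls.add("readResolve"); return this; }
}

public class HookOrderDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new SerializableTestData());
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject();
        }
        System.out.println(SerializableTestData.calls);
        // prints "[writeReplace, writeObject, readObject, readResolve]"
    }
}
```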

Considerations for Serializable

Although Serializable is very simple to use, there are some considerations for fields.

  • Fields that are not serialized. The default serialization mechanism ignores static variables and variables declared transient, and does not store them. Of course, we can still persist such fields ourselves through the advanced methods described above.

  • serialVersionUID. When a class implements the Serializable interface, we should add a serialVersionUID, which acts as the version number of the class. The ID can be declared explicitly or computed automatically by the serialization runtime from the class structure. I usually recommend declaring it explicitly for better stability, because with the implicit value even a slight change to the class leads to an InvalidClassException.

  • Constructors. Serializable’s default deserialization does not execute the constructor; it creates the object directly from the description in the data stream. If any logic depends on the constructor having run, for example a field that is assigned only there, problems may occur. Again, this can be fixed with the advanced methods above through custom deserialization.

2. Parcelable

Due to the low performance of Java’s Serializable, Android needs to design a lightweight and efficient object serialization and deserialization mechanism. Parcelable was born against this background, and its core purpose is to solve the performance problems of inter-process communication in Android.

Permanent storage of Parcelable

The principle of Parcelable is very simple, and its core implementation is in Parcel.cpp.

There is a big difference between Parcel serialization and Java’s Serializable: Parcelable performs its serialization operations entirely in memory and does not itself store data on disk.

Of course, we can also obtain byte arrays through the marshall interface of Parcel.java and store them in files to achieve permanent storage of Parcelable.

// Returns the raw bytes of the parcel.
public final byte[] marshall() {
    return nativeMarshall(mNativePtr);
}
// Set the bytes in data to be the raw bytes of this Parcel.
public final void unmarshall(byte[] data, int offset, int length) {
    nativeUnmarshall(mNativePtr, data, offset, length);
}

Points to note about Parcelable

Weighing time overhead against development cost, the Parcelable mechanism puts performance first.

It therefore requires us to write the read and write code by hand, making it more cumbersome to use than Serializable. But precisely because of this, Parcelable needs no reflection to serialize and deserialize.

Although Parcelable’s permanent storage can be achieved through clever methods, it also has two issues.

  • Compatibility of system versions. Since the original intention of Parcelable is to be used in memory, we cannot guarantee that the implementations of Parcel.cpp are completely consistent across all Android versions. If there are differences in the implementations of different system versions, or if vendors modify the implementations, problems may occur.

  • Compatibility of data before and after. Parcelable does not have a version management design. If the version of our class is upgraded, special attention needs to be paid to the compatibility of the write order and field types, which also brings a high maintenance cost.

Generally speaking, if persistent storage is required, we still have to choose the Serializable solution with lower performance.

3. Serial

As programmers, we will certainly pursue perfection. Is there a better performance solution that can solve these pain points?

In fact, almost every large company will have its own set of serialization solutions. In this column, I recommend the high-performance serialization solution Serial open-sourced by Twitter. But is it really high-performance? We can compare it with the previous two solutions.

From the data in the chart, Serial has great advantages in terms of the time and file size of serialization and deserialization.

From the implementation principle, Serial is like a solution that combines the advantages of Parcelable and Serializable.

  • Compared to traditional reflection-based serialization solutions, it is more efficient because it does not use reflection. You can refer to the test data above for details.

  • Developers have strong control over the serialization process and can define which Objects and Fields need to be serialized.

  • It has strong debugging capabilities and can debug the serialization process.

  • It has strong version management capabilities and can achieve compatibility through version numbers and OptionalFieldExceptions.

Data Serialization #

Although Serial performs well, there is still a lot of information to record when serializing objects. When operations are frequent, it can have a significant impact on the application. In this case, we can choose to use data serialization.

1. JSON

JSON is a lightweight data interchange format widely used in network transmission. Many applications use JSON for communication with the server.

JSON has many unique advantages:

  • Compared to object serialization solutions, it is faster and smaller in size.

  • Compared to binary serialization solutions, the results are readable and easy to troubleshoot.

  • It is easy to use, supports cross-platform and cross-language usage, and supports nested references.

Since virtually every application uses JSON, every major company has its own JSON library. For example, Android comes with a JSON library, Google has Gson, Alibaba has Fastjson, and Meituan has MSON.

Each custom JSON solution mainly optimizes in the following two areas:

  • Convenience. For example, supporting JSON to JavaBean object conversion, supporting annotations, and supporting more data types.

  • Performance. Reducing reflection, reducing memory and CPU usage during serialization, especially when dealing with large amounts of data or deep nesting, the effect can be quite significant.

When dealing with smaller amounts of data, the built-in JSON library has some advantages. However, as the amount of data increases, the gap gradually widens. Overall, Gson has the best compatibility, and its performance is comparable to Fastjson in general. However, when dealing with extremely large amounts of data, Fastjson performs better.

2. Protocol Buffers

Compared to object serialization solutions, JSON is indeed faster and smaller. However, to keep the intermediate result human-readable, JSON does not use binary encoding or compression, so its performance still has room to grow.

If the application has a large amount of data or requires higher performance, Protocol Buffers is an excellent choice. It is a cross-language encoding protocol open-sourced by Google, and almost all of Google’s internal RPCs use this protocol.

Now let’s summarize the advantages and disadvantages of Protocol Buffers.

  • Performance: Protocol Buffers use binary encoding compression, resulting in smaller size and faster encoding and decoding speed compared to JSON. For those interested, you can refer to the Protocol Buffers encoding rules.

  • Compatibility: It has good cross-language and forward/backward compatibility, and also supports automatic conversion of basic types. However, it does not support inheritance and reference types.

  • Development Cost: Protocol Buffers have a high development cost, requiring the definition of .proto files and using tools to generate corresponding auxiliary classes. These auxiliary classes have some serialization methods. All objects that need to be serialized must be converted to objects of the auxiliary classes. This makes the serialization code tightly coupled with business code, which is a more intrusive approach.
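For illustration, a minimal proto3 definition might look like the following; the message and field names are hypothetical, and protoc generates the corresponding Java auxiliary class from it:

```proto
// A hypothetical message definition; protoc generates the auxiliary class.
syntax = "proto3";

message TestProto {
  int32 id = 1;              // field numbers, not names, go on the wire
  string name = 2;
  repeated string tags = 3;  // no inheritance or references, only composition
}
```

Running something like protoc --java_out=. test.proto then produces the generated classes that all business objects must be converted into before serialization, which is exactly the coupling cost mentioned above.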

For Android, using the official Protocol Buffers can lead to a large number of generated methods. We can modify its automatic code generation tool. For example, in WeChat, each generated class file from .proto will only contain one method, which is the op method.

public class TestProtocal extends  com.tencent.mm.protocal.protobuf {
    @Override
    protected final int op(int opCode, Object ...objs) throws IOException {
        if (opCode == OPCODE_WRITEFIELDS) {
           ... 
        } else if (opCode == OPCODE_COMPUTESIZE) {
           ...
        }
    }
}

Google later introduced FlatBuffers, which achieves an even higher compression ratio; you can refer to “Experience with FlatBuffers” for its usage. Finally, let me compare the three serialization solutions, object serialization, JSON, and Protocol Buffers, against the “six key elements.”

Storage Monitoring #

Through local experiments, we can compare the performance of different file storage methods. However, the laboratory environment may not truly reflect the actual usage scenarios of users. Therefore, we also need to establish comprehensive monitoring for storage. What should be monitored?

1. Performance Monitoring

For the six key factors of correctness, time overhead, space overhead, security, development cost, and compatibility, I am more concerned about the following aspects in the production environment:

  • Correctness

As I mentioned in the 9th issue of this column, application programs, file systems, or disks can all cause file corruption.

In the production environment, I hope to monitor the failure rate of the storage module. In the previous issue, I also mentioned that the failure rate of SharedPreferences is about one in ten thousand, while the failure rate of our internally-developed SharedPreferences is about one in a hundred thousand. How do we define a file as corrupt? For the system’s SharedPreferences, we define corruption as the file size becoming zero. As for our internally-developed SharedPreferences, there will be dedicated validation fields in the file header, such as file length and CRC information at key positions, which can identify more file corruption scenarios. After identifying file corruption, we can further perform data repair and other operations.
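As an illustration of the idea (this is not WeChat’s actual file format), here is a sketch of a file header carrying the payload length and a CRC-32 checksum; it catches corruption that a simple zero-size check would miss:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// A simplified sketch: the file starts with an 8-byte header storing the
// payload length and its CRC-32 checksum, checked again on every read.
public class CheckedFile {
    public static byte[] wrap(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(8 + payload.length)
                .putInt(payload.length)        // declared length
                .putInt((int) crc.getValue())  // checksum of the payload
                .put(payload)
                .array();
    }

    // Returns the payload, or null if the file is detected as corrupt.
    public static byte[] unwrap(byte[] file) {
        if (file.length < 8) return null;                   // truncated header
        ByteBuffer buf = ByteBuffer.wrap(file);
        int length = buf.getInt();
        int storedCrc = buf.getInt();
        if (length != file.length - 8) return null;         // length mismatch
        byte[] payload = new byte[length];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != storedCrc) return null; // bit-level damage
        return payload;
    }

    public static void main(String[] args) {
        byte[] file = wrap("hello".getBytes());
        System.out.println(unwrap(file) != null); // prints "true": healthy file
        file[8] ^= 0x7f;                          // flip bits in the payload
        System.out.println(unwrap(file) == null); // prints "true": corruption caught
    }
}
```

After a file fails validation like this, we can go on to repair or rebuild the data as described above.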

  • Time Overhead

The time consumed by the storage module is also of great concern to me. In the production environment, time overhead monitoring can be divided into initialization time and read/write time. The emphasis may vary for each storage module. For example, for the storage module used during the startup process, we may want the initialization to be faster.

Taking the system’s SharedPreferences as an example again: during initialization it has to read and parse the entire file, and if the content exceeds 1000 items, initialization may take 50 to 100 ms. Another internally-developed storage module of ours supports random read/write, so its initialization time is unaffected by the number of stored items; even with tens of thousands of entries, initialization takes less than 1 ms.

  • Space Overhead

Space usage includes memory space and ROM space. Usually, in order to improve performance, we adopt the approach of exchanging space for time. For memory space, we need to consider garbage collection (GC) and peak memory usage, as well as the possibility of out-of-memory (OOM) situations when dealing with large amounts of data. For ROM space, we need to consider implementing cleaning logic, such as triggering automatic cleaning or data consolidation when the data exceeds 1000 items or 10MB.

2. ROM Monitoring

In addition to monitoring specific storage modules, we also need detailed monitoring of the application’s overall ROM usage. Why? Because of two problems I have run into repeatedly.

We used to receive negative feedback from users frequently: why does WeChat occupy more than 2GB of ROM space? Is it because of a large database or some other reason? At that time, we were a bit unsure. Once, we found a bug in the production environment that caused a certain configuration to be downloaded repeatedly. As a result, the same content might have been downloaded by a user thousands of times.

download_1 download_2 download_3 ....

In the production environment, we sometimes find janks or ANRs while traversing a specific folder. As I mentioned in the 10th issue of this column, the time cost of file traversal is related to the number of files in the folder. We once had a bug that left tens of thousands of files in one folder; traversing it would simply restart users’ phones. Note that starting from API level 26, it is recommended to traverse files with a FileVisitor (via Files.walkFileTree) instead of File.listFiles(), which performs much better overall.
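For example, a FileVisitor-based traversal (part of java.nio.file, available on Android since API level 26) can count files and total bytes in a single pass:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

// Counting files and total size with a FileVisitor instead of File.listFiles().
public class DirScanner {
    long fileCount = 0;
    long totalBytes = 0;

    void scan(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                fileCount++;
                totalBytes += attrs.size(); // size comes with the visit, no extra stat
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // skip unreadable entries
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("scan-demo");
        Files.write(root.resolve("a.bin"), new byte[100]);
        Files.write(root.resolve("b.bin"), new byte[50]);
        DirScanner scanner = new DirScanner();
        scanner.scan(root);
        System.out.println(scanner.fileCount + " files, " + scanner.totalBytes + " bytes");
        // prints "2 files, 150 bytes"
    }
}
```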

There are two core indicators for ROM monitoring: the total size of files and the total number of files. For example, we can define the proportion of users whose total file size exceeds 400MB as the space anomaly rate, and the proportion of users with more than 1000 files as the quantity anomaly rate. This way, we can continuously monitor the storage situation of the application in the production environment.

However, monitoring is only the first step; the key is discovering problems quickly. Similar to the jank tree, we can construct a storage tree for each user and aggregate the trees in the background. A user’s complete storage tree can be enormous, so some pruning is needed: for example, keep only the top 3 folders and 5 files per folder, and keep a certain randomness among those 5 files so that not every user uploads the same content.
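A simplified sketch of one such pruning pass over a single level of the tree; the class and method names are hypothetical, and the limits of 3 folders and 5 files follow the text above:

```java
import java.util.*;

// A hypothetical pruning pass: keep only the top 3 child folders by size,
// and pick files with randomness so different users report different entries.
public class StorageTreePruner {
    public static List<Map.Entry<String, Long>> topFolders(Map<String, Long> folderSizes) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(folderSizes.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // largest first
        return sorted.subList(0, Math.min(3, sorted.size()));
    }

    public static List<String> sampleFiles(List<String> files, Random random) {
        List<String> copy = new ArrayList<>(files);
        Collections.shuffle(copy, random); // randomized so uploads differ per user
        return copy.subList(0, Math.min(5, copy.size()));
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("cache", 400L);
        sizes.put("image", 300L);
        sizes.put("log", 200L);
        sizes.put("tmp", 100L);
        System.out.println(topFolders(sizes)); // the 3 largest folders, largest first
        System.out.println(sampleFiles(
                Arrays.asList("a", "b", "c", "d", "e", "f"), new Random()));
    }
}
```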

Alongside monitoring, we also need remote-control capabilities. When a user complains, we can pull that user’s complete storage tree in real time. For storage problems discovered in production, we can dynamically deliver cleanup rules, such as automatically cleaning a cache folder once it exceeds 200MB, or deleting leftover historical files.

Summary #

Different applications may have different priorities when it comes to optimizing storage. For small applications, development cost may be the most important factor, and we prioritize development efficiency. For mature applications, performance becomes more important. Therefore, when choosing a storage solution, it is necessary to analyze the specific problems based on the stage of the application and its usage scenario.

Whether it is optimizing the performance of a specific storage solution or the overall ROM storage of the application, we may pay less attention to storage monitoring. However, if there is a problem in this area, it can greatly affect the user experience. For example, we know that WeChat occupies a significant amount of ROM space. To address this issue, we have introduced a storage cleaning feature specifically for it. Furthermore, in scenarios where ROM space is insufficient, a prompt will appear to guide the user.

Homework #

Today’s homework is about choosing an object serialization and data serialization scheme for your application. What are your thoughts and experiences regarding data storage? Please share your scheme and ideas in the comment section and discuss them with other classmates.

Feel free to click “Please invite a friend to read” to share today’s content with your friends and invite them to learn together. Don’t forget to submit today’s homework in the comment section. I have prepared a generous “learning encouragement gift package” for students who complete the homework seriously. Looking forward to progressing together with you.