03 Serialization - How do objects get transmitted over the network #

Hello, I am He Xiaofeng. In the previous lecture, I explained how to design a scalable and backward compatible protocol in an RPC framework. The key point is to utilize the extension fields in the header and payload to achieve backward compatibility.

Following up on the previous lecture, today I will explain serialization in the RPC framework. It is extremely important to choose the right serialization method in different scenarios to improve the overall stability and performance of the RPC framework.

Why do we need serialization? #

First, we need to understand what serialization and deserialization are.

Let’s start by reviewing the content of the RPC principles discussed in [Lesson 01]. When describing the RPC communication process, I mentioned:

Data transmitted over the network must be binary, but the caller’s input and output parameters are objects. Objects cannot be sent over the network directly, so we must first convert them into binary, and the conversion algorithm must be reversible; this process is generally called “serialization”. On the other side, the service provider can correctly split the different requests out of the binary data and, based on the request type and serialization type, restore the binary message body into request objects; this process is called “deserialization”.

These two processes are illustrated in the following diagram:

In summary, serialization is the process of converting an object into binary data, while deserialization is the reverse process of converting binary data back into an object.

So why does an RPC framework need serialization? Let’s recall the RPC communication process:

To help you understand, let’s use the analogy of mailing a package. Suppose we need to send an item that the recipient will assemble themselves: the sender first disassembles it before shipping, which is like serialization. To keep it from being damaged in transit, it then has to be packed into a box, which is like encoding the serialized data into a fixed protocol format. A couple of days later, the recipient opens the box and reassembles the item, which is like protocol decoding and deserialization.

So now it’s clear, right? Since data transmitted over the network must be in binary format, in an RPC call, serialization and deserialization of the input parameter objects and return value objects are necessary processes.

What are the commonly used serialization methods? #

Does this process sound simple to you? It is actually quite complex. Let’s take a look at the commonly used serialization methods; I will briefly introduce several of them.

JDK Native Serialization #

If you are familiar with Java development, you must know about JDK’s native serialization. Here is an example of JDK serialization:

import java.io.*;

public class Student implements Serializable {
    private int no;
    private String name;

    public int getNo() {
        return no;
    }

    public void setNo(int no) {
        this.no = no;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Student{" +
                "no=" + no +
                ", name='" + name + '\'' +
                '}';
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        String home = System.getProperty("user.home");
        String basePath = home + "/Desktop";
        // Serialize the student object to a file on the desktop
        FileOutputStream fos = new FileOutputStream(basePath + "/student.dat");
        Student student = new Student();
        student.setNo(100);
        student.setName("TEST_STUDENT");
        ObjectOutputStream oos = new ObjectOutputStream(fos);
        oos.writeObject(student);
        oos.flush();
        oos.close();

        // Deserialize the student object back from the same file
        FileInputStream fis = new FileInputStream(basePath + "/student.dat");
        ObjectInputStream ois = new ObjectInputStream(fis);
        Student deStudent = (Student) ois.readObject();
        ois.close();

        System.out.println(deStudent);
    }
}

As we can see, JDK’s native serialization mechanism is very simple for the user. Serialization is implemented by ObjectOutputStream, while deserialization is implemented by ObjectInputStream.

So how does JDK serialization work? Let’s take a look at the following diagram:

During serialization, special delimiters are continuously written along with the object data; during deserialization, these delimiters are used to split the stream back into its parts.

  • The header is used to declare the serialization protocol and version, for backward compatibility between high and low versions.
  • The object data mainly includes the class name, signature, attribute names, attribute types, and attribute values, along with begin and end markers. Apart from the attribute values, which are the actual object data, everything else is metadata used for deserialization.
  • When there are object references or inheritance, the “write object” logic is applied recursively.

In fact, the core idea of any serialization framework is the same: design a serialization protocol that writes the object’s type, attribute types, and attribute values one by one into a binary byte stream in a fixed format to complete serialization, and then reads the type, attribute types, and attribute values back in that same fixed format to rebuild a new object, completing deserialization.
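
To make this concrete, here is a minimal, hand-rolled sketch of that idea for the Student class above. It is a toy format for illustration only, not how the JDK (or any real framework) actually lays out its bytes:

import java.io.*;

public class NaiveSerializer {

    // Write the object "metadata first, values second" in a fixed order
    public static byte[] serialize(Student s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("Student");   // type metadata
        out.writeUTF("no");        // attribute name
        out.writeInt(s.getNo());   // attribute value
        out.writeUTF("name");
        out.writeUTF(s.getName());
        out.flush();
        return bos.toByteArray();
    }

    // Read back in exactly the same fixed order to rebuild a new object
    public static Student deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        if (!"Student".equals(in.readUTF())) {
            throw new IOException("Unexpected type");
        }
        Student s = new Student();
        in.readUTF();              // skip attribute name "no"
        s.setNo(in.readInt());
        in.readUTF();              // skip attribute name "name"
        s.setName(in.readUTF());
        return s;
    }
}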

JSON #

JSON is probably the serialization format you are most familiar with. It is a typical key-value format that carries no explicit type information, and it is a text-based serialization framework. There is plenty of material online about JSON’s format and features, so I won’t go into detail here.

It is widely used: whether for Ajax calls in frontend web development, storing text-based data on disk, or communication in RPC frameworks built on the HTTP protocol, JSON is often the preferred serialization format.

However, there are two issues to be aware of when using JSON for serialization:

  • JSON serialization carries extra space overhead, which translates into significant memory and disk cost for services that handle large volumes of data.
  • JSON carries no type information, while languages like Java are strongly typed and need reflection to reconstruct objects, so performance is not ideal.

So if you choose JSON serialization for an RPC framework, the data exchanged between the service provider and the service consumer should be relatively small; otherwise it will significantly hurt performance.
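
Despite these caveats, JSON is straightforward to use. Here is a minimal sketch for the same Student class, assuming the Jackson library (the lecture does not prescribe any particular JSON library):

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        Student student = new Student();
        student.setNo(102);
        student.setName("JSON");

        // Serialize: object -> JSON text. Note that the field names are repeated
        // as keys, which is part of JSON's extra space overhead.
        String json = mapper.writeValueAsString(student);
        System.out.println(json); // e.g. {"no":102,"name":"JSON"}

        // Deserialize: JSON text -> object, resolved via reflection on the target class
        Student deStudent = mapper.readValue(json, Student.class);
        System.out.println(deStudent);
    }
}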

Hessian #

Hessian is a dynamic, cross-language, binary serialization framework. Its protocol is more compact than JDK or JSON serialization, performs better, and produces a smaller byte stream.

Here is example code for using Hessian:

import com.caucho.hessian.io.Hessian2Input;
import com.caucho.hessian.io.Hessian2Output;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

Student student = new Student();
student.setNo(101);
student.setName("HESSIAN");

// Convert the student object to a byte array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
Hessian2Output output = new Hessian2Output(bos);
output.writeObject(student);
output.flushBuffer();
byte[] data = bos.toByteArray();
bos.close();

// Convert the previously serialized byte array to a student object
ByteArrayInputStream bis = new ByteArrayInputStream(data);
Hessian2Input input = new Hessian2Input(bis);
Student deStudent = (Student) input.readObject();
input.close();

System.out.println(deStudent);

Compared to JDK and JSON, Hessian is more efficient and generates smaller byte sizes. Therefore, Hessian is more suitable as a serialization protocol for RPC framework remote communication.

However, Hessian has some issues of its own. The official version does not support certain common Java object types well, such as:

  • Linked collections, such as LinkedHashMap and LinkedHashSet, which can be worked around by extending the CollectionDeserializer class;
  • The Locale class, which can be worked around by extending the ContextSerializerFactory class;
  • Byte and Short values, which come back as Integer after deserialization.

In practice, pay extra attention to these cases.

Protobuf #

Protobuf is Google’s internal cross-language data standard. It is a lightweight and efficient format for structured data storage and can be used to serialize structured data; it supports languages such as Java, Python, C++, and Go. To use Protobuf, you define an IDL (Interface Description Language) file and then use each language’s IDL compiler to generate serialization utility classes. Its advantages are:

  • The serialized size is much smaller compared to JSON and Hessian.
  • The IDL describes the semantics clearly, which helps ensure that type information is not lost between applications, without the need for an XML parser.
  • Serialization and deserialization are fast and do not require obtaining types through reflection.
  • Message formats can be upgraded smoothly and compatibly, providing backward compatibility.

Here is example code for using Protobuf:

// IDL file format
syntax = "proto3";
option java_package = "com.test";
option java_outer_classname = "StudentProtobuf";

message StudentMsg {
  // Student number
  int32 no = 1;
  // Name
  string name = 2;
}

// Java code that uses the StudentProtobuf class generated from the IDL above
StudentProtobuf.StudentMsg.Builder builder = StudentProtobuf.StudentMsg.newBuilder();
builder.setNo(103);
builder.setName("protobuf");

// Convert the student object to a byte array
StudentProtobuf.StudentMsg msg = builder.build();
byte[] data = msg.toByteArray();

// Convert the previously serialized byte array to a student object
StudentProtobuf.StudentMsg deStudent = StudentProtobuf.StudentMsg.parseFrom(data);

System.out.println(deStudent);
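
The backward compatibility mentioned earlier comes from the numbered field tags: fields are identified by tag number rather than by name or position, so a new field can be added under a new tag number without breaking old readers or old data. A hypothetical extension of the IDL above:

// Hypothetical evolution of the IDL: an old reader simply skips the unknown
// "age" field, and a new reader sees its default value (0) when it is absent.
message StudentMsg {
  // Student number
  int32 no = 1;
  // Name
  string name = 2;
  // Newly added field; existing tag numbers must never be reused
  int32 age = 3;
}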

Protobuf is highly efficient, but for languages with reflection and dynamic capabilities, using it this way is cumbersome and not as convenient as Hessian. In Java, for example, you can skip the precompilation step altogether by using Protostuff instead.

Protostuff does not depend on IDL files; it can serialize and deserialize Java domain objects directly. It is about as efficient as Protobuf and produces the same binary format, so it can be thought of as a Java version of the Protobuf serialization framework. However, I have run into some scenarios it does not support, which I’d like to share with you:

  • It does not support null values.
  • Protostuff does not support stand-alone Map and List collection objects; they need to be wrapped in an object.
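
For reference, here is a minimal Protostuff sketch for the same Student class. It assumes the protostuff-core and protostuff-runtime dependencies are on the classpath; treat it as an illustration rather than a definitive setup:

import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;

public class ProtostuffDemo {
    public static void main(String[] args) {
        Student student = new Student();
        student.setNo(104);
        student.setName("PROTOSTUFF");

        // Build a schema for the plain Java class at runtime; no IDL or code generation needed
        Schema<Student> schema = RuntimeSchema.getSchema(Student.class);
        LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);

        // Serialize the student object to a byte array
        byte[] data;
        try {
            data = ProtostuffIOUtil.toByteArray(student, schema, buffer);
        } finally {
            buffer.clear();
        }

        // Deserialize the byte array back into a new student object
        Student deStudent = schema.newMessage();
        ProtostuffIOUtil.mergeFrom(data, deStudent, schema);

        System.out.println(deStudent);
    }
}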

How to Choose Serialization in RPC Frameworks? #

I just briefly introduced several serialization protocols, but there are actually more options available, such as MessagePack and Kryo. So, when faced with so many serialization protocols, how do we choose in RPC frameworks?

One factor you may consider is performance and efficiency, which is indeed a very important aspect. As mentioned earlier, serialization and deserialization are necessary processes in RPC calls, so the performance and efficiency of serialization and deserialization directly affect the overall performance and efficiency of the RPC framework.

What else do you think of?

Yes, there is also space overhead, which refers to the size of the binary data after serialization. The smaller the size of the serialized byte data, the smaller the amount of data transmitted over the network, and the faster the data transmission speed. As RPC involves remote calls, the network transmission speed directly affects the response time of requests.

Now, think again, what other factors can affect our choice?

That’s right: the generality and compatibility of the serialization protocol. In day-to-day RPC operations, serialization problems are probably the ones we encounter and have to deal with most often. Business teams report them all the time: a collection-type input parameter that suddenly cannot be parsed, a call that fails after an extra property is added to the input parameter class, or a serialization exception that appears after upgrading the RPC framework version…

When choosing a serialization method, generality and compatibility take priority over efficiency, performance, and the size of the serialized data, because they directly affect the stability and availability of service calls; for a service, reliability clearly matters more than raw performance. We first look at whether the serialization protocol stays compatible across version upgrades, supports a wide range of object types, works across platforms and languages, and is widely used, meaning many pitfalls have already been hit and fixed by others. Only then do we weigh performance, efficiency, and space overhead.

There is one point I want to emphasize. Besides generality and compatibility, the security of the serialization method is also a very important consideration, and arguably it should come first. JDK native serialization, for example, has known deserialization vulnerabilities; if the serialization method has security holes, online services can easily be compromised.
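
As a concrete illustration of the JDK case, since Java 9 you can attach a deserialization filter (JEP 290) so that only expected classes are accepted. This is a minimal sketch, and the package name com.test is just an assumed application package:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

public class SafeDeserialization {

    // Accept application classes under com.test plus JDK classes from the
    // java.base module, and reject everything else; this blocks the gadget
    // classes used in known Java deserialization exploits.
    public static Object readSafely(byte[] data) throws IOException, ClassNotFoundException {
        ObjectInputFilter filter =
                ObjectInputFilter.Config.createFilter("com.test.*;java.base/*;!*");
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            ois.setObjectInputFilter(filter);
            return ois.readObject();
        }
    }
}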

Taking into account the above reference factors, let’s summarize these serialization protocols.

Our first choice is still Hessian and Protobuf because they meet our requirements in terms of performance, time overhead, space overhead, generality, compatibility, and security. Hessian is more convenient to use and has better object compatibility, while Protobuf is more efficient and has advantages in generality.

What issues should be noted when using RPC frameworks? #

Now that we have learned how to choose serialization in RPC frameworks, what serialization issues should we pay attention to during the usage process?

As I mentioned earlier, serialization is the most common source of problems I have run into while operating RPC frameworks. Apart from defects in early versions of the RPC frameworks themselves, most problems are caused by improper use. Next, let’s go through the human-induced problems that come up most often.

Complex object construction: The object has many properties and multiple layers of nesting; for example, object A references object B, object B aggregates object C, and object C in turn references or aggregates many other objects. The more complex an object’s dependencies, the more performance and CPU are wasted serializing and deserializing it, which seriously hurts the overall performance of the RPC framework. Moreover, the more complex the object, the more likely serialization and deserialization are to go wrong.

Oversized objects: I often get questions from users about why their RPC requests keep timing out, and the investigation shows that their input parameters are huge, for example a large List or Map whose serialized byte length runs into the megabytes. This also wastes a great deal of performance and CPU: serializing such a large object takes a long time, which directly drives up the request’s response time.

Using input parameter classes that the serialization framework does not support: For example, Hessian does not natively support classes like LinkedHashMap and LinkedHashSet, and in most cases it is best not to use third-party collection classes such as those in Guava. Most open-source serialization frameworks prioritize the programming language’s native types, so if an input parameter is a collection, stick to the native, most commonly used collection classes such as HashMap and ArrayList, as in the sketch below.
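
A simple defensive habit that follows from this: copy third-party or unusual collection types into plain native ones at the RPC boundary, before the call is made. The helper names below are purely illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RpcBoundary {

    // Copy whatever collection the business code produced (a Guava ImmutableList,
    // a LinkedHashMap, etc.) into the plain types that serialization frameworks
    // support best before passing it as an RPC input parameter.
    public static <T> List<T> toNativeList(List<T> source) {
        return new ArrayList<>(source);
    }

    public static <K, V> Map<K, V> toNativeMap(Map<K, V> source) {
        return new HashMap<>(source);
    }
}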

Objects with complex inheritance relationships: Most serialization frameworks serialize an object’s properties one by one, and when there is inheritance they keep walking up to the parent classes and traversing their properties too. As with the first problem above, the more complex the object relationships, the more performance is wasted and the easier it is to run into serialization problems.

In the usage process of RPC frameworks, we should try to build simple objects as input and return value objects to avoid the above problems.

Summary #

Today we have delved into what serialization is and introduced several commonly used serialization methods such as JDK native serialization, JSON, Hessian, and Protobuf.

In addition to this foundational knowledge, we have focused on how to choose a serialization protocol in RPC frameworks. There are several important factors to consider, with priority given to security, generality, and compatibility, followed by performance, efficiency, and space overhead.

Ultimately, this is because the stability and reliability of service invocation are more important than the performance and response time of the service. In addition, for RPC calls, the most time-consuming and performance-intensive operations are mostly the ones performed by the service provider in executing business logic. In this case, the impact of serialization overhead on the overall cost of the service is relatively small.

In the process of using RPC frameworks, when constructing input and return value objects, keep the following points in mind:

  1. Objects should be as simple as possible, with minimal dependency relationships and a minimal number of attributes. Strive for high cohesion.
  2. The size of input and return value objects should not be too large; in particular, avoid passing large collections.
  3. Try to use simple, commonly used, and native objects in the development language, especially when it comes to collection classes.
  4. Avoid complex inheritance relationships in objects; ideally, do not use parent-child class hierarchies at all.

In practice, although RPC frameworks let us make remote calls as if they were local, during transmission the input and return value objects fundamentally exist to carry information. To improve the overall performance and stability of RPC calls, it is crucial to keep these objects as simple as possible.

Reflection after Class #

When selecting a serialization framework for an RPC framework, what other factors do you think need to be considered? What other excellent serialization frameworks do you know, and are they suitable for use in RPC calls?

Feel free to leave a comment and share your answers and experiences with me. You are also welcome to share this article with your friends and invite them to join in the learning. See you in the next class!