42 Applications of Kafka Streams in the Financial Domain #

Hello, I’m Hu Xi. Today I’d like to share with you the topic of applying Kafka Streams in the financial domain.

Background #

The financial domain covers a great deal of ground. Today, I will mainly share how big data technology, especially the Kafka Streams real-time computing framework, can help enterprises gain better insights into their users.

It is well known that customer acquisition costs in the financial domain are quite high; acquiring a high-net-worth white-collar customer in a first-tier city can easily cost thousands of yuan. Faced with such cost pressure, financial enterprises not only need to reduce customer acquisition costs through advertising, but also need fine-grained operations to maximize customer lifetime value (CLV).

One important way to maximize that value is better user insight, and user insight requires a deeper understanding of your customers. This is known as Know Your Customer (KYC): truly focusing on customers and continuously meeting their needs.

To achieve KYC, the traditional approach is to spend a lot of time meeting customers face to face to understand their situation. However, the data obtained this way is often not authentic: customers instinctively guard themselves, and it is hard to uncover their real demands through brief face-to-face conversations.

By contrast, the big data that permeates every aspect of people’s daily lives reflects customers’ actual needs: which websites they frequently browse, what they have purchased, what kinds of videos they like. These data points may seem scattered, but they all reflect customers’ most authentic preferences. By aggregating them, we can construct a complete picture of a customer, which is what user profiling technology does.

User Profile #

User profiles may sound mysterious, but in fact, you should be quite familiar with them. Many basic pieces of information about you, such as gender, age, industry, salary, and hobbies, are all part of a user profile. For example, we can describe someone like this: Mr. X, male, 28 years old, unmarried, with a salary ranging from 15,000 to 20,000 yuan, working as a big data development engineer, residing in the Tian Tong Yuan community in Beijing, often working overtime, and enjoying anime or games.

In fact, this series of descriptions is a typical user profile. Simply put, the core task of building a user profile is to tag customers or users; the descriptions above are representative of the labels in a profiling system. A user profiling system uses these tags to characterize customers for business staff, enabling precise marketing.

ID Mapping #

The benefits of user profiling are self-evident, and the more tags there are, the more comprehensively they represent a person. However, before assigning any specific tags, a user profiling system must first determine “who you are”. This is known as the ID identification problem.

ID, or Identification, refers to user identity representation on the internet. Common types of IDs that can identify user identity information include:

  • National ID card numbers: These are the most representative identifiers, since each ID number corresponds to exactly one person.
  • Phone numbers: Phone numbers generally represent identity well. Although one person may have multiple phone numbers, or one number may be used by different people over time, most internet applications still commonly use phone numbers to represent user identity.
  • Device IDs: In the mobile internet era, this mainly refers to the device IDs of mobile devices, such as mobile phone device IDs or the IDs of mobile terminals like a Mac or iPad. Mobile phone device IDs in particular can locate and identify users in many scenarios. Common examples are the IDFA on iOS devices and the IMEI on Android devices.
  • Application registration accounts: This is a relatively weak category of ID. People may register different accounts on different applications, but many reuse the same account name, which still provides some degree of association and identification.
  • Cookies: In the PC era, browser cookies were essential data, and they were one of the important means of representing user information on the internet. However, with the advent of the mobile internet era, cookies have become less and less prevalent, and their value as ID data has diminished. I personally even believe that cookies may be abandoned when building a new generation of user profiling systems based on the mobile internet.

When building a user profiling system, we continuously collect all kinds of personal data from multiple sources. Normally, a given piece of data carries only some of the ID information listed above. For example, browser history gives you cookie data, while user behavior collected from a specific app gives you the user’s device ID and registration account.

If these data all represent the same user, how can our user profiling system identify them? In other words, you need a means or technique to connect or map each ID. This is the ID mapping problem in the field of user profiling.

Real-time ID Mapping #

Let me give you a simple example. Suppose we have a financial management user named Zhang San. He first accessed a financial product on his iPhone, then registered an account for the same product on his Android phone, and finally logged into the account on his computer and made a purchase. ID Mapping is the process of aggregating user information from different platforms or devices and identifying all the associated ID information.

Real-time ID Mapping requires us to analyze the data collected from various devices in real-time and complete the ID Mapping process in a very short period of time. The shorter the time it takes to connect user IDs, the faster we can label them and extract more value from user profiling.

From the perspective of real-time computing or stream processing, real-time ID Mapping can be framed as a stream-table join problem: we need to join a stream with a table in real time.

Each event or message in the message stream contains certain information about an unknown user. This information can be the user’s visit record on a webpage or their purchase behavior data. These messages may include various types of ID information mentioned earlier, such as device IDs or registered account information in visit records, and identification information like ID cards or phone numbers in purchase behavior data.

The other side of the join, the table, contains all the ID information for each user. As the join progresses, the table will become more enriched with different categories of IDs. In other words, the data in the stream will be continuously supplemented into the table, achieving a complete mapping of all the user’s IDs.
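To make this concrete, here is a minimal sketch of how a stream-table join is expressed in Kafka Streams. The topic names and the merge function are placeholders for illustration only; the concrete version for our scenario appears later in this article.

StreamsBuilder builder = new StreamsBuilder();
// The stream: each record carries partial ID information about some user,
// keyed by a shared identifier such as the phone number.
KStream<String, String> events = builder.stream("user-events");
// The table: the latest known set of IDs for each user, keyed the same way.
KTable<String, String> users = builder.table("user-ids");
// For every event, look up the table entry with the same key and merge the two sides;
// merge() is a placeholder that combines the ID information from both records.
KStream<String, String> enriched = events.leftJoin(users, (eventInfo, knownIds) -> merge(eventInfo, knownIds));
enriched.to("user-ids-enriched");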

Implementation with Kafka Streams #

Now let’s take a look at how to use Kafka Streams to implement real-time ID mapping for a specific scenario. For ease of understanding, let’s assume that ID mapping only involves the ID card number, phone number, and device ID. Here is the schema written in Avro format:

{
  "namespace": "kafkalearn.userprofile.idmapping",
  "type": "record",
  "name": "IDMapping",
  "fields": [
    {"name": "deviceId", "type": "string"},
    {"name": "idCard", "type": "string"},
    {"name": "phone", "type": "string"}
  ]
}

By the way, Avro is a serialization mechanism commonly used in the Java and big data ecosystems. Compared with storing objects directly as JSON or XML, Avro greatly reduces disk space and network I/O, so it is widely used for transferring large volumes of data.
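For reference, running the Avro code generator on this schema yields a Java class roughly equivalent to the simplified stand-in below (the real generated class carries extra Avro serialization plumbing). The sketches later in this article additionally assume JSON conversion helpers such as fromJson/toJson on this class, which could be implemented with Gson or Jackson.

// A simplified, hand-written stand-in for the class generated from the Avro schema above.
public class IDMapping {
    private String deviceId;
    private String idCard;
    private String phone;

    public String getDeviceId() { return deviceId; }
    public void setDeviceId(String deviceId) { this.deviceId = deviceId; }
    public String getIdCard() { return idCard; }
    public void setIdCard(String idCard) { this.idCard = idCard; }
    public String getPhone() { return phone; }
    public void setPhone(String phone) { this.phone = phone; }
}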

In this scenario, we need two Kafka topics, one for constructing a table and the other for building a stream. The message format for both topics is the IDMapping object mentioned above.

When a new user fills in their phone number to register an app, a message will be sent to the first topic. All subsequent records of the user’s interactions on the app will also be sent as messages to the second topic. It’s worth noting that messages sent to the second topic may carry other ID information, such as phone number or device ID. As I just mentioned, this is a typical real-time stream-table join scenario. Once connected, we can complete all user data and achieve ID mapping.
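For illustration, the two kinds of messages might look like this (the field values below are made up). A registration message sent to the first topic carries only the phone number:

{"deviceId": "", "idCard": "", "phone": "13800000000"}

A later behavior message sent to the second topic from the same user’s phone carries a device ID as well:

{"deviceId": "6D92078A-8246-4BA4-AE5B-76104861E7DC", "idCard": "", "phone": "13800000000"}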

Based on this design idea, I will first provide the complete Kafka Streams code, and then I will explain the key parts in detail:

package kafkalearn.userprofile.idmapping;

// imports omitted

public class IDMappingStreams {


    public static void main(String[] args) throws Exception {

        if (args.length < 1) {
            throw new IllegalArgumentException("Must specify the path for a configuration file.");
        }

        IDMappingStreams instance = new IDMappingStreams();
        Properties envProps = instance.loadProperties(args[0]);
        Properties streamProps = instance.buildStreamsProperties(envProps);
        Topology topology = instance.buildTopology(envProps);

        instance.createTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        // Attach shutdown handler to catch Control-C.
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Throwable e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Properties loadProperties(String propertyFilePath) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(propertyFilePath)) {
            envProps.load(input);
            return envProps;
        }
    }

    private Properties buildStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

    // The buildTopology and createTopics methods are not shown in this excerpt;
    // a sketch of buildTopology appears after the explanation below.
}

In this Java class code, the most important method is the buildTopology function. It constructs all the logic to establish the ID Mapping.

In this method, we first create a StreamsBuilder instance, which is the first step in building any Kafka Streams application. Then, we read the configuration file to get the names of all the Kafka topics we want to read and write. In this example, we need four topics, whose roles are as follows (a sample configuration file is sketched after this list):

  • streamTopic: This topic stores various behavior data that occurs after a user logs into the app. The data is in the format of JSON strings representing the IDMapping object. You might ask, didn’t we create Avro schema files earlier? Why are we using JSON here? The reason is that the community version of Kafka does not provide support for Avro serialization/deserialization classes. If I want to use Avro, I have to switch to Confluent’s Kafka, which deviates from the original intention of this column to introduce Apache Kafka. Therefore, I’m still using JSON for illustration purposes. Here, I only use the Avro Code Generator to provide the set/get methods for the fields of the IDMapping object. You can also use Lombok.
  • rekeyedTopic: This is an intermediate topic that holds the registration messages re-keyed by phone number: the phone number is extracted and used as the message key, while the message value is kept unchanged.
  • tableTopic: This topic stores the phone number filled in by users when they register for the app. We need to use this topic to construct the table data for the join operation.
  • outputTopic: This topic stores the output information after the join operation, i.e., the IDMapping object that connects all the user’s ID data. After converting it to JSON, we write it to this topic.
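For reference, the configuration file passed on the command line might look like the following. The application.id and bootstrap.servers keys match what buildStreamsProperties reads above; the topic-name keys are assumptions used only in the sketch later in this article.

# sample idmapping.properties (topic-name keys are assumed for illustration)
application.id=idmapping-streams
bootstrap.servers=localhost:9092
stream.topic.name=streamTopic
rekeyed.topic.name=rekeyedTopic
table.topic.name=tableTopic
output.topic.name=outputTopic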

The first step in buildTopology is to construct the table, i.e., the KTable object. We modify the initial message stream to use the phone number of user registration as the key and construct an intermediate stream. We then write this stream to the rekeyedTopic and finally use the builder.table method to construct the KTable. This way, whenever there is a new user registration, the KTable will have a new record.

With the table constructed, we continue to build the message stream to encapsulate the behavior data of users after logging into the app. We similarly extract the phone number as the key for the join operation and use the leftJoin method of KStream to associate it with the KTable object from the previous step.

During the association process, we extract the information from both sides and try to supplement the generated IDMapping object as much as possible. Then, we return this generated IDMapping instance to the new stream. Finally, we write it to the outputTopic for storage.
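Since the code listing above is truncated before buildTopology, here is a minimal sketch of what that method might look like, following the steps just described. It belongs inside the IDMappingStreams class; the topic-name property keys and the IDMapping.fromJson/toJson helpers (e.g. backed by Gson or Jackson) are assumptions for illustration, not the author’s original code.

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final String streamTopic = envProps.getProperty("stream.topic.name");
        final String rekeyedTopic = envProps.getProperty("rekeyed.topic.name");
        final String tableTopic = envProps.getProperty("table.topic.name");
        final String outputTopic = envProps.getProperty("output.topic.name");

        // Step 1: build the table. Drop registration messages without a phone number,
        // re-key the rest by phone number, write them to the intermediate rekeyedTopic,
        // and read that topic back as a KTable keyed by phone number.
        builder.<String, String>stream(tableTopic)
               .filter((noKey, json) -> IDMapping.fromJson(json).getPhone() != null)
               .selectKey((noKey, json) -> IDMapping.fromJson(json).getPhone())
               .to(rekeyedTopic);
        final KTable<String, String> table = builder.table(rekeyedTopic);

        // Step 2: build the behavior stream, key it by phone number as well,
        // left-join it with the table, and write the merged result to outputTopic.
        builder.<String, String>stream(streamTopic)
               .filter((noKey, json) -> IDMapping.fromJson(json).getPhone() != null)
               .selectKey((noKey, json) -> IDMapping.fromJson(json).getPhone())
               .leftJoin(table, IDMappingStreams::merge)
               .to(outputTopic);

        return builder.build();
    }

    // Combine the two sides of the join: keep the stream record's fields and fill in
    // anything missing from the table record (which may be null in a left join).
    private static String merge(String streamJson, String tableJson) {
        IDMapping merged = IDMapping.fromJson(streamJson);
        if (tableJson == null) {
            return merged.toJson();
        }
        IDMapping fromTable = IDMapping.fromJson(tableJson);
        if (merged.getIdCard() == null || merged.getIdCard().isEmpty()) {
            merged.setIdCard(fromTable.getIdCard());
        }
        if (merged.getPhone() == null || merged.getPhone().isEmpty()) {
            merged.setPhone(fromTable.getPhone());
        }
        if (merged.getDeviceId() == null || merged.getDeviceId().isEmpty()) {
            merged.setDeviceId(fromTable.getDeviceId());
        }
        return merged.toJson();
    }

Because selectKey changes the record key, Kafka Streams automatically repartitions the stream before the join, so records with the same phone number land on the same partition and can be matched against the KTable.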

Thus, with less than 200 lines of Java code, we have implemented a real-time ID Mapping task for a real scenario. In theory, you can continue to expand this example to any number of ID mappings, even with other tags. The connection principle remains the same. In my project, I used Kafka Streams to help me implement some features of a user profile system, and ID mapping is one of them.

Summary #

Alright, let’s summarize. Today, I presented a use case of Kafka Streams in the financial industry, focusing on demonstrating how to perform real-time joins between streams and tables using the join function. In fact, Kafka Streams offers much more functionality than this. I recommend you read the tutorials on the official website and consider implementing some of your lightweight real-time computing tasks using Kafka Streams.

Open Discussion #

Finally, let’s discuss a question. In the example above, why do you think I used the leftJoin method instead of the join method? (Hint: You can compare left join and inner join in SQL.)

Feel free to write down your thoughts and answers. Let’s discuss together. If you found it helpful, feel free to share this article with your friends.