014 WSDM 2018 Paper Review - How the JD.com Team Mines Substitute and Complementary Product Information #

This week, we will review several papers from WSDM. On Monday, we shared an article from the Google team, which focused on using click models to more effectively estimate position bias and learn better ranking algorithms.

Today, we will introduce the best student paper from WSDM 2018, entitled "A Path-constrained Framework for Discriminating Substitutable and Complementary Products in E-commerce." This paper comes from JD's Data Science Laboratory.

Author Information #

All the authors of this paper are from JD Data Science Laboratory. Here we give a brief introduction to some of the main authors.

The third author, Zhaochun Ren, currently serves as a Senior Research and Development Manager at JD Data Science Laboratory. He received his Ph.D. in Computer Science from the University of Amsterdam in 2016, under the supervision of renowned information retrieval expert Maarten de Rijke. Zhaochun Ren has published several papers on various topics such as information retrieval, text summarization, and recommendation systems in international conferences and journals.

The fourth author, Jiliang Tang, is currently an Assistant Professor at Michigan State University. He obtained his Ph.D. in Computer Science from Arizona State University in 2015, under the guidance of renowned data mining expert Professor Huan Liu. He joined Michigan State University in 2016, previously working as a scientist at Yahoo Research. Jiliang Tang is a rising star in the field of data mining and has already published more than 70 papers, with over 4,000 citations.

The last author, Dawei Yin, is currently the Senior Director of the JD Data Science Laboratory. He joined JD in 2016, having previously held various positions at Yahoo Research, including research scientist and senior manager. Dawei Yin obtained his Ph.D. in Computer Science from Lehigh University in 2013, under the supervision of information retrieval expert Professor Brian D. Davison, and has published many high-quality research works. Dawei Yin and the author of this column were labmates during their doctoral studies and colleagues during their time at Yahoo Research.

Main Contributions of the Paper #

First, let’s take a look at the main contributions of this paper and clarify what problem it solves in a particular scenario.

For industrial-scale recommendation systems, generating recommendation results usually involves two steps. The first step generates a candidate set: from a massive number of items, it selects the several hundred to several thousand products a user might purchase. The second step ranks all products in the candidate set using more complex machine learning models.

This paper primarily explores how to generate candidate products more effectively, specifically how to generate “substitutes” and “complements” to enrich the user’s purchasing experience.

So, what are substitutes and complements?

According to the paper's definition, substitutes are products that users regard as interchangeable, while complements are products that users tend to purchase together. Mining these relationships is significant not only for generating candidate sets but also for recommendations in certain scenarios, such as suggesting complementary products to a user right after they have purchased an item.

Although substitutes and complements are important sources of recommendations for Internet e-commerce, there is little literature and there are few established methods for effectively mining these two types of relationships. A further major challenge is data "sparsity": since a substitute or complement relationship involves at least two products, in a massive product catalog the vast majority of product pairs have never been considered together or purchased together. Overcoming this sparsity is therefore a central difficulty.

On the other hand, product roles are complex and context-dependent: the same product may act as a substitute in some situations and as a complement in others. How to mine these roles from a complex chain of user behavior thus becomes a difficult problem. Many traditional methods take a static view of this problem and cannot fully exploit the potential relationships among products.

In summary, the paper makes two important contributions. First, the authors propose a "multi-relation" learning framework for mining substitutes and complements. Second, to address data sparsity, they introduce two kinds of "path constraints" to distinguish substitutes from complements. The authors validate the effectiveness of both ideas on real-world data.

Core Method of the Paper #

The first step of the proposed method is to learn product representations from the relationships between products. At this stage, the paper does not differentiate between substitutes and complements; the representations are learned with a method similar to Word2Vec.

In other words, any two products that are connected, whether by a substitute relationship or a complement relationship, are treated as positively correlated, while all other product pairs are treated as negatively correlated. We can then apply the Word2Vec idea to learn representation vectors for products, such that the dot products between positively correlated products are high while the dot products between negatively correlated products are low. This step is essentially Word2Vec applied to a set of products.
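This Word2Vec-style step can be sketched as skip-gram with negative sampling over product pairs. Below is a toy NumPy sketch; the pair data, embedding dimension, learning rate, and number of negatives are all illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, dim = 100, 16
emb = rng.normal(scale=0.1, size=(n_products, dim))  # one vector per product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(a, b, negatives, lr=0.05):
    """One skip-gram-with-negative-sampling update: pull the connected
    pair (a, b) together, push sampled unrelated products away from a."""
    # positive pair: increase sigmoid(emb[a] . emb[b])
    g = 1.0 - sigmoid(emb[a] @ emb[b])
    da, db = g * emb[b].copy(), g * emb[a].copy()
    emb[a] += lr * da
    emb[b] += lr * db
    # negative samples: decrease sigmoid(emb[a] . emb[n])
    for n in negatives:
        g = -sigmoid(emb[a] @ emb[n])
        dn = g * emb[a].copy()
        emb[a] += lr * g * emb[n]
        emb[n] += lr * dn

# suppose products 0 and 1 are connected (substitute or complement)
for _ in range(200):
    sgns_step(0, 1, negatives=rng.integers(2, n_products, size=5))

# the connected pair now scores higher than an unrelated pair
print(emb[0] @ emb[1] > emb[0] @ emb[50])
```

The dot product between the trained pair grows with each positive update, while sampled negatives are pushed apart, which is exactly the "positively correlated high, negatively correlated low" behavior described above.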

The representation obtained in the first step is a relatively general, global representation of each product. As mentioned earlier, products may exhibit different attributes in different situations, so we need to characterize each product differently depending on the scenario. The approach taken in this paper is to give each product a separate representation for each type of relationship. This relationship-specific representation is obtained by "projecting" the global representation learned earlier onto the specific relationship; what needs to be learned here is a projection vector.
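One simple way to realize such a projection, in the spirit of TransH-style relation hyperplanes, is to learn a normal vector per relation and project the global embedding onto that relation's hyperplane. This is an illustrative assumption on my part, not necessarily the paper's exact formula:

```python
import numpy as np

def project(e, w):
    """Project a global item embedding e onto the hyperplane whose
    normal is the (learned) relation vector w: e_r = e - (w . e) w."""
    w = w / np.linalg.norm(w)
    return e - (w @ e) * w

e = np.array([1.0, 2.0, 3.0])             # global product representation
w_substitute = np.array([1.0, 0.0, 0.0])  # hypothetical relation vector
print(project(e, w_substitute))           # → [0. 2. 3.]
```

The projection keeps only the components of the global representation that are relevant to the given relation, so the same product can look different under the "substitute" and "complement" relations.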

The third step is to model substitute relationships and complement relationships. The paper uses a relatively uncommon technique called "fuzzy logic" to express constraint relationships between products. We do not need a complete understanding of fuzzy logic here; it suffices to know that it is a technique for converting "hard" logical constraints (Hard Constraints) into "soft" constraints (Soft Constraints) expressed probabilistically.

In this paper, the authors focus on using a series of rules to alleviate data scarcity, specifically by encoding human observations about how substitute and complement relationships behave.

For example, if Product A is a substitute for Product B, it is very likely that A's category is a substitute for B's category. As another example, if B is a substitute for A, C is a substitute for B, and A, B, and C all belong to the same category, then we can also consider C a substitute for A.
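Soft-logic frameworks such as probabilistic soft logic typically relax rules like these with the Łukasiewicz t-norm, turning each rule into a "distance to satisfaction" penalty. A minimal sketch, with made-up probabilities for illustration:

```python
def luk_and(a, b):
    """Łukasiewicz t-norm for conjunction: truth(A AND B) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def rule_penalty(p_premise, p_conclusion):
    """Distance to satisfaction of 'premise => conclusion': zero when the
    conclusion is at least as true as the premise, linear in the violation."""
    return max(0.0, p_premise - p_conclusion)

# rule: substitute(A, B) AND substitute(B, C) => substitute(A, C)
p_ab, p_bc, p_ac = 0.9, 0.8, 0.5   # the model's current (made-up) probabilities
penalty = rule_penalty(luk_and(p_ab, p_bc), p_ac)
print(round(penalty, 2))  # → 0.2: the rule is partially violated
```

Because the penalty is graded rather than all-or-nothing, a rule that is usually but not always correct nudges the model without ever forcing it, which is exactly the point of using soft constraints.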

In summary, the authors manually define a series of rules, or constraints, hoping to extract as much signal as possible from the limited data. Of course, such constraints are not always correct, which is exactly why the authors express them as "soft" logical constraints: satisfying a rule becomes a matter of probability rather than a hard requirement.

The proposed model ultimately amounts to a single optimization objective with three components: the initial global representation of items, the learning of relation-specific projections, and the learning of the soft logic constraints. Together, these three components form the final objective function.
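Schematically, the three parts combine into one loss. The scoring function and trade-off weights below are my own illustrative choices, not the paper's exact objective:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_score(e_a, e_b, g_r):
    """Score a pair under relation r on relation-specific representations,
    here obtained with a learned elementwise gate g_r (one common way to
    'project' a global embedding; an assumption, not the paper's formula)."""
    return sigmoid((e_a * g_r) @ (e_b * g_r))

def total_loss(pair_loss, relation_loss, logic_penalty, lam_r=1.0, lam_l=0.5):
    """The final objective combines the three components described above;
    lam_r and lam_l are illustrative trade-off weights."""
    return pair_loss + lam_r * relation_loss + lam_l * logic_penalty

# toy check with made-up numbers
e_a = np.array([1.0, 1.0, 0.0])
e_b = np.array([1.0, 0.5, 0.0])
g_sub = np.array([1.0, 1.0, 0.0])   # hypothetical "substitute" gate
score = relation_score(e_a, e_b, g_sub)
```

Minimizing the combined loss jointly fits the global embeddings, the relation-specific views, and the soft-logic rules, rather than training each part in isolation.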

Experimental Results of the Method #

The experiments use five major product categories from JD.com; the amount of product data far exceeds that of a previously released public dataset from Amazon. The authors primarily compare against a model from a team at the University of California, San Diego, several classical matrix factorization models, and a collaborative filtering-based model.

Overall, the proposed model outperformed the baselines by a large margin on both the relation prediction subtask and the final ranking task. Furthermore, the authors demonstrated that the logical constraints do help the objective function distinguish between substitute and complementary products.

Summary #

Today I discussed a paper from the JD Data Science team at WSDM 2018. The paper shows how to use multi-relation learning and fuzzy logic to mine substitute and complement information about products, and in turn train more effective ranking algorithms.

Let’s recap the key points: First, we briefly introduced the author group information of this article. Second, we detailed the problem this article aims to solve and its contributions. Third, we briefly introduced the core content of the proposed method and the results of the experiments.

Finally, I’ll leave you with a question to ponder: Is the relationship between complementary products or substitute products bidirectional or unidirectional, and why?