016 The Web 2018 Paper Review - How to Model the Aesthetics of Product Images #

“The Web Conference” (formerly known as “International World Wide Web Conference”) has been held since 1994 and has a history of more than 20 years. It is ranked first in the “Information Systems” category of international top academic conferences on Google Scholar.

Since its inception, The Web Conference has become a unique and authoritative academic conference in the field of the Internet. The conference includes excellent papers in various fields such as search, recommendation, advertising, databases, information extraction, and Internet security. Every year, it attracts thousands of scholars and engineers from around the world to share their latest research results.

The 2018 Web Conference was held in Lyon, France from April 23rd to 27th. The conference included 171 papers, 27 workshops, 19 tutorials, 61 poster papers, and 30 demos.

One characteristic of The Web Conference is that its papers cover a wide range of fields, so finding the ones most worth studying is time-consuming and laborious. Here, I will share several papers that I consider the most valuable at this year’s conference, in the hope that they offer a useful starting point.

Today, let’s take a look at a nominated paper titled “Aesthetic-based Clothing Recommendation.” The paper has six authors. Most are from Tsinghua University; the other two are from the National University of Singapore and Emory University in the United States.

Main Contributions of the Paper #

In modern e-commerce recommendation systems, the appearance and quality of product images, especially for clothing and apparel, are crucial factors that influence purchasing decisions. Some previous recommendation systems have taken into account the properties of images, particularly by attempting to utilize both image and text information for multi-modal data understanding in order to provide more intelligent recommendations. However, most current approaches only consider basic image characteristics.

In terms of methodology, most similar works employ some form of deep neural network to extract image features, which are then combined with other features (such as text information mentioned earlier) to broaden our understanding of product information. However, the extracted image features typically do not explicitly model the “aesthetic” aspect of the images.

The authors of this paper believe that the “aesthetic” aspect of product images is a highly important attribute and that modeling it can significantly improve product recommendations. One contribution of the paper is therefore a model that captures the aesthetic and the general semantic properties of images at the same time. According to the authors, previous work has not covered this; the architecture of the model is described in detail in the section on core methods.

Once the authors have extracted the aesthetic information from the images, the next issue is how to utilize these features. This paper adopts the approach of tensor factorization. As mentioned earlier when introducing recommendation systems, tensor factorization is an effective and commonly used recommendation model that utilizes contextual semantic information. Similarly to some previous works, the authors employ a three-dimensional tensor to express the relationships between users, products, and time. Additionally, the authors effectively incorporate image information into the tensor factorization framework, allowing the aesthetic information to influence the recommendation results.

Core Methods of the Paper #

After understanding the general idea of this paper, let’s take a look at the first core component of the paper: how to use deep neural networks to extract aesthetic information from images.

First, the proposed model assumes that each product image comes with an overall aesthetic label plus a set of fine-grained labels describing its “image style”. The overall aesthetic label is a score from 1 to 10, while the image styles are textual attributes such as “high exposure” or “contrasting colors”. We therefore need a neural network that models the aesthetic score and the fine-grained image styles at the same time.

Specifically, the model proposed in this paper has two levels. The first level models the fine-grained image styles. The dataset used in this paper has 14 image styles, and the authors use 14 sub-networks, one per style. Each sub-network is an independent, standard “convolutional neural network” (CNN) whose goal is to learn features that represent its image style as well as possible.

Once we have the 14 first-level sub-networks, the features they learn are merged into an intermediate feature layer, which is then passed through a further convolutional network that predicts the overall aesthetic score of the product.

In the paper, the authors mention that these two-level neural networks are not trained separately, but trained as a whole. This means that we simultaneously train the parameters of the 14 sub-networks at the bottom level and the parameters of the neural network for the higher-level aesthetic score.
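The two-level structure described above can be illustrated with a toy forward pass. This is only a structural sketch, not the authors' network: each sub-network is replaced by a single dense layer with ReLU instead of a CNN, the weights are random, and the sizes `INPUT_DIM` and `STYLE_DIM` are hypothetical values chosen for illustration. Only the count of 14 styles comes from the paper as described here.

```python
import random

NUM_STYLES = 14   # number of image styles in the paper's dataset
INPUT_DIM = 8     # hypothetical size of a flattened image descriptor
STYLE_DIM = 4     # hypothetical size of each sub-network's feature vector

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_matrix(rows, cols):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense_relu(weights, x):
    """One dense layer followed by ReLU; a stand-in for a full CNN."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


# Level 1: one independent sub-network per image style.
style_nets = [random_matrix(STYLE_DIM, INPUT_DIM) for _ in range(NUM_STYLES)]

# Level 2: maps the merged style features to one overall aesthetic score.
score_weights = [rng.uniform(-0.5, 0.5) for _ in range(NUM_STYLES * STYLE_DIM)]


def forward(image):
    """Return the merged intermediate features and the aesthetic score."""
    style_features = [dense_relu(net, image) for net in style_nets]
    merged = [v for feats in style_features for v in feats]
    score = sum(w * v for w, v in zip(score_weights, merged))
    return merged, score


image = [rng.random() for _ in range(INPUT_DIM)]
merged, score = forward(image)
print(len(merged))  # 14 sub-networks x 4 features each = 56
```

Because the score is computed from the outputs of all 14 sub-networks, an error signal on the aesthetic score would reach the weights of every sub-network as well as the top level, which is what training the two levels "as a whole" means in practice.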

After obtaining the aesthetic information of the images, the next step is to see how to use tensor decomposition for product recommendation.

Compared with traditional tensor decomposition, the authors propose a novel tensor formulation for product recommendation, called “Dynamic Collaborative Filtering” (DCF).

DCF assumes that a user’s purchase decision depends on two factors: first, whether the user has a preference for the product, and second, whether the product is popular at the current point in time. The authors argue that a purchase happens only when both conditions hold at once, that is, when the user likes a product that is also in season. They therefore use two matrix factorizations to represent these two assumptions.

The first matrix factorization is applied to the user-product matrix and learns how much each user prefers each product. The second is applied to the time-product matrix and learns how popular each product is in each time period. The authors then multiply the two factorized scores, for every combination of user, product, and time period, to obtain a tensor that represents user preferences for products over time.
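As a rough illustration of this idea, the sketch below scores a (user, product, time) triple as the product of a preference term and a popularity term, each given by a dot product of latent factors. The factor names `U`, `V`, `T`, `W`, the two-dimensional latent space, and all numeric values are assumptions made up for the example; in a real system these factors would be learned from purchase records.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical latent factors for 2 users, 3 products, and 2 time periods.
U = [[1.0, 0.0], [0.5, 0.5]]                # user factors
V = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]    # product factors (preference side)
T = [[1.0, 1.0], [0.0, 2.0]]                # time-period factors
W = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]    # product factors (popularity side)


def dcf_score(u, i, t):
    """Score is high only if the user likes product i AND it is popular at time t."""
    preference = dot(U[u], V[i])   # from the user-product factorization
    popularity = dot(T[t], W[i])   # from the time-product factorization
    return preference * popularity


# Filling in every (user, product, time) combination yields the 3-D tensor.
tensor = [
    [[dcf_score(u, i, t) for t in range(len(T))] for i in range(len(V))]
    for u in range(len(U))
]

print(dcf_score(0, 0, 0))  # preference 1.0 * popularity 1.0 = 1.0
print(dcf_score(0, 1, 0))  # preference 0.0, so the score is 0.0
```

The multiplication acts like a soft "and": in the second call the product is popular at that time, but the user has no preference for it, so the overall score is zero.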

So, how do we incorporate the learned aesthetic information of the images into this new tensor learning framework? The authors do so by “extending” the two matrix factorizations described above.

Recall that this tensor decomposition rests on the assumption that a user’s purchase decision at a given time depends on whether the user has a preference for the product and whether the product is popular at that time, and that two matrix factorizations express these two assumptions. Each factorization decomposes a large matrix into two low-rank factor matrices; for example, the user-product matrix is decomposed into user latent features and product latent features.

Building on this, the authors add a term based on the image aesthetic information to the user-product factorization, mixing the two types of information. In other words, the first assumption, the user’s preference for the product, is extended to the sum of two matrices: the user-product matrix and a matrix that captures the product’s image aesthetic information. Similarly, time-dependent popularity is extended to the sum of the time-product matrix and a matrix of product image aesthetic information. That is, the new model is a tensor decomposition formed from the product of two matrices, each of which is itself the sum of two matrices. This is the final model proposed by the authors.
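The "product of two sums" structure can be sketched as follows. How exactly the aesthetic term is parameterized is an assumption here: the sketch gives each user a vector `M[u]` and each time period a vector `N[t]` that are dotted with a product's aesthetic feature vector `F[i]`. These names and all numeric values are hypothetical and only illustrate the shape of the model described above.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical latent factors for 1 user, 2 products, and 1 time period.
U = [[1.0, 0.0]]                 # user factors
V = [[1.0, 1.0], [2.0, 0.0]]     # product factors (preference side)
T = [[1.0, 0.0]]                 # time-period factors
W = [[1.0, 0.0], [0.5, 0.5]]     # product factors (popularity side)

# Aesthetic features per product, e.g. taken from the aesthetic network.
F = [[1.0, 0.0], [0.0, 1.0]]
M = [[0.5, 0.0]]   # assumed: each user's weighting of aesthetic features
N = [[0.0, 1.0]]   # assumed: each time period's weighting of aesthetic features


def aesthetic_dcf_score(u, i, t):
    """Each factor is a sum of a latent term and an aesthetic term."""
    preference = dot(U[u], V[i]) + dot(M[u], F[i])
    popularity = dot(T[t], W[i]) + dot(N[t], F[i])
    return preference * popularity


print(aesthetic_dcf_score(0, 0, 0))  # (1.0 + 0.5) * (1.0 + 0.0) = 1.5
print(aesthetic_dcf_score(0, 1, 0))  # (2.0 + 0.0) * (0.5 + 1.0) = 3.0
```

Setting `M` and `N` to zero vectors recovers the plain two-factorization score, which makes it easy to see the aesthetic features as an additive correction to each factor.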

Experimental Results #

The authors ran experiments on an Amazon clothing dataset to validate the proposed model. The dataset contains nearly 40,000 users, over 20,000 products, and more than 270,000 purchase records. In addition to the proposed model, the authors evaluated several baselines: a completely random algorithm, a popularity-based recommendation algorithm, a traditional matrix factorization model, and an algorithm that uses only basic image information without aesthetic features. The paper reports ranking metrics including NDCG and recall.

The experimental results show that the proposed model outperforms both the matrix factorization model and the algorithm that uses only basic image information, which indicates the value of modeling product image aesthetics. The results also suggest that the authors’ tensor factorization method is effective.
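For readers unfamiliar with these two metrics, here is a minimal sketch of recall@k and NDCG@k for a single user with binary relevance (a product is relevant if the user actually bought it). The cutoff `k`, the binary relevance definition, and the example product names are assumptions for illustration; this review does not state the exact evaluation settings the paper uses.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the top k, normalized by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(pos + 2)            # position 0 gets discount log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)       # best case: all relevant items first
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical recommendation list and purchases for one user.
ranked = ["shirt", "dress", "coat", "scarf"]
purchased = {"dress", "scarf"}

print(recall_at_k(ranked, purchased, 3))  # 1 of 2 purchases in the top 3: 0.5
print(ndcg_at_k(ranked, purchased, 3))
```

Recall only asks whether purchased products appear in the top k, while NDCG also rewards placing them near the top of the list, which is why recommendation papers often report both.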

Conclusion #

Today I have presented an excellent paper from this year’s World Wide Web Conference. The paper introduced how to model the aesthetic appeal of product images and incorporate the extracted information into a recommendation system based on tensor decomposition.

Let’s recap the key points together: First, we provided a detailed introduction to the problem that the paper aims to solve and its contributions. Second, we briefly introduced the core content of the proposed method. Third, we briefly shared the experimental results of the model.

Finally, I will leave you with a question to ponder: Is there a way to model the aesthetic appeal of images without having any labels?