28 Model Training III - Detailed Explanation of Collaborative Filtering and Frequent Itemset Algorithms #

Hello, I’m Wu Lei.

If you often use TikTok or enjoy watching movies, you may have had the experience of apps like these seemingly reading your mind and always recommending content that suits your taste. In fact, this kind of personalized recommendation heavily relies on recommendation algorithms in machine learning.

Today, in the last lecture of our model training series, we will explain the collaborative filtering and frequent itemset algorithms supported by Spark MLlib, using two interesting movie recommendation scenarios as examples. Just like in the previous lecture, let’s first take a look at the “panorama” below, which will give you an overview of the knowledge you have learned and will learn.

[Image: knowledge panorama of the model training lectures]

Movie Recommendation Scenario #

In today’s lecture, we will use the MovieLens dataset from Kaggle to build a simple movie recommendation engine using different algorithms. Although the MovieLens dataset contains multiple files, the main file used in this course is ratings.csv. Each entry in this file records a user’s rating for a movie, as shown in the table below.

[Image: sample records from ratings.csv (userId, movieId, rating)]

In this table, the first column userId represents the user ID, movieId represents the movie ID, and rating represents the user’s rating for the movie. With a simple two-dimensional table like this, what can we do?

Don’t underestimate this data. With the right model algorithm, we can use it to build a basic recommender engine. In the Spark MLlib framework, there are at least two model algorithms that can achieve this: collaborative filtering and frequent pattern mining. The former is naturally designed for recommendation purposes, while the latter is a common unsupervised learning algorithm that can be flexibly applied to recommendation scenarios based on data characteristics.

Collaborative Filtering #

Let’s start with collaborative filtering. Literally, “filtering” is the goal, while “collaborative” describes the method. Simply put, the purpose of collaborative filtering is to “filter”, out of a full candidate set (such as all available movies), the subset of items a user is likely to be interested in. “Collaborative” means that this filtering is driven by group behavior, that is, the interactions between all users and all items.

It may sound a bit confusing, but the core idea of collaborative filtering is simple: “similar people tend to like similar sets of items”.

The user-item interaction matrix built from these ratings (rows are users, columns are movies, cells are ratings) may look simple, but it contains a lot of similarity information. With the help of an appropriate model algorithm, we can mine the similarity between users, the similarity between items, and the similarity between users and items. Once these similarities are quantified, we can naturally use them for recommendation based on similarity. Isn’t the idea simple?

Now the question is, how do we quantify these similarities? The answer is: matrix factorization.

[Image: factorizing the interaction matrix C into a user matrix U and an item matrix I]

Mathematically, given an interaction matrix C with dimensions (M, N), where M is the number of users and N is the number of movies, we can (approximately) decompose it into the product of two matrices U and I. We refer to U as the “user matrix”, with dimensions (M, K), and to I as the “item matrix”, with dimensions (K, N).
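In formula form (a sketch of the shape of the factorization; the product only approximates C, since most of C’s entries are unobserved):

$$
C_{M \times N} \approx U_{M \times K} \times I_{K \times N}
$$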

In the user matrix and item matrix, K is a hyperparameter set by the developer. It is easy to see that each row in the user matrix U can be regarded as the user’s embedding, which represents the user’s feature vector. Similarly, each column in the item matrix can be regarded as the item’s embedding, representing the item’s feature vector.

As the saying goes, everything can be embedded. Once things are mapped to the same vector space, we can use methods such as Euclidean distance or cosine similarity to calculate the similarity between their vectors, thus quantifying various similarities (user-user, item-item, user-item).

Based on similarity calculation, we can implement various types of recommendations in different ways. For example, for user A, we can first search for the top 5 users who are most similar to him/her, and then recommend the movies that these users have liked to user A. This type of recommendation is called user-based recommendation.

Alternatively, starting from the items that user A has liked, we can search for the top 5 items most similar to them and recommend those similar items to user A. This is called item-based recommendation.

In some cases, we can even directly calculate the similarity between user A and all items, and then recommend the top 5 items directly to user A.

Based on the above logic, we can also reverse the approach and recommend users to items (movies) from the perspective of items. It is easy to see that once the embedding transformation is completed, we can flexibly design a recommendation system based on similarity calculation.
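To make the similarity calculation concrete, here is a minimal sketch in plain Scala (not a Spark MLlib API); the names cosineSimilarity and topSimilarUsers, and the in-memory embedding map, are illustrative assumptions:

// Cosine similarity between two embedding vectors
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0 || normB == 0) 0.0 else dot / (normA * normB)
}

// Given a user's ID and a map of all user embeddings,
// return the IDs of the 5 most similar users (excluding the user itself)
def topSimilarUsers(userId: Int, embeddings: Map[Int, Array[Float]]): Seq[Int] = {
  val target = embeddings(userId)
  embeddings.toSeq
    .filter { case (id, _) => id != userId }
    .map { case (id, vec) => (id, cosineSimilarity(target, vec)) }
    .sortBy { case (_, sim) => -sim }
    .take(5)
    .map { case (id, _) => id }
}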

Now, the next question is, how can we use Spark MLlib to obtain the user matrix and item matrix after factorization, get the embeddings of users and items, and finally design a simple recommendation engine based on the original interaction matrix?

As usual, let’s start with the code and use it to demonstrate this process.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, IntegerType}

// rootPath represents the root directory of the dataset
val rootPath: String = _
val filePath: String = s"${rootPath}/ratings.csv"

var data: DataFrame = spark.read.format("csv").option("header", true).load(filePath)

// Type casting: convert the ID fields to Integer and the rating field to Float
data = data.withColumn("userIdInt", col("userId").cast(IntegerType)).drop("userId")
data = data.withColumn("movieIdInt", col("movieId").cast(IntegerType)).drop("movieId")
data = data.withColumn("ratingFloat", col("rating").cast(FloatType)).drop("rating")

// Splitting the training and test datasets
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))

First, we prepare the training samples by creating a DataFrame from ratings.csv and converting the respective fields to the correct types for later use. Second, we define and fit the model to perform matrix factorization in collaborative filtering.

import org.apache.spark.ml.recommendation.ALS

// Build the model using ALS (Alternating Least Squares) for matrix factorization
val als = new ALS()
.setUserCol("userIdInt")
.setItemCol("movieIdInt")
.setRatingCol("ratingFloat")
.setMaxIter(20)

val alsModel = als.fit(trainingData)
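
One thing the snippet above leaves implicit: the embedding dimension K from the matrix factorization discussion corresponds to ALS’s rank parameter, and the regularization strength is controlled by regParam. A variant of the model definition with these set explicitly might look like the sketch below (the values shown are Spark’s defaults, not tuned choices):

// Illustrative variant: setting the embedding dimension and regularization explicitly
val alsExplicit = new ALS()
  .setUserCol("userIdInt")
  .setItemCol("movieIdInt")
  .setRatingCol("ratingFloat")
  // rank is the embedding dimension K; 10 is Spark's default
  .setRank(10)
  // regParam is the regularization strength; 0.1 is Spark's default
  .setRegParam(0.1)
  .setMaxIter(20)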

It’s worth mentioning that for collaborative filtering, Spark MLlib does not compute the matrix factorization with an exact, analytical solution; instead, it uses an iterative approximation method: ALS (Alternating Least Squares).

Specifically, given the interaction matrix C, Spark first initializes the user matrix U. With U held fixed, solving for the item matrix I reduces to a least-squares regression problem. Then, with I held fixed, U is solved for in the same way. The two steps alternate, and after several rounds U and I converge, at which point training ends.
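
If you prefer to see the objective, each alternating step solves a regularized least-squares problem. Glossing over Spark’s exact regularization weighting, the quantity being minimized looks roughly like this, where Ω is the set of observed (user, movie) ratings, u_u and i_i are the embedding vectors, and λ is the regularization strength:

$$
\min_{U,\, I} \sum_{(u,\, i) \in \Omega} \left( c_{ui} - \mathbf{u}_u^{\top} \mathbf{i}_i \right)^2 + \lambda \left( \sum_{u} \lVert \mathbf{u}_u \rVert^2 + \sum_{i} \lVert \mathbf{i}_i \rVert^2 \right)
$$

With I fixed, each u_u has a closed-form least-squares solution, and vice versa, which is why each alternating step reduces to a regression problem.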

Because Spark converts matrix factorization into a regression problem, we can use regression-related metrics to measure the training effectiveness of the ALS model, as shown below.

import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
// Set the metric to RMSE
.setMetricName("rmse")
.setLabelCol("ratingFloat")
.setPredictionCol("prediction")

val predictions = alsModel.transform(trainingData)
// Calculate RMSE
val rmse = evaluator.evaluate(predictions)

After validating the model’s performance, we can safely retrieve the trained user matrix U and item matrix I from the model. These matrices store the user embeddings and item embeddings, respectively.

alsModel.userFactors
// org.apache.spark.sql.DataFrame = [id: int, features: array<float>]

alsModel.userFactors.show(1)
/** Result:
+---+--------------------+
| id| features|
+---+--------------------+
| 10|[0.53652495, -1.0...|
+---+--------------------+
*/

alsModel.itemFactors
// org.apache.spark.sql.DataFrame = [id: int, features: array<float>]

alsModel.itemFactors.show(1)
/** Result:
+---+--------------------+
| id| features           |
+---+--------------------+
| 10|[1.1281404, -0.59...|
+---+--------------------+
*/
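
If you want to do this kind of similarity search by hand, one option (reasonable only when the factor tables are small enough to fit on the driver) is to collect the embeddings into an in-memory map and feed it to a helper like the cosineSimilarity sketch shown earlier:

// Collect user embeddings into a Map from userId to embedding vector
// (only advisable when the number of users is small)
val userEmbeddings: Map[Int, Array[Float]] = alsModel.userFactors
  .collect()
  .map(row => row.getInt(0) -> row.getSeq[Float](1).toArray)
  .toMap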

As mentioned earlier, with the embedding of users and items, we can design a recommendation engine flexibly. If we want to take a shortcut, we can also use the API provided by Spark MLlib to do recommendations. Specifically, we can use the relevant methods of the ALS model to recommend items to users or recommend users to items, as shown below.

// Recommend 10 movies for all users
val userRecs = alsModel.recommendForAllUsers(10)

// Recommend 10 users for each movie
val movieRecs = alsModel.recommendForAllItems(10)

// Recommend 10 movies for specific users
val users = data.select(als.getUserCol).distinct().limit(3)
val userSubsetRecs = alsModel.recommendForUserSubset(users, 10)

// Recommend 10 users for specific movies
val movies = data.select(als.getItemCol).distinct().limit(3)
val movieSubSetRecs = alsModel.recommendForItemSubset(movies, 10)
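
Note that these methods return one row per user (or per movie), with the recommendations packed into an array of (id, rating) structs. If a flat (user, movie, predicted rating) table is more convenient, the array can be exploded; here is a sketch using the column names configured on als above:

import org.apache.spark.sql.functions.{col, explode}

// Flatten the nested recommendations into one row per (user, movie) pair
val flatUserRecs = userRecs
  .select(col("userIdInt"), explode(col("recommendations")).alias("rec"))
  .select(
    col("userIdInt"),
    col("rec.movieIdInt").alias("movieIdInt"),
    col("rec.rating").alias("predictedRating")
  )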

So far, we have introduced the core idea and working principle of collaborative filtering, and implemented a simple movie recommendation engine using the ALS algorithm provided by Spark MLlib. Next, let’s think about whether there are any other ideas to create a different recommendation engine.

Frequent Itemsets #

Frequent itemset mining is a classic data mining technique that falls under unsupervised learning. It scans a dataset to discover groups of items that frequently appear together and tries to establish association rules between them, providing support for decision making.

For example, based on statistical analysis of millions of transaction records, a grocery store found that “onion”, “ginger”, and “garlic” often appear together. In other words, people who buy onions and ginger often also pick up a few cloves of garlic, and people who buy onions often add ginger and garlic to their basket before checkout.

In this shopping-basket example, (“onion”, “ginger”, “garlic”) is a frequent itemset, that is, a collection of items that frequently co-occur. Rules such as (“onion”, “ginger” -> “garlic”) and (“onion” -> “ginger”, “garlic”) are called association rules.

It is not difficult to see that based on frequent itemsets and association rules, we can provide simple recommendation capability. Using the example of (“onion”, “ginger”, “garlic”) mentioned earlier, for those who are holding a large onion and are about to check out, a savvy salesperson can recommend the newly arrived Hebei garlic or Shandong fresh ginger.

Returning to the movie recommendation scenario, we can also mine frequent itemsets and association rules from historical data. For example, the movies (“The Eight Hundred”, “The Sacrifice”, “The Battle at Lake Changjin”) might form a frequent itemset, with an association rule (“The Eight Hundred”, “The Sacrifice”) -> “The Battle at Lake Changjin”. So, for users who have watched “The Eight Hundred” and “The Sacrifice”, we are more inclined to assume that they will also like “The Battle at Lake Changjin”, and we recommend this movie to them.

So, based on the MovieLens dataset and the development framework of Spark MLlib, how can we mine frequent itemsets and association rules?

The first step is data preparation. In the grocery store example, the store collects, transaction by transaction, the items each customer bought together. Similarly, to compute frequent itemsets on the MovieLens dataset, we need to collect, for each user, all the movies that user has watched, as shown in the diagram below.

[Image: grouping the ratings data by user to collect each user's list of watched movies]

To achieve this transformation, we only need a single statement.

import org.apache.spark.sql.functions.collect_list

// data is the raw DataFrame read from ratings.csv (before the type conversions above)
val movies: DataFrame = data
  // Group by user
  .groupBy("userId")
  // Collect all movies watched by the user into a new column named movieSeq
  .agg(collect_list("movieId").alias("movieSeq"))
  // Keep only the movieSeq column and remove the other columns
  .select("movieSeq")

// movies: org.apache.spark.sql.DataFrame = [movieSeq: array<string>]
movies.show(1)
/**
Print Result
+--------------------+
| movieSeq|
+--------------------+
|[151, 172, 236, 2...|
+--------------------+
*/

After preparing the data, next, we can use the Spark MLlib framework to calculate frequent itemsets.

import org.apache.spark.ml.fpm.FPGrowth

val fpGrowth = new FPGrowth()
  // Specify the input column
  .setItemsCol("movieSeq")
  // Hyperparameter, minimum support coefficient for frequent items
  .setMinSupport(0.1)
  // Hyperparameter, minimum confidence coefficient for association rules
  .setMinConfidence(0.1)

val model = fpGrowth.fit(movies)

As we can see, defining and fitting a frequent itemset model is relatively simple, and the usage is similar to other model algorithms. However, there are two hyperparameters that need special attention, one is the minimum support coefficient set by setMinSupport, and the other is the minimum confidence coefficient specified by setMinConfidence.

The minimum support coefficient is used to set the “selection threshold” for frequent items. Here, we set it to 0.1. What does this mean?

For example, the MovieLens dataset contains 7,120 users in total, so the movies DataFrame contains 7,120 movie sequences. For the combination (“The Eight Hundred”, “The Battle at Lake Changjin”, “The Sacrifice”) to be considered a frequent itemset by the algorithm, it must appear in at least 712 (7,120 * 0.1) of those sequences. In other words, the higher the minimum support coefficient, the fewer, but more reliable, frequent itemsets the algorithm finds, and vice versa.

Similarly, the minimum confidence coefficient is used to constrain association rules; in the example it is also set to 0.1. Let’s illustrate it with another example. Suppose the combination (“The Eight Hundred”, “The Battle at Lake Changjin”) appears together in 1,000 of the 7,120 movie sequences. For the association rule (“The Eight Hundred”, “The Battle at Lake Changjin” -> “The Sacrifice”) to hold, the combination (“The Eight Hundred”, “The Battle at Lake Changjin”, “The Sacrifice”) must appear at least 100 times (1,000 * 0.1). Likewise, the higher the minimum confidence coefficient, the fewer, but more reliable, association rules the algorithm finds, and vice versa.
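
To put the two thresholds into formulas (N is the total number of movie sequences, X and Y are itemsets, and count(X) is the number of sequences containing all items of X):

$$
\text{support}(X) = \frac{\text{count}(X)}{N}, \qquad \text{confidence}(X \Rightarrow Y) = \frac{\text{count}(X \cup Y)}{\text{count}(X)}
$$

The model keeps only itemsets whose support is at least minSupport and rules whose confidence is at least minConfidence.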

After training the model, we can obtain frequent itemsets and association rules from it, as shown below.

model.freqItemsets.show(1)
/**
Print Result
+--------------------+----+
| items|freq|
+--------------------+----+
|[318, 593, 356, 296]|1465|
+--------------------+----+
*/

model.associationRules.show(1)
/**
Print Result
+--------------------+----------+------------------+
| antecedent|consequent| confidence|
+--------------------+----------+------------------+
|[592, 780, 480, 593]| [296]|0.8910463861920173|
+--------------------+----------+------------------+
*/

Based on the association rules, we can provide preliminary recommendation functionality. For example, for a user who has watched the four movies in the antecedent above (movie IDs 592, 780, 480, and 593), we can recommend the movie with ID 296 to him/her.
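
Spark MLlib can also apply the mined rules for us: FPGrowthModel.transform takes a DataFrame with the same items column and adds a prediction column listing, for each row, the consequents of all rules whose antecedents are contained in that row’s itemset. A minimal sketch, reusing the movies DataFrame from earlier:

// For each user's movie sequence, derive recommendations from the association rules;
// the added "prediction" column holds the recommended movieIds
val ruleBasedRecs = model.transform(movies)
ruleBasedRecs.show(1)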

Key Review #

Alright, we have completed the three sections on model training! These sections covered a lot of content and included many algorithms. In order to give you a comprehensive understanding of them, I have organized the classification, principles, characteristics, and applicable scenarios of these algorithms into the following table for you to review at any time.

[Image: summary table of the algorithms covered in the model training lectures (classification, principles, characteristics, and applicable scenarios)]

It is not difficult to see that there are numerous scenarios in machine learning, and in different scenarios, there are various algorithms for us to choose from. Mastering the principles and characteristics of these algorithms helps us efficiently select and train models to solve specific problems in different scenarios.

Regarding algorithm optimization and application, you still need to further validate and consolidate them through daily practice. I also welcome you to share your experiences and insights in the comment section. Let’s keep going together!

Practice after each lesson #

Can you talk about the advantages and disadvantages of the two recommendation approaches introduced in this lecture (collaborative filtering and frequent itemsets)?

Feel free to share any learning gains or questions you have with me. See you in the comment section!