Techletter #94 | October 12, 2024
This week let’s build a simple recommendation engine. Recommendation engines are really complex pieces of software. So, in this letter, I try to build a simple one based on the book Guide 2 Data Mining.
As more and more data get’s generated, it would be very difficult for people to choose among thousands of products or would be difficult to watch a video or listen to a song that is perfect for you without a good recommendation engine.
I personally use YouTube Premium, and I have experienced better video suggestions than just using without a premium.
Collaborative filtering
It’s called collaborative because it makes recommendations based on other people in effect, people collaborate to come up with recommendations.
Suppose the task is to recommend a book to you. I search among other users of the site to find one that is similar to you in the books a person enjoys. Once I find that similar person I can see what that person likes and recommend those books to you. So now, how do we find the other person who has a similar interest just like you?
Here’s a simple 2D (dimensional) explanation. Suppose users rate books on a 5-star system zero stars mean the book is terrible, and 5 stars mean the book is great. Because I said we are looking at the simple 2D case, we restrict our ratings to two books: Neal Stephenson’s Snow Crash and Steig Larsson’s The Girl with the Dragon Tattoo.
Now, I would like to recommend a book to the mysterious Ms. X who rated Snow Crash 4 stars, and The Girl with the Dragon Tattoo 2 stars. The first task is to find the person who is most similar, or closest, to Ms. X. I do this by computing distance.
Manhattan distance:
The easiest distance measure to compute is what is called Manhattan Distance or cab driver distance. In the 2D case, each person is represented by an (x, y) point. I will add a subscript to the x and y to refer to different people. So (x1, y1) might be Amy, and (x2, y2) might be the elusive Ms. X. Manhattan Distance is then calculated by
|x1 - x2| + | y1 - y2 |
The absolute value of the difference between the x values plus the absolute value of the difference between the y values).
Now let’s calculate the distance for Amy:
|5 - 4| + |5 - 2| => 4
So the Manhattan Distance for Amy and Ms. X is 4.
Below is the Python code to calculate the Manhattan distance:
def manhattan_distance(rating_1, rating_2):
distance = 0
for key in rating_1:
if key in rating_2:
distance += abs(rating_1[key] - rating_2[key])
return distance
result = manhattan_distance(users["Hailey"], users["Veronica"])
print(result)
Now, let’s compute the nearest neighbor
def compute_nearest_neighbor(username, users):
distances = []
for user in users:
if user != username:
distance = manhattan_distance(users[user], users[username])
distances.append((distance, user))
distances.sort()
return distances
# user to whom you need to compute the nearest neighbor
nearest_neighbors = compute_nearest_neighbor("Hailey", users)
In case if you want to look at the complete code then you can look at this repo: gitHub.com/prajwalhaniya/_archive
Recommend after sorting
def recommend(username, users):
nearest_neighbor = compute_nearest_neighbor(username, users)[0][1]
recommendations = []
neighbor_ratings = users[nearest_neighbor]
user_ratings = users[username]
for artist in neighbor_ratings:
if not artist in user_ratings:
recommendations.append((artist, neighbor_ratings[artist]))
return sorted(recommendations, key = lambda artist_tuple: artist_tuple[1], reverse = True)
recommendations = recommend("Hailey", users)