Recommendation systems aim to find items that users might be interested in, given a set of characteristics. They are widely used by online stores and websites such as Netflix (Mitra et al. 2016). The process of creating personalised recommendations for users is described in detail by Adomavicius & Tuzhilin (2005).
Leskovec et al. (2014) state that there are two main architectures for recommendation systems. The first are content-based systems, which focus on the characteristics of the items: users are recommended items that are similar to the ones they have already consumed. The second are collaborative filtering systems, which focus on the relationship between customers and items. Ansari et al. (1999) describe collaborative filtering as an algorithm that mimics word-of-mouth communication, because it suggests to customers the items that people similar to them have purchased.
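To make the word-of-mouth idea concrete, here is a minimal sketch of user-based collaborative filtering. It is not part of the notebook's pipeline: the purchase histories, the `jaccard` helper and the `recommend_for` function are all hypothetical, and Jaccard overlap is just one simple choice of similarity between customers.

```python
# Toy purchase histories (hypothetical data, for illustration only).
purchases = {
    "ana": {"pizzeria", "sushi bar", "tapas bar"},
    "ben": {"pizzeria", "sushi bar", "steakhouse"},
    "cleo": {"vegan cafe", "tea room"},
}


def jaccard(a, b):
    """Share of items two customers have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for(user, purchases):
    """Suggest items bought by the most similar other customer."""
    others = [u for u in purchases if u != user]
    # Pick the customer whose purchase history overlaps most with `user`.
    nearest = max(others, key=lambda u: jaccard(purchases[user], purchases[u]))
    # Recommend what that customer bought and `user` has not.
    return sorted(purchases[nearest] - purchases[user])


print(recommend_for("ana", purchases))  # ['steakhouse']
```

A content-based system would instead compare the items themselves, which is the approach taken in the second half of this notebook.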
This notebook shows how to create a simple recommendation system using TripAdvisor data, with the aim of recommending restaurants. I first create a simple system that ranks all restaurants and returns the top rated. I then create a system that gives recommendations based on a particular restaurant: given the name of a restaurant, the algorithm returns a list of similar ones.
# import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud
from math import log, sqrt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import warnings
import nltk
nltk.download('punkt')
warnings.filterwarnings('ignore')
%matplotlib inline
# read data
TAdata = pd.read_csv('./TA_restaurants_curated.csv')
# work on a copy so the raw data stays untouched
data = TAdata.copy()
# look at data
data.head()
data.shape
# keep only the columns we are using (selected by position)
data = data.iloc[:, [1, 2, 3, 4, 5, 6, 7, 8, 10]].copy()
# replace price range values with 'cheap', 'medium' and 'expensive'
data['Price Range'] = data['Price Range'].replace(['$', '$$ - $$$', '$$$$'], ['cheap', 'medium', 'expensive'])
# make city and name of restaurant lowercase
data['City'] = data['City'].str.lower()
data['Name'] = data['Name'].str.lower()
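Simple recommenders are basic systems that recommend the top items based on a certain metric or score.
The steps involved are: choose a score, compute it for every restaurant, then sort the restaurants by that score and return the top results. The score used here is a weighted rating (WR):
$$ WR = \frac{v}{v+m}R + \frac{m}{v+m}C $$ where $v$ is the number of reviews of the restaurant, $m$ is the minimum number of reviews required to be listed, $R$ is the restaurant's average rating and $C$ is the mean rating across all restaurants.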
# calculate C first
C = data['Rating'].mean()
print('The mean review across all restaurants is ', str(C)[0:5])
# calculate m: the minimum number of reviews a restaurant needs to be included in the chart
# (here the median number of reviews across all restaurants)
m = data['Number of Reviews'].quantile(0.50)
print('The minimum number of reviews required to be listed in the chart is',m)
# get restaurants that have at least m reviews
SR_data = data.copy().loc[data['Number of Reviews'] >= m]
print(str(SR_data.shape[0]) + ' restaurants can be included in the chart')
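To see what the weighted rating does, the short sketch below applies the formula to two made-up restaurants. The numbers (`C = 4.0`, `m = 50` and the two review counts) are assumptions chosen for illustration, not values from the dataset, and `weighted_rating` is a hypothetical stand-alone helper.

```python
def weighted_rating(v, R, m, C):
    """WR = v/(v+m) * R + m/(v+m) * C."""
    return v / (v + m) * R + m / (v + m) * C


# Assumed values, for illustration only.
C = 4.0   # mean rating across all restaurants
m = 50    # minimum number of reviews required

# A restaurant with a perfect rating but only 5 reviews ...
few = weighted_rating(v=5, R=5.0, m=m, C=C)
# ... and one with a slightly lower rating backed by 500 reviews.
many = weighted_rating(v=500, R=4.5, m=m, C=C)

print(round(few, 3), round(many, 3))  # 4.091 4.455
```

With few reviews the score is pulled towards the overall mean `C`, so the well-reviewed restaurant ranks higher even though its raw rating is lower.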
# create a function that calculates the weighted rating for each restaurant
def weighted_review(x, m=m, C=C):
    # v is the number of reviews of a particular restaurant
    v = x['Number of Reviews']
    # R is the average rating
    R = x['Rating']
    # weighted rating
    WR = (v / (v + m) * R) + (m / (v + m) * C)
    # return weighted rating
    return WR
# create a new column of the dataframe called 'score' to store this value
SR_data['score'] = SR_data.apply(weighted_review, axis=1)
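The row-by-row `apply` can be slow on a large table. As a sketch of an alternative, the same score can be computed on whole columns at once. The `toy` frame below is made-up data that only reuses the notebook's column names; the check at the end compares the column-wise result with the row-wise one.

```python
import pandas as pd

# Made-up data that mirrors the notebook's column names.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Rating": [5.0, 4.5, 3.0],
    "Number of Reviews": [5, 500, 50],
})

C = toy["Rating"].mean()
m = toy["Number of Reviews"].quantile(0.50)

# Column-wise version of WR = v/(v+m) * R + m/(v+m) * C.
v = toy["Number of Reviews"]
R = toy["Rating"]
toy["score"] = v / (v + m) * R + m / (v + m) * C

# Row-wise version, as in the notebook, for comparison.
row_wise = toy.apply(
    lambda x: x["Number of Reviews"] / (x["Number of Reviews"] + m) * x["Rating"]
    + m / (x["Number of Reviews"] + m) * C,
    axis=1,
)
print((toy["score"] - row_wise).abs().max() < 1e-12)  # True
```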
# filter restaurants by city and price range, then show the best 10 according to the score
# input the city you want to select
city = input('Insert City (lower case please): ').strip().lower()
# input price range
price_range = input('Insert Price Range: "cheap", "medium", "expensive" or "all" ').strip().lower()
# if the price range is 'all'
if price_range == 'all':
    # only filter the city
    city_data = SR_data.loc[SR_data['City'] == city, :]
else:
    # otherwise filter the city and price range
    city_data = SR_data.loc[(SR_data['City'] == city) & (SR_data['Price Range'] == price_range), :]
# sort restaurants by score
city_data = city_data.sort_values('score', ascending=False)
# show the top 10 rated restaurants in that city and price range
city_data[['Name', 'Cuisine Style', 'Rating', 'Price Range']].head(10)
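The filtering and sorting above can also be wrapped in a function so it can be reused without interactive input. This is only a sketch: `top_restaurants` is a hypothetical helper, and it assumes a frame with the notebook's columns plus a `score` column, with city names already in lower case. The `toy` frame is made-up data.

```python
import pandas as pd


def top_restaurants(df, city, price_range="all", n=10):
    """Return the n highest-scoring restaurants for a city and price range."""
    # City names in the data are assumed to be lower case already.
    mask = df["City"] == city.strip().lower()
    if price_range != "all":
        mask = mask & (df["Price Range"] == price_range)
    cols = ["Name", "Cuisine Style", "Rating", "Price Range"]
    return df.loc[mask].sort_values("score", ascending=False)[cols].head(n)


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "City": ["rome", "rome", "paris"],
    "Cuisine Style": ["['Italian']", "['Pizza']", "['French']"],
    "Rating": [4.0, 4.5, 5.0],
    "Price Range": ["cheap", "medium", "cheap"],
    "score": [4.1, 4.6, 4.9],
})

print(top_restaurants(toy, "Rome")["Name"].tolist())           # ['b', 'a']
print(top_restaurants(toy, "rome", "cheap")["Name"].tolist())  # ['a']
```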
The second system recommends restaurants that are similar to a given restaurant. More specifically, we compute pairwise similarity scores for all restaurants based on their cuisine style and price range, and recommend restaurants based on those scores.
# make a description column by combining the cuisine style and the price range
# make cuisine style and price range columns strings
cols = ['Cuisine Style', 'Price Range']
for col in cols:
    data[col] = data[col].astype(str)
# strip the square brackets from the cuisine style
cuisine = data['Cuisine Style'].str.replace('[', '', regex=False).str.replace(']', '', regex=False)
# attach the price range and store the result in a new column called description
data['description'] = cuisine + ' ' + data['Price Range']
I will use cosine similarity as the similarity measure between each pair of restaurants.
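For two vectors $A$ and $B$, cosine similarity is $$ \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} $$ As a small illustration, the sketch below computes it by hand for three made-up descriptions encoded as word counts. The vocabulary, the vectors and the `cosine` helper are assumptions for the example; the notebook itself works with TF-IDF weights rather than raw counts.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    # By convention, return 0 when either vector is all zeros.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Word counts over the vocabulary [italian, pizza, cheap, sushi] (made up).
trattoria = [1, 1, 1, 0]   # "italian pizza cheap"
pizzeria = [1, 1, 0, 0]    # "italian pizza"
sushi_bar = [0, 0, 1, 1]   # "cheap sushi"

print(round(cosine(trattoria, pizzeria), 3))   # 0.816
print(round(cosine(trattoria, sushi_bar), 3))  # 0.408
print(round(cosine(pizzeria, sushi_bar), 3))   # 0.0
```

Descriptions that share more words point in more similar directions and so score closer to 1.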
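Before declaring the recommendation function, it is necessary to reduce the dataset to the most reviewed restaurants (those in the top 5% by number of reviews), build a TF-IDF matrix from the descriptions, and compute the cosine similarity between every pair of restaurants.
I then declare a function that returns recommendations. It looks up the index of the restaurant that matches the given name, retrieves its similarity scores with all other restaurants, sorts the restaurants by similarity, optionally keeps only those in a given city, and returns the 10 most similar.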
# reduce dataset: keep only restaurants in the top 5% by number of reviews
m = data['Number of Reviews'].quantile(0.95)
CR_data = data.copy().loc[data['Number of Reviews'] >= m]
CR_data = CR_data.reset_index(drop=True)
# create matrix with descriptions
tfidf = TfidfVectorizer(stop_words='english')
CR_data['description']= CR_data['description'].fillna('')
tfidf_matrix = tfidf.fit_transform(CR_data['description'])
# calculate similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
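The code above calls `linear_kernel` (a plain dot product) rather than a dedicated cosine function. This works because `TfidfVectorizer` normalises each row to unit length by default (`norm='l2'`), so the dot product of two rows already equals their cosine similarity. The sketch below checks this on three made-up descriptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions in the same shape as the notebook's description column.
descriptions = [
    "'Italian', 'Pizza' cheap",
    "'Italian', 'Mediterranean' medium",
    "'Japanese', 'Sushi' expensive",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

dot = linear_kernel(toy_matrix, toy_matrix)      # dot products
cos = cosine_similarity(toy_matrix, toy_matrix)  # explicit cosine similarity

print(np.allclose(dot, cos))  # True
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity` would be the one to use.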
def get_recommendations(name, city='all', cosine_sim=cosine_sim):
    # map each restaurant name to its row index, keeping the first row for duplicated names
    indices = pd.Series(CR_data.index, index=CR_data['Name'])
    indices = indices[~indices.index.duplicated(keep='first')]
    # get the index of the restaurant that matches the name
    idx = indices[name]
    # get the pairwise similarity scores of all restaurants with that restaurant
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort the restaurants based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # get the restaurant indices, leaving out the restaurant itself
    res_indices = [i[0] for i in sim_scores if i[0] != idx]
    # get name, city and description of the restaurants
    sim_res = CR_data[['Name', 'City', 'description']].iloc[res_indices]
    # if a city is given, only keep the restaurants from that city
    if city != 'all':
        sim_res = sim_res.loc[sim_res['City'] == city, :]
    # return the top 10 most similar restaurants
    return sim_res.head(10)
# get recommendations for restaurants similar to a given one
name = input('Insert the name of the restaurant (lower case): ')  # e.g. 'restaurant gordon ramsay'
city = input('Insert the city (lower case), or "all": ')
print('If you liked', name, 'then, in', city, 'you could try:')
get_recommendations(name, city=city)
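The lookup by name raises a `KeyError` when the name is misspelled or the restaurant is not in the reduced dataset. One possible way to soften this, sketched below, is to suggest close matches with the standard library's `difflib`. The `suggest_names` helper and the sample names are hypothetical; in the notebook the list of known names would presumably come from `CR_data['Name'].tolist()`.

```python
from difflib import get_close_matches


def suggest_names(query, known_names, n=3, cutoff=0.6):
    """Return the exact name if it exists, otherwise up to n similar names."""
    query = query.strip().lower()
    if query in known_names:
        return [query]
    # get_close_matches ranks candidates by string similarity (best first).
    return get_close_matches(query, known_names, n=n, cutoff=cutoff)


# Sample names, for illustration only.
known = ["restaurant gordon ramsay", "trattoria roma", "sushi corner"]

print(suggest_names("resturant gordon ramsey", known))
print(suggest_names("Trattoria Roma ", known))
```

The first suggestion could then be passed to `get_recommendations`, or shown to the user for confirmation.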