The aim of this project is to perform customer segmentation on a dataset of e-commerce transactions. After dividing customers into segments, linear regression is performed to predict the number of orders placed by each customer depending on a variety of different variables.
The data in question is an open source dataset created by the UCI Machine Learning Repository containing 541,909 E-commerce transactions dated from 01/12/2010 until 09/12/2011 (UCI, 2018).
I have downloaded the dataset from Kaggle.com, a platform for predictive modelling and analytics competitions which stores open source datasets uploaded by companies or users (Kaggle, 2018).
The data set contains transactional data of a UK-based online retailer that sells unique all occasion gifts to wholesales and end users. The data set variabled are explained below:
The project is divided into four main sections:
The data was downloaded and unzipped manually in .csv format in the working directory of the project.
The data set contained characters that were not encoded as UTF-8 therefore, it was opened in a text editor and encoded to UTF-8 before reading it into jupyter notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import pysal as ps
import geopandas as gpd
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
#this line reads .csv files into the script
data = pd.read_csv('data.csv')
Before pre-processing and performing analysis, I explored the data with descriptive statistics. Descriptive statistics is used to describe the basic features of the dataset we want to analyse and, together with data visualization, it allows us to understand the shape and general attributes of our data (Trochim, 2006).
#look at first 10 rows of the dataset
data.head(10)
#describe() returs count, mean, standard deviation, minimun value and maximum value
# of the numerical variables of the dataset
data.describe()
From this first glimpse at the data it is important to note that:
Assigning data to the right type makes sure that, when performing analysis, calculations are consistent, and data is processed in the right way.
To check to what type the variables are assigned to by default, I am using the .info()
function.
#get information about the dataset and what type each variable is assigned to by default
data.info()
From this, I am choosing to format the “InvoiceDate” column to a datetime
variable, and the “Country” column to a categorical variable.
%%time
#make date format consistent
data['InvoiceDate']=pd.to_datetime((data['InvoiceDate']))
#make country a categorical variable
data["Country"] = data["Country"].astype('category')
As the summary statistics suggested, data is missing from the data set. For this reason, I decided to create a function that returns a table with the number of missing observations per variable.
## This function counts and creates a dataframe containing the nuber of missing values per column.
# Declare the name of the fuction
def count_missing(data):
# creates a dataframe containing the sum of all missing values
missing_count = pd.DataFrame(data.isnull().sum(),\
# names column "Count" and sorts values in descending order
columns=['Count']).sort_values(by=['Count'],\
ascending=False)
return missing_count
#apply the function to the data
count_missing(data)
As expected, 13,5080 CustomerIDs are missing from the dataset as well as 1,454 product descriptions.
To deal with missing values, I decided to fill each blank description cell with “Missing”, and to discard all the rows with a missing CustomerID. This decision was made because the aim of this project is to explore the data at customer level, therefore all observations that are not assigned to a particular customer cannot be used for analysis.
# replace missing descriptions with "Missing"
data["Description"] = data["Description"].fillna("Missing")
#delete all rows with a missing "CustomerID"
data.dropna(axis = 0, subset = ['CustomerID'], inplace = True)
#check that there aren't any missing values
count_missing(data)
As explained in the introduction of this report, cancelled orders are the ones whose “InvoiceNo” start with a “C”. This allowed me to create a binomial variable that indicates whether an order was cancelled or not. The new variable “Cancel” takes the value of 1 of the order was cancelled, or the value of 0 if the order was not cancelled.
To do so, I have created a function named cancel()
which assigns the value 0 to all rows, however, if the first character of each row in the column "InvoiceNo" is "C", it assigns the value of 1
%%time
#create a function that will assign 0 to all variables unless the first character of 'InvoiceNo' in 'C'
def cancel(row):
value = 0
if row['InvoiceNo'][0] == 'C':
value = 1
return value
# Create a new column 'Cancel' to attach to 'data' and set it to the value returned
#by the function cancel().
# The code 'axis=1' makes the apply function process the dataset by row,
#as opposed to by column which is the default option.
data['Cancel'] = data.apply(cancel, axis=1)
# the %%time code returns how much time is needed to run this function.
#check that there are some cancelled orders and that the new column is in the data
data[data['Cancel']== 1].head(3)
To better understand the frequency of cancelled orders, I calculated the probability of an order to be cancelled, and visualised how many orders were cancelled in the past year with a vertical bar graph.
#check how many cancelled VS non-cancelled orders there are.
canceled = data['Cancel'].value_counts()
# rename the axis so that they show in the picture
canceled = canceled.rename({0: 'Non-canceled', 1: 'Canceled'})
# bar graph of canceled vs non-canceled orders
canceled.plot.barh();
#calculate the probability of an order to be cancelled.
pco = (8905/(397924+8905))
pco
During the past year, 8905 orders were cancelled, this are 2.2% of all orders placed.
To continue, I have decided to calculate the total spending of each transaction by multiplying “Quantity” by “UnitPrice”. I then attached the result to a new variable called “Tra_Spending”.
data['Tra_Spending']= (data['Quantity'])*(data['UnitPrice'])
data.head(3)
The next step is to group all observation by customerID. This will make it easier to explore customers characteristics. While doing so, I created a variable called "days_as_customer" by taking the difference in days between the date of a customer's last order, and the date of their first order.
#group dataset by cystomerID to start creating customer profiles.
#The following code groups the data by customerID and sum quantity, transaction spending, and number of cancelled orders
aggQ=data.groupby('CustomerID')['Quantity','Tra_Spending', 'Cancel'].sum()
aggQ.head(3)
#group by custmer and count unique invoiceNo to see how many oder the customer has placed in the time the data spans.
Inv = data.groupby('CustomerID')['InvoiceNo'].count()
Inv.head(3)
#### Create a varible that shows how much time a person has been a customer.
#group by customerID and select the mininum date (i.e. the date at which the first order was placed)
#as_index=false means that customer ID is a column and not the index
days= data.groupby('CustomerID', as_index=False)['InvoiceDate'].min()
#edit names of columns
days.columns = ['CustomerID', 'Earliest_Invoice']
#create a column with the date of the latest order placed
days['Latest_Invoice'] = pd.to_datetime((data['InvoiceDate']).max())
days.set_index('CustomerID')
days.head(3)
#calculate how many days the person has been a customer by subtracting the date of the latest invoice
# to the date of the first invoice
days['days_as_customer'] = 1 + (days.Latest_Invoice-days.Earliest_Invoice).astype('timedelta64[D]')
#delete earliest invoice and laterst invoice columns
days.drop(['Earliest_Invoice', 'Latest_Invoice'], axis=1, inplace=True)
days.head()
#grouped data by customerID and pasted the value of 'Country'
country = data.groupby('CustomerID')['Country'].first()
country.head(3)
#group by 'customerID' and average unitprice to see if the person buys expensive items or not
meanex = data.groupby('CustomerID')['UnitPrice'].mean()
meanex.head(3)
# create list of datafames to join that have the same index
to_join=[Inv, country, meanex]
# use join() to join aggQ with the three dataframes listed above
joined = aggQ.join(to_join)
joined = joined.join(days.set_index('CustomerID'))
# edit column names
joined.columns = ['total_quantity', 'total_spending',\
'total_cancelled_orders', 'total_orders_placed',\
'country_of_origin', 'average_price_of_items', 'days_as_customer']
#delete all negative values
joined = joined.drop(joined[(joined.total_quantity < 0)].index)
joined = joined.drop(joined[(joined.total_spending < 0)].index)
The new dataset contains the following variables:
# look at first rows of data set
joined.head()
# look at the shape of the dataset joined.
# this gives us the number of customers in the dataset because its rows represent a customer.
joined.shape
# return the top 5 countries from which customers are from by
# counting the frequency of each country of origin and plotting a horizontal bar graph
joined['country_of_origin'].value_counts()[:5].plot(kind='barh')
In the period between 01/12/2010 and 09/12/2011 the website served 4,372 customers. Additionally, even though the website sells internationally, the majority of their customers placed orders from the United Kingdom. This could be due to high international shipping fees or poor marketing presence in foreign countries. International expansion could be an opportunity of profit for the company.
#check how many unique order numbers are in the original dataset "data"
data['InvoiceNo'].unique().shape
#check from how many countries orders are placed from
# select country column unique values and return the shape of the table.
joined['country_of_origin'].unique().shape
#code returns the first 5 coutries from which orders are placed from.
data['Country'].value_counts()[:5].plot(kind='barh')
Similarly to the customer-level summaries, from these simple descriptive statistics we understand that:
%%time
# make a copy of the dataset
data_1 = data
# split date variable in "year", "month", "day", "hour", "minute", "week" to make it easier to plot
data_1['Date'] = data['InvoiceDate'].dt.date
data_1['Day'] = data['InvoiceDate'].dt.day
data_1['Month'] = data['InvoiceDate'].dt.month
data_1['Year'] = data['InvoiceDate'].dt.year
data_1['Hour'] = data['InvoiceDate'].dt.hour
data_1['Week'] = data['InvoiceDate'].dt.week
data_1['Minute'] = data['InvoiceDate'].dt.minute
#set up graph
fig, ax = plt.subplots(2,1, figsize=(9, 9))
### create a dataframe "sales" which is grouped by date and shows the sum of all transaction spending in one month.
## subset data_1 to include year, month and transaction spending.
## group the subset by year and month and sum transaction spending.
sales = data_1[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()
#create a variable called 'Date' which contains year, month and day
#set day to be one to stand for the start of the month.
sales['Day'] = 1
#convert 'Date' to pd datetime object
sales['Date'] = pd.to_datetime(sales[['Year', 'Month', 'Day']])
#set it as the index of the dataframe
sales = sales.set_index('Date')
#delete year, month and day columns
sales = sales.drop(['Day', 'Month', 'Year'], axis=1)
#create a plot of sales.
ax[0].plot(sales)
#set the title to 'Monthly revenue'
ax[0].set_title('Monthly Revenue')
### Create a dataframe "top_country"
## subset data_1 to contain transaction spending and country
# group by country and sum the transaction spending
top_country = data_1[['Tra_Spending', 'Country']]\
.groupby(['Country']).sum().reset_index().sort_values\
(by='Tra_Spending', ascending=False)['Country'][1:5]
#crate a loop that itarates for each country in the dataframe top_country
for c in top_country:
sales = data_1[data_1['Country'] == c]
#subgroup sales and group by year and month.
sales = sales[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()
#set day to 1
sales['Day'] = 1
#create date with day, month and year
sales['Date'] = pd.to_datetime(sales[['Day', 'Month', 'Year']])
#set index of sales to be 'Date'
sales = sales.set_index('Date')
#delete columns year, month, and day
sales = sales.drop(['Year', 'Month', 'Day'], axis=1)
#plot
ax[1].plot(sales.cumsum(), label=c)
#this line adds a legend to the graph
ax[1].legend()
#this line sets the title of the graph
ax[1].set_title('Top countries by revenue')
plt.show()
The Monthly Revenue graph show an upward trend between December 2010 and November 2011. The graph then shows a drop in sales in December 2011 however this is given by the fact that the total revenue of the month of December 2011 includes only the revenue of the first 9 days of the month.
The second graph shows the revenue by countries. The graph shows that the country that brings the highest monthly revenue is the Netherlands, followed by the republic of Ireland, Germany, and France. This is interesting because even though the vast majority of orders are placed from the United Kingdom, the latter is missing from the list of top 4 countries that bring the most revenue.
To explore the relationships betweeen the variables of the dataset I ran the sns.pairplot()
function which plot pairwise relationships in a dataset.
By default, sns.pairplot()
only plot numerical columns, however I have used the categorical variable "country_of_origin" to color code the plots. The fuction generates two basic figures, histograms and scatter plots. The histograms on the diagonal show us the distribution of a single variable, while the scatter plots show us the relationship between a pair of variables (Koehrsen, 2018).
In this case, the pairplot is showing us that the variable "days_as_customer" is the only containing varied values, and that the most visible positive relationship between a pair of variables is between "total_quantity" and "total_spending". In fact, as expected, as the quantity purchased increases, the total spending also increases. Moreover, the colors of the plot, show us that the majority of customers are from the UK.
#pairplot gives you a scattered plot of all pairs of variables.
# hue = country of origin colors each observation differently depending
# on their country of origin
sns.pairplot(joined, kind='scatter', hue= 'country_of_origin');
Since we are comparing variables measured with different units, for example, "total_quantity" is expressed in number of orders while "total_spending" is expressed in GBP, I have decided to scale the dataset, and run sns.pairplot()
on the scaled dataset to see if any difference can be observed.
Scaling or standardising a dataset means subtracting the mean and dividing by the standard deviation each observation. This gives all variables the properties of a standard normal distribution with μ=0 and σ=1 (Raschka,2014).
# scale dataset
#select numerical variables to scale
num_cols= ['total_quantity', 'total_spending',\
'total_cancelled_orders', 'total_orders_placed',\
'average_price_of_items', 'days_as_customer']
#subset dataset to contain only numerical variables
joined_toscale = joined[num_cols]
#scale dataset
joined_scaled = pd.DataFrame(preprocessing.scale(joined_toscale),
index=joined_toscale.index,
columns=joined_toscale.columns)
#add country of origin variable for coloring of pairplot
joined_scaled["country_of_origin"]= joined["country_of_origin"]
joined_scaled.head()
# new pairplot
sns.pairplot(joined_scaled, kind='scatter', hue= 'country_of_origin');
No change can be noticed in the pairplots from scaling the data.
Since the aim of the project is to find segments of customers based on their shopping habits, k-means cluster analysis was chosen as a method of analysis.
Clustering is the process of grouping observations into classes so that objects within one class are highly similar and are highly dissimilar to objects in another class (Ye et. al, 2013). Cluster-based approaches have been widely used since the 1970s to perform market segmentation. Their popularity led to the development of noumerous clustering algorithms. Among them, the k-means algorithm is one of the most widely used in marketing research to divide objects into classes due to its ability of processing larger amounts of data (Chen et. al, 2004).
The k-means algorithm works as follows:
Since observations are assigned to a cluster depending on their euclidean distance to a cluster centroid, categorcal values, such as "country_of_origin" cannot be processed with the k-means clustering algorithm. For this reason, I am creating a dataset that does not contain "country_of_origin" as a variable and on which I can perform k-means clustering. Moreove, as it is not ideal for the k-means algorithm to depend on an arbitrary variable unit, the data used for clustering will also be scaled following the same procedure explained above.
# create cluster data with that contains numeric value
cluster_data_scaled = joined_scaled.loc[:, joined_scaled.columns != 'country_of_origin']
cluster_data_scaled.head(3)
Scaling the data, makes sure that all variables have the same standard deviation.
# check the mean and standard deviation of all variables.
cluster_data_scaled.describe().loc[['mean', 'std'], :]
#now all variables have the same standard deviation.
First step of the k-means clustering algorithm is deciding the number of clusters we want the observations to be divided into. There are several methods that help in deciding which "k" values is most suited to the data.
One method that determines an appropriate "k" value, is the "Elbow Method". The elbow method works by running k-means clustering for a range of values of "k" (from 1 to 10), and for each value of "k", it calculates the sum of squared errors (SSE).
The method then plots SSEs for each value of "k". From the plot we want to select the smallest value of "k" that still has a small SSE (Gove, 2017).
# ELBOW METHOD #
# Data to run in the loop
X = cluster_data_scaled
# create empty list to store SSEs for different k means clustering
SSE = []
# for loop runs on range between 2 and 10
for k in range(2, 10):
# set up k means for each k between 2 and 10
kmeans = cluster.KMeans(n_clusters=k)
# fit k means in data
kmeans.fit(X)
# append SSE values into empty list
SSE.append(kmeans.inertia_)
# set up elbow graph
fig = plt.figure(figsize=(7, 5))
# plot values
plt.plot(range(2, 10), SSE)
# add grid
plt.grid(True)
# add title
plt.title('Elbow curve')
In this case, I am selecting k = 6
# set up clustering analysis with k=6
k6 = cluster.KMeans(n_clusters=6)
k6
# fit clustering model on dataset created for clustering
np.random.seed(1234)
k6cl = k6.fit(cluster_data_scaled)
#check labels
labels=k6cl.labels_
labels
# count how many observations each cluster has.
labels_tb = pd.DataFrame(joined.groupby(k6cl.labels_).size())
#rename column
labels_tb.rename(columns={0:'No Customers per Cluster'}, inplace=True)
labels_tb
#create table with mean values per cluster
table_of_means=joined.groupby(k6cl.labels_).mean()
#add average spend per order by dividind average total spending by average n of orders placed
table_of_means['average_spend_order']=\
table_of_means['total_spending']/table_of_means['total_orders_placed']
# look at mean of the clusters
# .T transpose the table to make it easier to look at
table_of_means.T
K-means cluster analysis divided the 4,328 observations contained in the dataset into 6 clusters. From the labels table it is possible to observe that the majority of observations are contained in cluster 1 and 4, while the rest are mainly contained in cluster 0.
Looking at the mean of each cluster, I am dividing customers into three main segments.
This section is dedicated to using multiple linear regression to predict the nubmer of orders placed by each customer from a variety of different predictors.
First, I created a baseline regression model. Second, I improved the the baseline model by scaling the predictive variables and by adding a categorical predictor. Third, I used the random forest algorithm to further improve the model. Lastly, I assess the peformance of each regression model with and without cross-validation.
Linear regression aims at finding linear relationships between two or more continuous variables. A multiple linear regression model can be summarised by the following equation:
where:
(Swaminathan, 2018)
# recall data for regression
joined.head()
My baseline model aims at predicting the number of total orders placed by a customer from:
#regression model
m1_baseline = sm.ols('total_orders_placed ~ total_cancelled_orders\
+ average_price_of_items + total_quantity',\
joined).fit()
m1_baseline.summary()
From the summary of the baseline model I can infer that:
The first improvement I want to implement to the baseline model, is to scale the predictive variables. This step is advisable becasue the three variables are expressed in three different units (i.e. number of orders, GBP and number of items). Scaling the variables makes sure that they are all centred around 0 and will have the same standard deviation.
# Scale the predicting variables because they are on different scales
# (the first one is in orders, and the second one is in pounds.)
# create a list that includes the columns to scale
cols = ['total_cancelled_orders', 'average_price_of_items', 'total_quantity']
#scale columns by subtracting mean and dividing by standard deviation
scaled_for_reg = pd.DataFrame(scale(joined[cols]),
index=joined.index,
columns=cols)
#add column with variable to predict
scaled_for_reg['total_orders_placed'] = joined['total_orders_placed']
#this is the second model with the scaled columns.
m2_scaled = sm.ols('total_orders_placed ~ total_cancelled_orders\
+ average_price_of_items + total_quantity',\
scaled_for_reg).fit()
m2_scaled.summary()
From the summary of the scaled, improved model I can infer that:
Since the aim of this regression model is to predict future number of orders places by a potential customer, it is important to look at the distribution of actual "total_orders_placed" VS the distribution of "total_orders_placed" predicted by the baseline model, and by the scaled baseline model.
To do so, I am using Seaborn's kdeplot()
fuction which gives a smooth estimate of the distribution using kernel density estimation (VanderPlas, 2016). Using kdeplot()
I plotted the distribution of actual values, the distribution of the values predicted by the baseline model, and the distribution of the values predicted by the scaled model. Finally, I have plotted the three curves in the same graph to assess if the regression models are "picking up" the overall shape of the data.
#Distribution of actual number of oders
sns.kdeplot(joined['total_orders_placed'], shade=True, color = 'black')
#distribution of predicted numbers of orders
sns.kdeplot(m1_baseline.fittedvalues, shade=True, color='red')
#distribution of predicted numbers of orders second prediction
sns.kdeplot(m2_scaled.fittedvalues, shade=True, color='green')
#make put all in the same graph
f, ax = plt.subplots(1, figsize=(10, 6))
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax, label='actual', color= 'black')
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, label='scaled', color = 'orange')
sns.kdeplot(m1_baseline.fittedvalues, shade=True, ax=ax, label='baseline', color = 'red')
plt.show()
From the graph above, it is possible to understand that scaled and baseline model's fitted values (red and orange graphs) have the same distribution and that they one captures the overall shape of the actual values (black graph), however the values are overestimated by the regression models.
To get a better distribution match, I am going to further improve the model by adding a categorical variable as a predictor in the new model.
Categorical variables can be added to regression models as dummy variables. Python's sm.ols()
automatically recognise string variables and treats them as categorical.
For this project, I decided to create a variable "continent_of_origin" and use it as a predictor in the regression model. Since the majority of customers are based in europe, I have decided to sub-divide european countries according to european area (i.e. north, south, east and west europe). The classification of the countries was made according to the United Nation's official geographic regions composition (UN, n.d.).
# add "country_of_origin" variable to scaled dataset
scaled_for_reg['country_of_origin'] = joined['country_of_origin']
#create lists of continents.
north_europe=['Iceland','Norway','Denmark','Sweden','Finland','Lithuania',\
'EIRE','Channel Islands','United Kingdom']
east_europe=['Poland','Czech Republic']
south_europe=['Greece','Italy', 'Malta','Spain','Portugal',]
west_europe=['Switzerland', 'Austria','Belgium', 'Netherlands']
asia=['Cyprus','Bahrain','Israel','Saudi Arabia','United Arab Emirates','Singapore',\
'Japan','Lebanon']
australia=['Australia']
africa=['RSA']
america=['USA','Brazil','Canada']
# Code country of origin by continent
# Create a function that will assign "unspecified"to all variables unless
# The content of the row is in the list declared above.
# Declare function
def continent(row):
#assing all rows to unspecified
value = "unspecified"
# but, if country is contained in the list north_europe,
if row['country_of_origin'] in north_europe:
# give value europe and so on...
value = 'north europe'
if row['country_of_origin'] in south_europe:
value = 'south europe'
if row['country_of_origin'] in east_europe:
value = 'east europe'
if row['country_of_origin'] in west_europe:
value = 'west europe'
if row['country_of_origin'] in asia:
value ='asia'
if row['country_of_origin'] in australia:
value='australia'
if row['country_of_origin'] in africa:
value = 'africa'
if row['country_of_origin'] in america:
value= 'america'
#at the end, return value
return value
# Create a new column 'continent_of_origin' and apply the function.
scaled_for_reg['continent_of_origin'] = scaled_for_reg.apply(continent, axis=1)
scaled_for_reg.head()
# create new model
m3_improved = sm.ols('total_orders_placed ~ total_cancelled_orders\
+ average_price_of_items + continent_of_origin + total_quantity', scaled_for_reg ).fit()
# look at model summary
m3_improved.summary()
From the summary of the model containing country of origin as a predictor, I can infer that:
# graph baseline, scaled and categorical model
f, ax = plt.subplots(1, figsize=(15, 5))
# first plot
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax,\
label='actual', color = 'black')
# second plot
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, \
label='scaled', color= 'orange')
# third plot
sns.kdeplot(m3_improved.fittedvalues, shade=True, ax=ax,\
label='with categorical', color='green')
plt.show()
The latest model, does not seem to improve the prediction, as the distribution curve has remained unchanged from the one of the scaled model.
Next, I will explore the models' performance, to have a better idea of their predictive power.
To assess the model performance, I will compare the R squared, Mean Square Error (MSE), and Mean Absolute Error (MAE) of the baseline model with the measures of the latest model.
The MAE measures the average error in absolute terms of a set of forecasts. The MSE measures the average of the squares of the error in a set of forecasts. Along with R squared, MAE and MSE are generally used as a measure of model performance.
#R^2
r2 = pd.Series({'Baseline Model': metrics.r2_score(joined['total_orders_placed'],
m1_baseline.fittedvalues),
'Latest Model': metrics.r2_score(scaled_for_reg['total_orders_placed'],
m3_improved.fittedvalues)})
r2
# MSE
mse = pd.Series({'Baseline Model': metrics.mean_squared_error(joined['total_orders_placed'],
m1_baseline.fittedvalues),
'Latest Model': metrics.mean_squared_error(scaled_for_reg['total_orders_placed'],
m3_improved.fittedvalues)})
# MAE
# taking the MAE between actual values and the values forecasted by the baseline model
mae = pd.Series({'Baseline Model': metrics.mean_absolute_error(joined['total_orders_placed'],
m1_baseline.fittedvalues),
#taking the MAE between actual values and the values forecasted by m3 model
'Latest Model': metrics.mean_absolute_error(scaled_for_reg['total_orders_placed'],
m3_improved.fittedvalues)})
perf = pd.DataFrame({'MAE': mae,
'MSE': mse,
'R^2': r2})
perf
From the comparison of MAE, MSE and R squared, It is important to note that R squared and MSE slightly improved from the baseline model and latest model, while MAE resulted higher in the latest model.
Since the aim of the regression model is to make predictions, it is appropriate to assess the validity of the model with cross-validation. Cross-validation is a validation technique for assessing how a model will perfom on a different and independent data set. The first step to cross-validation is to split the data set into train set and test set. The forecasting model is then trained on the train set and consequently validated on the test set. (Seni and Elder, 2010).
# Split dataset into train (80%) and test set (20%)
# names of predictive variables
cols = ['total_cancelled_orders', 'average_price_of_items',\
'total_quantity']
# create test and train sets
x_train, x_test, y_train, y_test =\
model_selection.train_test_split(scaled_for_reg[cols],
joined['total_orders_placed'],
test_size=0.8)
# Train data on train set
# f is a variable that includes the regression equation
f = 'total_orders_placed ~ total_cancelled_orders + average_price_of_items + total_quantity'
# this is the regression model
m1_train = sm.ols(f, x_train.assign(total_orders_placed=y_train)).fit()
# create data frame containin the parameters of full set and of train set
pd.DataFrame({'Full Dataset': m1_baseline.params,
'Train Set': m1_train.params})
# explore r quared of full data set and of train set
pd.Series({'Full Dataset': m1_baseline.rsquared,
'Train Set': m1_train.rsquared})
# make predictions on test set.
test_prediction = m1_train.params['Intercept'] + \
x_test.dot(m1_train.params.drop('Intercept'))
#create a series with the r squared of baseline, train set data and test set data
pd.Series({'0-Full Dataset': m1_baseline.rsquared,
'1-Train Set': m1_train.rsquared,
'2-Test Set': metrics.r2_score(y_test,
test_prediction)})
From the comparisons of R Squared values of the full data, train data set and test data, it is possible to note that R squared increases from the full data to the train data, but it decreases, and becomes negative, for the test data.
This is an example of "overfitting". Overfitting happens when a model describes the random error in the data rather than the relationships between variables (Frost, 2017). This means that the model created is a good fit to make preditions in the specific dataset used for analysis (R squared is 0.43), however it has no forecasting power over any other independent data set (R squared is -0.37).
The Random Forest is a machine learning algorithm that is used on classification or regression problems. It is one of the most used algorithms because of its semplicity and effectiveness.
# run random forest regression. we can fit and predict in the same line
m4_RandomF = RandomForestRegressor().fit(joined[cols], joined['total_orders_placed'])\
.predict(joined[cols])
# examine MAE MSE and R suqared of FR
rf = pd.Series({'R^2': metrics.r2_score(joined['total_orders_placed'], m4_RandomF),
'MSE': metrics.mean_squared_error(joined['total_orders_placed'], m4_RandomF),
'MAE': metrics.mean_absolute_error(joined['total_orders_placed'], m4_RandomF)})
# add RF performance indicators to the table of baseline and m3 model
pd.concat([perf,
pd.DataFrame({'Random Forest': rf}).T])
From the table above, it is important to note that the Random Forest regressor performs considerably better than the baseline model and the latest model with scaled variables and a categorical predictor.
Its MAE went from 68 to 22, MSE went from 30,000 to 6,000 and finally, its R squared value is 0.90, meaning that the model picks up almost 90% of the variation of the predicted variable.
Just as I did for the previous model, I am going to assess the performace of the Random Forest model with cross-validation to see if this model can be used to make predictions.
# previously the model was fitted on the same dataset where the predictions were made.
# in this case, we are making the predictions on the text set
# random forest cross validation.
m4_cv = RandomForestRegressor().fit(x_train, y_train).predict(x_test)
# get MSE, MAE e R2 for cross validated RF
rf_cv = pd.Series({'R^2': metrics.r2_score(y_test, m4_cv),
'MSE': metrics.mean_squared_error(y_test, m4_cv),
'MAE': metrics.mean_absolute_error(y_test, m4_cv)})
pd.concat([perf,
pd.DataFrame({'RF': rf}).T,
pd.DataFrame({'RF-CV': rf_cv}).T])
As expected, the cross-validated Random Forest regressor performance are much worst compared to the non-cross-validated performance measures. This once again, is a sign of overfitting and suggests that the regression model has little predictive power over an independent dataset.
In conlusion, this project was carried out using a dataset containing 541,909 data points corresponding to e-commerce transactions of an online UK retailer. The aim was to segment customers and to create a regression model that could forecast the number of orders placed by each customer.
To carry out the project, I first pre-processed the data by inspecting it, transforming existing variables, creating new variables, dealing with missing values, grouping the data at customer level and finally performing descriptive statistics and data visualization. The second step was then to segment customers into categories, to do so, I have used k-means cluster analysis. From this, I was able to divide customers into three main classes:
After performing clustering analysis, I have modeled a multiple linear regression with the aim of forecasting the number of orders placed by each customers. I first create a baseline model, which was then improved by scaling the predictive variables and by adding a categorical variable as predictor. Model performance was assessed with and without cross-validation and this showed strong signs of overfitting. To overcome this problem and to improve the predictive power of the model, I have used the Random Forest algorithm and assessed its performance with and without cross-validation. Once again, the performance indicators suggested that the regression model has little predictive power over an independent dataset.
To solve this problem, for future research it is advisable to join the current dataset with information about the demographics and shopping habits of each customer. This will add valuable information that can be used by the clustering algorithm to further segment customers and by the regression model to spot meaningful relationships within the variables in the data.
Archive.ics.uci.edu. (2015). UCI Machine Learning Repository: Online Retail Data Set. [online] Available at: http://archive.ics.uci.edu/ml/datasets/online+retail# [Accessed 14 May 2018].
Chen, J., Ching, R. and Lin, Y. (2004). An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society, 55(9), pp.976-987.
Frost, J. (2017). Overfitting Regression Models: Problems, Detection, and Avoidance - Statistics By Jim. [online] Statistics By Jim. Available at: http://statisticsbyjim.com/regression/overfitting-regression-models/ [Accessed 14 May 2018].
Gove, A. (2017). Using the elbow method to determine the optimal number of clusters for k-means clustering. [online] Bl.ocks.org. Available at: https://bl.ocks.org/rpgove/0060ff3b656618e9136b [Accessed 14 May 2018].
Kaggle.com. (2018). E-Commerce Data | Kaggle. [online] Available at: https://www.kaggle.com/carrie1/ecommerce-data [Accessed 14 May 2018]. Trochim 2006 (https://www.socialresearchmethods.net/kb/statdesc.htm)
Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166 [Accessed 14 May 2018].
Raschka, S. (2014). About Feature Scaling and Normalization. [online] Dr. Sebastian Raschka. Available at: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html [Accessed 14 May 2018].
Seni, G. and Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), pp.1-126.
Swaminathan, S. (2018). Linear Regression — Detailed View – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86 [Accessed 14 May 2018].
UN (2018). UNSD — Methodology. [online] Unstats.un.org. Available at: https://unstats.un.org/unsd/methodology/m49/ [Accessed 14 May 2018].
VanderPlas, J. (2016). Python data science handbook. 1st ed.
Ye, L., Qiuru, C., Haixu, X., Yijun, L. and Guangping, Z. (2013). Customer Segmentation for Telecom with the k-means Clustering Method. Information Technology Journal, 12(3), pp.409-413.