Customer Segmentation

1.0 Introduction

The aim of this project is to perform customer segmentation on a dataset of e-commerce transactions. After dividing customers into segments, linear regression is performed to predict the number of orders placed by each customer depending on a variety of different variables.

The data in question is an open source dataset created by the UCI Machine Learning Repository containing 541,909 E-commerce transactions dated from 01/12/2010 until 09/12/2011 (UCI, 2018).

I have downloaded the dataset from Kaggle.com, a platform for predictive modelling and analytics competitions which stores open source datasets uploaded by companies or users (Kaggle, 2018).

The data set contains transactional data of a UK-based online retailer that sells unique all-occasion gifts to wholesalers and end users. The data set variables are explained below:

  • InvoiceNo: Nominal. Transaction unique identifier. If this code starts with the letter 'C', the order was cancelled
  • StockCode: Nominal. Product unique identifier
  • Description: Nominal. Product name
  • Quantity: Numeric. The quantity of each product per transaction.
  • InvoiceDate: Numeric. The day and time when each transaction was generated.
  • UnitPrice: Numeric. Product price per unit in sterling.
  • CustomerID: Nominal. A 5-digit integral number uniquely assigned to each customer.
  • Country: Nominal. The name of the country where each customer resides.

The project is divided into five main sections:

  1. Introduction
  2. Data pre-processing
  3. Data visualization
  4. Data analysis and discussion of results
  5. Conclusion and recommendations for future work

1.1 Download Data

The data was downloaded and unzipped manually in .csv format in the working directory of the project.

The data set contained characters that were not encoded as UTF-8; therefore, it was opened in a text editor and re-encoded to UTF-8 before being read into the Jupyter notebook.
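
As an alternative to manually re-saving the file, pandas can usually read the original encoding directly; a minimal sketch, assuming the raw file is Latin-1 (ISO-8859-1) encoded:

import pandas as pd

# read the raw .csv by declaring its encoding instead of re-saving it as UTF-8
data = pd.read_csv('data.csv', encoding='ISO-8859-1')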

1.2 Import Python Libraries

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import pysal as ps
import geopandas as gpd
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm

1.3 Read Data on Jupyter Notebook

In [3]:
#this line reads .csv files into the script
data = pd.read_csv('data.csv')

1.4 Inspect the data

Before pre-processing and performing analysis, I explored the data with descriptive statistics. Descriptive statistics describe the basic features of the dataset we want to analyse and, together with data visualization, allow us to understand the shape and general attributes of our data (Trochim, 2006).

In [4]:
#look at first 10 rows of the dataset 
data.head(10)
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12/1/2010 8:26 7.65 17850.0 United Kingdom
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 12/1/2010 8:26 4.25 17850.0 United Kingdom
7 536366 22633 HAND WARMER UNION JACK 6 12/1/2010 8:28 1.85 17850.0 United Kingdom
8 536366 22632 HAND WARMER RED POLKA DOT 6 12/1/2010 8:28 1.85 17850.0 United Kingdom
9 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 12/1/2010 8:34 1.69 13047.0 United Kingdom

1.4.1 Summary Statistics

In [5]:
#describe() returns count, mean, standard deviation, minimum value and maximum value
# of the numerical variables of the dataset
data.describe()
Out[5]:
Quantity UnitPrice CustomerID
count 541909.000000 541909.000000 406829.000000
mean 9.552250 4.611114 15287.690570
std 218.081158 96.759853 1713.600303
min -80995.000000 -11062.060000 12346.000000
25% 1.000000 1.250000 13953.000000
50% 3.000000 2.080000 15152.000000
75% 10.000000 4.130000 16791.000000
max 80995.000000 38970.000000 18287.000000

From this first glimpse at the data it is important to note that:

  • roughly 135,000 transactions are missing a CustomerID (541,909 rows vs 406,829 non-null CustomerIDs)
  • there are negative values of unit price and quantity of products purchased
  • on average, each transaction line contains 9.5 units at a unit price of £4.61

2.0 Data Pre-Processing

2.1 Assign data to the right type

Assigning data to the right type makes sure that, when performing analysis, calculations are consistent, and data is processed in the right way.

To check what type each variable is assigned by default, I am using the .info() function.

In [6]:
#get information about the dataset and what type each variable is assigned to by default
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

From this, I am choosing to format the “InvoiceDate” column to a datetime variable, and the “Country” column to a categorical variable.

In [7]:
%%time
#make date format consistent 
data['InvoiceDate']=pd.to_datetime((data['InvoiceDate']))
CPU times: user 1min 40s, sys: 98 ms, total: 1min 40s
Wall time: 1min 40s
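
As a side note, pd.to_datetime is usually much faster when it is given the exact format of the raw dates (here, strings such as '12/1/2010 8:26'); a sketch of an equivalent, faster version of the conversion above:

# equivalent conversion with an explicit format string, which avoids per-row format inference
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format='%m/%d/%Y %H:%M')
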
In [8]:
#make country a categorical variable
data["Country"] = data["Country"].astype('category')

2.2 Missing Values

As the summary statistics suggested, data is missing from the data set. For this reason, I decided to create a function that returns a table with the number of missing observations per variable.

In [9]:
## This function counts and creates a dataframe containing the number of missing values per column.

# Declare the name of the function
def count_missing(data):
    # creates a dataframe containing the sum of all missing values
    missing_count = pd.DataFrame(data.isnull().sum(),\
                                 # names column "Count" and sorts values in descending order
                                 columns=['Count']).sort_values(by=['Count'],\
                                                                ascending=False)
    return missing_count

#apply the function to the data 
count_missing(data)
Out[9]:
Count
CustomerID 135080
Description 1454
InvoiceNo 0
StockCode 0
Quantity 0
InvoiceDate 0
UnitPrice 0
Country 0

As expected, 135,080 CustomerIDs are missing from the dataset, as well as 1,454 product descriptions.

2.2.1 Dealing With Missing Values

To deal with missing values, I decided to fill each blank description cell with “Missing”, and to discard all the rows with a missing CustomerID. This decision was made because the aim of this project is to explore the data at customer level, therefore all observations that are not assigned to a particular customer cannot be used for analysis.

In [10]:
# replace missing descriptions with "Missing"
data["Description"] = data["Description"].fillna("Missing")
In [11]:
#delete all rows with a missing "CustomerID"
data.dropna(axis = 0, subset = ['CustomerID'], inplace = True)

#check that there aren't any missing values
count_missing(data)
Out[11]:
Count
InvoiceNo 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0

2.3 Create New Variables

2.3.1 Canceled Orders

Create a new variable 'Cancel'

As explained in the introduction of this report, cancelled orders are the ones whose “InvoiceNo” starts with a “C”. This allowed me to create a binomial variable that indicates whether an order was cancelled or not. The new variable “Cancel” takes the value of 1 if the order was cancelled, or 0 if it was not.

To do so, I have created a function named cancel() which assigns the value 0 to every row unless the first character of the "InvoiceNo" column is "C", in which case it assigns the value 1.

In [12]:
%%time
#create a function that will assign 0 to all rows unless the first character of 'InvoiceNo' is 'C'
def cancel(row):
    value = 0
    if row['InvoiceNo'][0] == 'C':
        value = 1
    return value

# Create a new column 'Cancel' to attach to 'data' and set it to the value returned 
  #by the function cancel().
    
# The code 'axis=1' makes the apply function process the dataset by row, 
  #as opposed to by column which is the default option.
data['Cancel'] = data.apply(cancel, axis=1)

# the %%time code returns how much time is needed to run this function. 
CPU times: user 30.8 s, sys: 122 ms, total: 30.9 s
Wall time: 30.9 s
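
For reference, a vectorized alternative to the row-by-row apply() above (a sketch) produces the same flag and typically runs in a fraction of the time:

# same result without apply(): flag rows whose invoice number starts with 'C'
data['Cancel'] = data['InvoiceNo'].astype(str).str.startswith('C').astype(int)
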
In [13]:
#check that there are some cancelled orders and that the new column is in the data
data[data['Cancel']== 1].head(3)
Out[13]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Cancel
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom 1
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0 United Kingdom 1
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0 United Kingdom 1

To better understand the frequency of cancelled orders, I calculated the probability of an order being cancelled, and visualised how many orders were cancelled in the past year with a horizontal bar graph.

In [25]:
#check how many cancelled VS non-cancelled orders there are. 
canceled = data['Cancel'].value_counts()

# rename the axis so that they show in the picture
canceled = canceled.rename({0: 'Non-canceled', 1: 'Canceled'})
In [24]:
# bar graph of canceled vs non-canceled orders 
canceled.plot.barh();
In [86]:
#calculate the probability of an order to be cancelled.
pco = (8905/(397924+8905))
pco
Out[86]:
0.021888803403887137

During the past year, 8,905 orders were cancelled, which is 2.2% of all orders placed.
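
The same proportion can also be computed directly from the 'Cancel' flag instead of hard-coding the counts; a sketch:

# the mean of a 0/1 flag equals the proportion of cancelled order lines
data['Cancel'].mean()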

2.3.2 Transaction Spending

To continue, I have decided to calculate the total spending of each transaction by multiplying “Quantity” by “UnitPrice”. I then attached the result to a new variable called “Tra_Spending”.

In [87]:
data['Tra_Spending']= (data['Quantity'])*(data['UnitPrice'])
data.head(3)
Out[87]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Cancel Tra_Spending
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 0 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 0 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 0 22.00

2.4 Grouping data by "CustomerID"

The next step is to group all observations by CustomerID. This will make it easier to explore customer characteristics. While doing so, I created a variable called "days_as_customer" by taking the difference in days between the date of the last invoice in the dataset and the date of each customer's first order.
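
For reference, the per-customer aggregations built step by step in the subsections below could also be produced in a single groupby using pandas named aggregation (a sketch, assuming pandas 0.25 or later):

# one-pass customer summary equivalent to sections 2.4.1-2.4.5
customer_summary = data.groupby('CustomerID').agg(
    total_quantity=('Quantity', 'sum'),
    total_spending=('Tra_Spending', 'sum'),
    total_cancelled_orders=('Cancel', 'sum'),
    total_orders_placed=('InvoiceNo', 'count'),
    country_of_origin=('Country', 'first'),
    average_price_of_items=('UnitPrice', 'mean'),
    first_invoice=('InvoiceDate', 'min'))

# days between the last invoice date in the dataset and each customer's first order
customer_summary['days_as_customer'] = \
    (data['InvoiceDate'].max() - customer_summary['first_invoice']).dt.days + 1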

2.4.1 Sum "Quantity" , "Tra_Spending" and "Cancel"

In [88]:
#group dataset by CustomerID to start creating customer profiles.
#The following code groups the data by CustomerID and sums quantity, transaction spending, and number of cancelled orders
aggQ=data.groupby('CustomerID')['Quantity','Tra_Spending', 'Cancel'].sum()

aggQ.head(3)
Out[88]:
Quantity Tra_Spending Cancel
CustomerID
12346.0 0 0.00 1
12347.0 2458 4310.00 0
12348.0 2341 1797.24 0

2.4.2 Count "InvoiveNo"

In [89]:
#group by customer and count invoice lines to see how many orders the customer has placed in the period the data spans
#(note: .count() counts order lines, not unique invoices; .nunique() would count distinct orders)
Inv = data.groupby('CustomerID')['InvoiceNo'].count()
Inv.head(3)
Out[89]:
CustomerID
12346.0      2
12347.0    182
12348.0     31
Name: InvoiceNo, dtype: int64

2.4.3 Create "days_as_customer"

In [90]:
#### Create a variable that shows how long a person has been a customer.

#group by customerID and select the minimum date (i.e. the date at which the first order was placed)
#as_index=false means that customer ID is a column and not the index
days= data.groupby('CustomerID', as_index=False)['InvoiceDate'].min()

#edit names of columns 
days.columns = ['CustomerID', 'Earliest_Invoice']

#create a column with the date of the latest invoice in the whole dataset
days['Latest_Invoice'] = pd.to_datetime((data['InvoiceDate']).max())
days.set_index('CustomerID')
days.head(3)
Out[90]:
CustomerID Earliest_Invoice Latest_Invoice
0 12346.0 2011-01-18 10:01:00 2011-12-09 12:50:00
1 12347.0 2010-12-07 14:57:00 2011-12-09 12:50:00
2 12348.0 2010-12-16 19:09:00 2011-12-09 12:50:00
In [91]:
#calculate how many days the person has been a customer by subtracting the date of their
# first invoice from the date of the last invoice in the dataset
days['days_as_customer'] = 1 + (days.Latest_Invoice-days.Earliest_Invoice).astype('timedelta64[D]')

#delete earliest invoice and latest invoice columns
days.drop(['Earliest_Invoice', 'Latest_Invoice'], axis=1, inplace=True)
days.head()
Out[91]:
CustomerID days_as_customer
0 12346.0 326.0
1 12347.0 367.0
2 12348.0 358.0
3 12349.0 19.0
4 12350.0 310.0
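
An equivalent way to express the step above (before the two date columns are dropped) uses the timedelta's .dt.days accessor; a sketch:

# .dt.days extracts whole days from the timedelta directly
days['days_as_customer'] = (days.Latest_Invoice - days.Earliest_Invoice).dt.days + 1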

2.4.4 Add "Country" of customer

In [92]:
#group data by CustomerID and take the first value of 'Country'
country = data.groupby('CustomerID')['Country'].first()
country.head(3)
Out[92]:
CustomerID
12346.0    United Kingdom
12347.0           Iceland
12348.0           Finland
Name: Country, dtype: object

2.4.5 Average "UnitPrice"

In [93]:
#group by 'customerID' and average unitprice to see if the person buys expensive items or not
meanex = data.groupby('CustomerID')['UnitPrice'].mean()
meanex.head(3)
Out[93]:
CustomerID
12346.0    1.040000
12347.0    2.644011
12348.0    5.764839
Name: UnitPrice, dtype: float64

2.4.6 Join data together to create new dataset

In [94]:
# create list of dataframes to join that share the same index
to_join=[Inv, country, meanex]
# use join() to join aggQ with the three dataframes listed above
joined = aggQ.join(to_join)

joined = joined.join(days.set_index('CustomerID'))

# edit column names
joined.columns = ['total_quantity', 'total_spending',\
                    'total_cancelled_orders', 'total_orders_placed',\
                    'country_of_origin', 'average_price_of_items', 'days_as_customer']
In [95]:
#remove customers with negative total quantity or total spending
joined = joined.drop(joined[(joined.total_quantity < 0)].index)
joined = joined.drop(joined[(joined.total_spending < 0)].index)

The new dataset contains the following variables:

  • CustomerID: customer unique identifier
  • total_quantity: total number of items purchased by customer
  • total_spending: total amount spent by the customer
  • total_cancelled_orders: total number of orders cancelled by the customer
  • total_orders_placed: total number of orders placed by the customer
  • country_of_origin: country of origin of the customer
  • average_price_of_items: average price of items purchased
  • days_as_customer: days between first and last orders placed by the customer
In [96]:
# look at first rows of data set
joined.head()
Out[96]:
total_quantity total_spending total_cancelled_orders total_orders_placed country_of_origin average_price_of_items days_as_customer
CustomerID
12346.0 0 0.00 1 2 United Kingdom 1.040000 326.0
12347.0 2458 4310.00 0 182 Iceland 2.644011 367.0
12348.0 2341 1797.24 0 31 Finland 5.764839 358.0
12349.0 631 1757.55 0 73 Italy 8.289041 19.0
12350.0 197 334.40 0 17 Norway 3.841176 310.0

3.0 Data Visualization

I decided to explore the data visually before proceeding with analysis.

How many customers are in the dataset?
In [97]:
# look at the shape of the dataset joined. 
# this gives us the number of customers in the dataset because each row represents a customer.
joined.shape
Out[97]:
(4328, 7)
Where are they from?
In [98]:
# return the top 5 countries customers come from by
# counting the frequency of each country of origin and plotting a horizontal bar graph
joined['country_of_origin'].value_counts()[:5].plot(kind='barh')
Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x121db0320>

In the period between 01/12/2010 and 09/12/2011 the website served 4,328 customers (after removing those with negative totals). Additionally, even though the website sells internationally, the majority of customers placed orders from the United Kingdom. This could be due to high international shipping fees or poor marketing presence in foreign countries. International expansion could be a profit opportunity for the company.

How many orders were placed?
In [99]:
#check how many unique order numbers are in the original dataset "data" 
data['InvoiceNo'].unique().shape
Out[99]:
(22190,)
In [100]:
#check from how many countries orders are placed

# select country column unique values and return the shape of the table. 
joined['country_of_origin'].unique().shape
Out[100]:
(37,)
Where are the orders placed from?
In [101]:
#code returns the top 5 countries from which orders are placed.
data['Country'].value_counts()[:5].plot(kind='barh')
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x121d9aeb8>

Similarly to the customer-level summaries, from these simple descriptive statistics we understand that:

  • 22,190 orders were placed in the period between 01/12/2010 and 09/12/2011. This means that, on average, the website received around 59.5 orders per day. This is a number to monitor and possibly improve.
  • Orders are placed from 37 countries; however, the majority of orders, as expected, are placed from the UK. Again, growing its presence in foreign countries could be an important opportunity for the business.
Which countries bring the most revenue?
In [102]:
%%time
# make a working copy of the dataset so the new columns do not alter 'data' by reference
data_1 = data.copy()

# split date variable in "year", "month", "day", "hour", "minute", "week" to make it easier to plot
data_1['Date'] = data['InvoiceDate'].dt.date
data_1['Day'] = data['InvoiceDate'].dt.day
data_1['Month'] = data['InvoiceDate'].dt.month
data_1['Year'] = data['InvoiceDate'].dt.year
data_1['Hour'] = data['InvoiceDate'].dt.hour
data_1['Week'] = data['InvoiceDate'].dt.week
data_1['Minute'] = data['InvoiceDate'].dt.minute

#set up graph 
fig, ax = plt.subplots(2,1, figsize=(9, 9))

### create a dataframe "sales" which is grouped by date and shows the sum of all transaction spending in one month.

## subset data_1 to include year, month and transaction spending. 
## group the subset by year and month and sum transaction spending. 
sales = data_1[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()

#create a variable called 'Date' which contains year, month and day
#set day to be one to stand for the start of the month. 
sales['Day'] = 1
#convert 'Date' to pd datetime object
sales['Date'] = pd.to_datetime(sales[['Year', 'Month', 'Day']])

#set it as the index of the dataframe
sales = sales.set_index('Date')
#delete year, month and day columns
sales = sales.drop(['Day', 'Month', 'Year'], axis=1)

#create a plot of sales. 
ax[0].plot(sales)
#set the title to 'Monthly revenue'
ax[0].set_title('Monthly Revenue') 

### Create a dataframe "top_country"
##  subset data_1 to contain transaction spending and country
#   group by country and sum the transaction spending 
top_country = data_1[['Tra_Spending', 'Country']]\
                .groupby(['Country']).sum().reset_index().sort_values\
                (by='Tra_Spending', ascending=False)['Country'][1:5]

#create a loop that iterates over each country in top_country
for c in top_country:
    sales = data_1[data_1['Country'] == c]
    #subgroup sales and group by year and month. 
    sales = sales[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()
    #set day to 1 
    sales['Day'] = 1
    #create date with day, month and year
    sales['Date'] = pd.to_datetime(sales[['Day', 'Month', 'Year']])
    #set index of sales to be 'Date'
    sales = sales.set_index('Date')
    #delete columns year, month, and day
    sales = sales.drop(['Year', 'Month', 'Day'], axis=1)
    #plot
    ax[1].plot(sales.cumsum(), label=c)    
    #this line adds a legend to the graph
    ax[1].legend()
    #this line sets the title of the graph
    ax[1].set_title('Top countries by revenue')            
    
plt.show()
CPU times: user 2.31 s, sys: 161 ms, total: 2.47 s
Wall time: 2.19 s

The Monthly Revenue graph shows an upward trend between December 2010 and November 2011. The graph then shows a drop in sales in December 2011; however, this is because the December 2011 total only includes revenue from the first 9 days of the month.

The second graph shows cumulative revenue by country. Among the countries plotted, the Netherlands brings the highest revenue, followed by EIRE (the Republic of Ireland), Germany, and France. Note that the country with the highest total revenue, the United Kingdom, is excluded from this plot by the [1:5] slice above, so the curves shown are the top four revenue-generating countries outside the UK, even though the vast majority of orders are placed from the UK itself.

4.0 Analysis

To explore the relationships between the variables of the dataset I ran the sns.pairplot() function, which plots pairwise relationships in a dataset.

By default, sns.pairplot() only plots numerical columns; however, I have used the categorical variable "country_of_origin" to colour-code the plots. The function generates two basic figures: histograms and scatter plots. The histograms on the diagonal show us the distribution of a single variable, while the scatter plots show us the relationship between a pair of variables (Koehrsen, 2018).

In this case, the pairplot shows that "days_as_customer" is the only variable whose values are spread across its range (the others are heavily skewed towards small values), and that the most visible positive relationship between a pair of variables is between "total_quantity" and "total_spending". In fact, as expected, as the quantity purchased increases, the total spending also increases. Moreover, the colours of the plot show that the majority of customers are from the UK.

In [103]:
#pairplot gives a scatter plot for every pair of variables.
# hue = 'country_of_origin' colours each observation according
  # to its country of origin
sns.pairplot(joined, kind='scatter', hue= 'country_of_origin');

Since we are comparing variables measured in different units (for example, "total_quantity" is expressed in number of items while "total_spending" is expressed in GBP), I have decided to scale the dataset and run sns.pairplot() on the scaled dataset to see if any difference can be observed.

Scaling or standardising a dataset means subtracting the mean from each observation and dividing by the standard deviation. This gives all variables the properties of a standard normal distribution with μ=0 and σ=1 (Raschka, 2014).
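
Written out by hand for a single column, the standardisation used by preprocessing.scale() looks like the sketch below (note that scale() uses the population standard deviation, hence ddof=0):

# z-score of one column: subtract the mean, divide by the (population) standard deviation
z = (joined['total_spending'] - joined['total_spending'].mean()) \
        / joined['total_spending'].std(ddof=0)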

In [104]:
# scale dataset

#select numerical variables to scale
num_cols= ['total_quantity', 'total_spending',\
           'total_cancelled_orders', 'total_orders_placed',\
           'average_price_of_items', 'days_as_customer']

#subset dataset to contain only numerical variables
joined_toscale = joined[num_cols]

#scale dataset
joined_scaled = pd.DataFrame(preprocessing.scale(joined_toscale),
                           index=joined_toscale.index,
                           columns=joined_toscale.columns)

#add country of origin variable for coloring of pairplot
joined_scaled["country_of_origin"]= joined["country_of_origin"]
joined_scaled.head()
Out[104]:
total_quantity total_spending total_cancelled_orders total_orders_placed average_price_of_items days_as_customer country_of_origin
CustomerID
12346.0 -0.241593 -0.232636 -0.138209 -0.393813 -0.039807 0.852451 United Kingdom
12347.0 0.282003 0.289361 -0.274125 0.377212 -0.027119 1.198879 Iceland
12348.0 0.257080 -0.014967 -0.274125 -0.269592 -0.002433 1.122834 Finland
12349.0 -0.107179 -0.019774 -0.274125 -0.089687 0.017533 -1.741534 Italy
12350.0 -0.199628 -0.192136 -0.274125 -0.329561 -0.017650 0.717259 Norway
In [105]:
# new pairplot
sns.pairplot(joined_scaled, kind='scatter', hue= 'country_of_origin');

Scaling does not visibly change the shape of the pairplots; only the axis scales differ.

4.1 K-means Clustering Analysis

Since the aim of the project is to find segments of customers based on their shopping habits, k-means cluster analysis was chosen as a method of analysis.

Clustering is the process of grouping observations into classes so that objects within one class are highly similar and are highly dissimilar to objects in another class (Ye et al., 2013). Cluster-based approaches have been widely used since the 1970s to perform market segmentation. Their popularity led to the development of numerous clustering algorithms. Among them, the k-means algorithm is one of the most widely used in marketing research due to its ability to process large amounts of data (Chen et al., 2004).

The k-means algorithm works as follows (a minimal illustrative sketch follows the list):

  1. K number of clusters is determined
  2. The algorithm randomly determines initial centroids of each cluster
  3. Euclidean distance between remaining observations and cluster centroids is calculated.
  5. Observations are assigned to the cluster with the closest centroid
  5. Mean of each cluster is calculated
  6. For each cluster, if the difference of the mean value and centroid surpasses a threshold, the centroid value is replaced with the cluster mean
  7. Steps 3, 4, 5 and 6 are repeated until no centroid value is replaced (Yogyakarta et al., 2015).
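
To make the steps above concrete, the sketch below implements the same loop from scratch with NumPy on hypothetical toy data (it omits details such as empty-cluster handling and multiple random restarts):

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: Euclidean distance from every observation to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # step 4: assign each observation to its closest centroid
        labels = dist.argmin(axis=1)
        # steps 5-6: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 7: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy example: two well-separated blobs of points
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
toy_labels, toy_centroids = simple_kmeans(X_toy, k=2)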

Since observations are assigned to a cluster depending on their Euclidean distance to a cluster centroid, categorical values such as "country_of_origin" cannot be processed by the k-means clustering algorithm. For this reason, I am creating a dataset that does not contain "country_of_origin" and on which I can perform k-means clustering. Moreover, since it is not ideal for the k-means algorithm to depend on arbitrary measurement units, the data used for clustering is also scaled following the same procedure explained above.

In [106]:
# create a clustering dataset that contains only numeric variables
cluster_data_scaled = joined_scaled.loc[:, joined_scaled.columns != 'country_of_origin']
cluster_data_scaled.head(3)
Out[106]:
total_quantity total_spending total_cancelled_orders total_orders_placed average_price_of_items days_as_customer
CustomerID
12346.0 -0.241593 -0.232636 -0.138209 -0.393813 -0.039807 0.852451
12347.0 0.282003 0.289361 -0.274125 0.377212 -0.027119 1.198879
12348.0 0.257080 -0.014967 -0.274125 -0.269592 -0.002433 1.122834

Scaling the data makes sure that all variables have mean 0 and the same (unit) standard deviation.

In [107]:
# check the mean and standard deviation of all variables. 
cluster_data_scaled.describe().loc[['mean', 'std'], :]

#now all variables have the same standard deviation. 
Out[107]:
total_quantity total_spending total_cancelled_orders total_orders_placed average_price_of_items days_as_customer
mean -6.566938e-18 2.462602e-17 -3.940163e-17 -4.104337e-18 8.208673e-19 -3.858076e-17
std 1.000116e+00 1.000116e+00 1.000116e+00 1.000116e+00 1.000116e+00 1.000116e+00

4.1.2 Determine the Value of K - The Elbow Method

The first step of the k-means clustering algorithm is deciding the number of clusters we want the observations to be divided into. There are several methods that help decide which value of "k" is most suited to the data.

One method for determining an appropriate "k" is the "Elbow Method". The elbow method works by running k-means clustering for a range of values of "k" (here, from 2 to 9) and, for each value of "k", calculating the sum of squared errors (SSE).

The method then plots the SSE for each value of "k". From the plot, we want to select the smallest value of "k" that still has a small SSE (Gove, 2017).

In [108]:
# ELBOW METHOD # 

# Data to run in the loop 
X = cluster_data_scaled
# create empty list to store SSEs for different k means clustering
SSE = []

# the loop runs for k from 2 to 9 (range(2, 10))
for k in range(2, 10):
    # set up k means for each k between 2 and 10 
    kmeans = cluster.KMeans(n_clusters=k)
    # fit k means in data
    kmeans.fit(X)
    # append SSE values into empty list 
    SSE.append(kmeans.inertia_)

# set up elbow graph 
fig = plt.figure(figsize=(7, 5))
# plot values
plt.plot(range(2, 10), SSE)
# add grid
plt.grid(True)
# add title
plt.title('Elbow curve')
Out[108]:
Text(0.5,1,'Elbow curve')

In this case, I am selecting k = 6

4.1.3 Perform K-means clustering

In [109]:
# set up clustering analysis with k=6
k6 = cluster.KMeans(n_clusters=6)
k6
Out[109]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=6, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [110]:
# fit clustering model on dataset created for clustering
np.random.seed(1234)
k6cl = k6.fit(cluster_data_scaled)
In [111]:
#check labels
labels=k6cl.labels_

labels
Out[111]:
array([4, 4, 4, ..., 1, 4, 4], dtype=int32)
In [112]:
# count how many observations each cluster has. 
labels_tb = pd.DataFrame(joined.groupby(k6cl.labels_).size())

#rename column
labels_tb.rename(columns={0:'No Customers per Cluster'}, inplace=True)

labels_tb
Out[112]:
No Customers per Cluster
0 140
1 1635
2 7
3 5
4 2540
5 1
In [113]:
#create table with mean values per cluster
table_of_means=joined.groupby(k6cl.labels_).mean()
#add average spend per order by dividing average total spending by average number of orders placed
table_of_means['average_spend_order']=\
                    table_of_means['total_spending']/table_of_means['total_orders_placed']
In [114]:
# look at mean of the clusters 

# .T transposes the table to make it easier to read
table_of_means.T
Out[114]:
0 1 2 3 4 5
total_quantity 8127.071429 426.318654 35328.571429 92827.4000 930.029528 60.00
total_spending 13288.101143 660.322741 59766.425714 192103.8540 1572.361954 649.50
total_cancelled_orders 22.414286 0.544343 104.571429 20.6000 1.521260 2.00
total_orders_placed 452.542857 44.839755 4286.428571 1013.4000 92.447638 5.00
average_price_of_items 4.020325 4.623149 3.841534 6.3574 3.954898 8055.78
days_as_customer 342.285714 90.743731 334.714286 360.0000 304.595669 182.00
average_spend_order 29.363188 14.726279 13.943175 189.5637 17.008136 129.90

4.1.4 Characterise the Clusters

K-means cluster analysis divided the 4,328 observations in the dataset into 6 clusters. From the labels table it is possible to observe that the majority of observations are contained in clusters 1 and 4, while most of the remainder fall into cluster 0.

Looking at the mean of each cluster, I am dividing customers into three main segments.

  1. "The Abituals": Cluster 4 customers. This is the largest cluster in size. Customers in this segment have been purchasing from the e-commerce for almost 1 year. On average, they have spent £1,572 and canceled orders rarely (they cancel on average 1.52 orders out of the 92 placed). Their average spend per order is £17.
  2. "The New-comers": Cluster 1 customers. This is the second largest cluster in size. Customers in this segment have recently discovered the e-commerce. They cancel their orders rarely (0.5 canceled orders out of 45 placed) and have spent on average £660 in the last year , which amounts for £15 per order.
  3. "The Big Spenders": Cluster 0 customers. This is the third largest cluster. Customers in this segment, buy in big quantities, place a lot of orders (more than one per day) and have been customers for 342 days on average. Their average spend per order is £30.

4.2 Regression Analysis

This section is dedicated to using multiple linear regression to predict the number of orders placed by each customer from a variety of different predictors.

First, I created a baseline regression model. Second, I improved the baseline model by scaling the predictive variables and by adding a categorical predictor. Third, I used the random forest algorithm to further improve the model. Lastly, I assessed the performance of each regression model with and without cross-validation.

Linear regression aims at finding linear relationships between two or more continuous variables. A multiple linear regression model can be summarised by the following equation:

**y = β0 + β1x1 + β2x2 + ··· + βkxk + ε**

where:

  • y is the predicted variable
  • β0 is the y-intercept of the equation
  • β1, β2,..., βk are the coefficients of the predictive variables
  • x1, x2,...,xk are the predictive variables
  • ε is the random error factor

(Swaminathan, 2018)

In [115]:
# recall data for regression 
joined.head()
Out[115]:
total_quantity total_spending total_cancelled_orders total_orders_placed country_of_origin average_price_of_items days_as_customer
CustomerID
12346.0 0 0.00 1 2 United Kingdom 1.040000 326.0
12347.0 2458 4310.00 0 182 Iceland 2.644011 367.0
12348.0 2341 1797.24 0 31 Finland 5.764839 358.0
12349.0 631 1757.55 0 73 Italy 8.289041 19.0
12350.0 197 334.40 0 17 Norway 3.841176 310.0

4.2.1 Baseline Model

My baseline model aims at predicting the number of total orders placed by a customer from:

  • their total number of cancelled orders
  • the average price of products purchased
  • the total quantity of items purchased.
In [116]:
#regression model 
m1_baseline = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + total_quantity',\
            joined).fit()

m1_baseline.summary()
Out[116]:
OLS Regression Results
Dep. Variable: total_orders_placed R-squared: 0.432
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 1097.
Date: Mon, 14 May 2018 Prob (F-statistic): 0.00
Time: 19:34:04 Log-Likelihood: -28517.
No. Observations: 4328 AIC: 5.704e+04
Df Residuals: 4324 BIC: 5.707e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 46.7812 2.804 16.686 0.000 41.285 52.278
total_cancelled_orders 16.1972 0.396 40.880 0.000 15.420 16.974
average_price_of_items -0.0141 0.021 -0.669 0.504 -0.056 0.027
total_quantity 0.0129 0.001 20.695 0.000 0.012 0.014
Omnibus: 7904.201 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24238783.825
Skew: 12.984 Prob(JB): 0.00
Kurtosis: 368.700 Cond. No. 5.06e+03

4.2.1.1 Model Explanation

From the summary of the baseline model I can infer that:

  • When total cancelled orders, average price of items, and total quantity are all 0, the predicted number of total orders is 46.8 (the intercept).
  • Average price of items is negatively associated with number of orders. This means that, as the average price of items purchased increases, the number of orders placed decreases.
  • The model has an R squared value of 0.432. This means that the model explains 43.2% of the total variation in the number of orders placed. R squared is also used as a measure of regression performance and will be used later to compare the baseline model to the other models created.

4.2.2 Scaled Baseline Model

The first improvement I want to make to the baseline model is to scale the predictive variables. This step is advisable because the three variables are expressed in different units (i.e. number of orders, GBP and number of items). Scaling the variables makes sure that they are all centred around 0 and have the same standard deviation.

In [117]:
# Scale the predicting variables because they are on different scales
# (the first one is in orders, and the second one is in pounds.)

# create a list that includes the columns to scale
cols = ['total_cancelled_orders', 'average_price_of_items', 'total_quantity']

#scale columns by subtracting mean and dividing by standard deviation
scaled_for_reg = pd.DataFrame(scale(joined[cols]),
                   index=joined.index,
                   columns=cols)

#add column with variable to predict 
scaled_for_reg['total_orders_placed'] = joined['total_orders_placed']

#this is the second model with the scaled columns.
m2_scaled = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + total_quantity',\
            scaled_for_reg).fit()

m2_scaled.summary()
Out[117]:
OLS Regression Results
Dep. Variable: total_orders_placed R-squared: 0.432
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 1097.
Date: Mon, 14 May 2018 Prob (F-statistic): 0.00
Time: 19:34:04 Log-Likelihood: -28517.
No. Observations: 4328 AIC: 5.704e+04
Df Residuals: 4324 BIC: 5.707e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 93.9378 2.675 35.116 0.000 88.693 99.182
total_cancelled_orders 119.1708 2.915 40.880 0.000 113.456 124.886
average_price_of_items -1.7888 2.675 -0.669 0.504 -7.033 3.456
total_quantity 60.3286 2.915 20.695 0.000 54.613 66.044
Omnibus: 7904.201 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24238783.825
Skew: 12.984 Prob(JB): 0.00
Kurtosis: 368.700 Cond. No. 1.52

4.2.2.1 Model Explanation

From the summary of the scaled, improved model I can infer that:

  • When the scaled predictors (total cancelled orders, average price of items, and total quantity) are 0, i.e. at their mean values, the predicted number of total orders is 93.9 (the intercept).
  • Average price of items is still negatively associated with the number of orders. In the scaled model the coefficient is larger in magnitude only because of the change of units: a one-standard-deviation increase in average price corresponds to about 1.79 fewer orders, compared with about 0.01 fewer orders per £1 increase in the unscaled model.
  • The model's R squared has not changed.

4.2.3 Predictive Checking

Since the aim of this regression model is to predict the future number of orders placed by a potential customer, it is important to compare the distribution of the actual "total_orders_placed" with the distributions of "total_orders_placed" predicted by the baseline model and by the scaled baseline model.

To do so, I am using Seaborn's kdeplot() function, which gives a smooth estimate of the distribution using kernel density estimation (VanderPlas, 2016). Using kdeplot() I plotted the distribution of actual values, the distribution of the values predicted by the baseline model, and the distribution of the values predicted by the scaled model. Finally, I plotted the three curves in the same graph to assess whether the regression models are "picking up" the overall shape of the data.

4.2.3.1 Distribution of Actual Values
In [118]:
#Distribution of actual number of orders
sns.kdeplot(joined['total_orders_placed'], shade=True, color = 'black')
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x121ae3cc0>
4.2.3.2 Distribution of Baseline Model Fitted Values
In [119]:
#distribution of predicted numbers of orders
sns.kdeplot(m1_baseline.fittedvalues, shade=True, color='red')
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x1228806a0>
4.2.3.3 Distribution of Scaled Model Fitted Values
In [120]:
#distribution of predicted numbers of orders second prediction 
sns.kdeplot(m2_scaled.fittedvalues, shade=True, color='green')
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x121ca3710>
4.2.3.4 Comparison of Distributions
In [121]:
# plot all three distributions in the same graph
f, ax = plt.subplots(1, figsize=(10, 6))
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax, label='actual', color= 'black')
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, label='scaled', color = 'orange')
sns.kdeplot(m1_baseline.fittedvalues, shade=True, ax=ax, label='baseline', color = 'red')
plt.show()

From the graph above, it is possible to see that the baseline and scaled models' fitted values (red and orange curves) have the same distribution, and that this distribution captures the overall shape of the actual values (black curve); however, the regression models overestimate the number of orders.

To get a better distribution match, I am going to further improve the model by adding a categorical variable as a predictor in the new model.

4.3 Adding a Categorical Variable

Categorical variables can be added to regression models as dummy variables. The statsmodels formula interface sm.ols() automatically recognises string variables and treats them as categorical.

For this project, I decided to create a variable "continent_of_origin" and use it as a predictor in the regression model. Since the majority of customers are based in Europe, I decided to sub-divide European countries by European region (i.e. north, south, east and west Europe). The classification of the countries was made according to the United Nations' official geographic regions composition (UN, n.d.).

In [122]:
# add "country_of_origin" variable to scaled dataset
scaled_for_reg['country_of_origin'] = joined['country_of_origin']
In [123]:
#create lists of countries grouped by continent/European region.
north_europe=['Iceland','Norway','Denmark','Sweden','Finland','Lithuania',\
             'EIRE','Channel Islands','United Kingdom']
east_europe=['Poland','Czech Republic']
south_europe=['Greece','Italy', 'Malta','Spain','Portugal',]
west_europe=['Switzerland', 'Austria','Belgium', 'Netherlands']
asia=['Cyprus','Bahrain','Israel','Saudi Arabia','United Arab Emirates','Singapore',\
      'Japan','Lebanon']
australia=['Australia']
africa=['RSA']
america=['USA','Brazil','Canada']
In [124]:
# Code country of origin by continent

# Create a function that will assign "unspecified" to all rows unless
# the country is contained in one of the lists declared above.

# Declare function
def continent(row):
    #assign all rows the default value "unspecified"
    value = "unspecified"
    # but, if country is contained in the list north_europe, 
    if row['country_of_origin'] in north_europe:
        # give value europe and so on... 
        value = 'north europe'
    if row['country_of_origin'] in south_europe: 
        value = 'south europe'
    if row['country_of_origin'] in east_europe:
        value = 'east europe'
    if row['country_of_origin'] in west_europe:
        value = 'west europe'
    if row['country_of_origin'] in asia:
        value ='asia'
    if row['country_of_origin'] in australia:
        value='australia'
    if row['country_of_origin'] in africa:
        value = 'africa'
    if row['country_of_origin'] in america:
        value= 'america'
    #at the end, return value
    return value

# Create a new column 'continent_of_origin' and apply the function. 
scaled_for_reg['continent_of_origin'] = scaled_for_reg.apply(continent, axis=1)
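
An equivalent, typically faster alternative (a sketch) builds a country-to-region dictionary from the lists above and uses .map() instead of the row-by-row apply():

# build a lookup from country name to region, then map it over the column
country_to_continent = {}
for region, countries in [('north europe', north_europe), ('east europe', east_europe),
                          ('south europe', south_europe), ('west europe', west_europe),
                          ('asia', asia), ('australia', australia),
                          ('africa', africa), ('america', america)]:
    for c in countries:
        country_to_continent[c] = region

# countries not found in any list fall back to 'unspecified'
scaled_for_reg['continent_of_origin'] = (scaled_for_reg['country_of_origin']
                                         .map(country_to_continent)
                                         .fillna('unspecified'))
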
In [125]:
scaled_for_reg.head()
Out[125]:
total_cancelled_orders average_price_of_items total_quantity total_orders_placed country_of_origin continent_of_origin
CustomerID
12346.0 -0.138209 -0.039807 -0.241593 2 United Kingdom north europe
12347.0 -0.274125 -0.027119 0.282003 182 Iceland north europe
12348.0 -0.274125 -0.002433 0.257080 31 Finland north europe
12349.0 -0.274125 0.017533 -0.107179 73 Italy south europe
12350.0 -0.274125 -0.017650 -0.199628 17 Norway north europe
In [126]:
# create new model
m3_improved = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + continent_of_origin + total_quantity', scaled_for_reg ).fit()
# look at model summary
m3_improved.summary()
Out[126]:
OLS Regression Results
Dep. Variable: total_orders_placed R-squared: 0.435
Model: OLS Adj. R-squared: 0.434
Method: Least Squares F-statistic: 302.5
Date: Mon, 14 May 2018 Prob (F-statistic): 0.00
Time: 19:34:06 Log-Likelihood: -28505.
No. Observations: 4328 AIC: 5.703e+04
Df Residuals: 4316 BIC: 5.711e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 100.9845 175.679 0.575 0.565 -243.437 445.406
continent_of_origin[T.america] -209.7612 185.246 -1.132 0.258 -572.938 153.416
continent_of_origin[T.asia] -43.6005 179.159 -0.243 0.808 -394.844 307.644
continent_of_origin[T.australia] -160.7937 185.255 -0.868 0.385 -523.989 202.402
continent_of_origin[T.east europe] -45.5345 187.809 -0.242 0.808 -413.737 322.668
continent_of_origin[T.north europe] -4.7807 175.701 -0.027 0.978 -349.246 339.684
continent_of_origin[T.south europe] -15.3724 177.005 -0.087 0.931 -362.392 331.648
continent_of_origin[T.unspecified] -23.6991 176.153 -0.135 0.893 -369.050 321.652
continent_of_origin[T.west europe] -22.3453 177.103 -0.126 0.900 -369.557 324.867
total_cancelled_orders 120.0132 2.923 41.060 0.000 114.283 125.744
average_price_of_items -1.8155 2.670 -0.680 0.497 -7.051 3.420
total_quantity 60.6905 2.931 20.704 0.000 54.944 66.437
Omnibus: 7913.861 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24157473.530
Skew: 13.023 Prob(JB): 0.00
Kurtosis: 368.078 Cond. No. 271.

4.3.1 Model Explanation

From the summary of the model containing country of origin as a predictor, I can infer that:

  • The R squared value increases to 0.435
  • Coefficients suggest that, relative to the baseline category, being from the American continent is associated with the largest decrease in the number of orders placed, while being from north Europe is associated with the smallest decrease, consistent with the fact that most customers come from that region (as observed above). Note, however, that none of the continent coefficients are statistically significant (all p-values are well above 0.05).

4.3.2 Predictive Checking

In [127]:
# graph baseline, scaled and categorical model
f, ax = plt.subplots(1, figsize=(15, 5))
# first plot 
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax,\
            label='actual', color = 'black')
# second plot 
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, \
            label='scaled', color= 'orange')
# third plot 
sns.kdeplot(m3_improved.fittedvalues, shade=True, ax=ax,\
            label='with categorical', color='green')

plt.show()

The latest model does not seem to improve the prediction, as the distribution curve has remained essentially unchanged from that of the scaled model.

Next, I will explore the models' performance, to have a better idea of their predictive power.

4.4 Model performance

To assess the model performance, I will compare the R squared, Mean Square Error (MSE), and Mean Absolute Error (MAE) of the baseline model with the measures of the latest model.

The MAE measures the average absolute error of a set of forecasts. The MSE measures the average of the squared errors of a set of forecasts. Along with R squared, MAE and MSE are commonly used as measures of model performance.
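
In formulas, for n forecasts with actual values yi and fitted values ŷi:

**MAE = (1/n) Σ |yi − ŷi|    MSE = (1/n) Σ (yi − ŷi)²**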

In [128]:
#R^2
r2 = pd.Series({'Baseline Model': metrics.r2_score(joined['total_orders_placed'],
                                              m1_baseline.fittedvalues),
                'Latest Model': metrics.r2_score(scaled_for_reg['total_orders_placed'],
                                              m3_improved.fittedvalues)})
r2
Out[128]:
Baseline Model    0.432277
Latest Model      0.435298
dtype: float64
In [129]:
# MSE
mse = pd.Series({'Baseline Model': metrics.mean_squared_error(joined['total_orders_placed'],
                                                        m1_baseline.fittedvalues),
                 'Latest Model': metrics.mean_squared_error(scaled_for_reg['total_orders_placed'],
                                                         m3_improved.fittedvalues)})
In [130]:
# MAE
# taking the MAE between actual values and the values forecasted by the baseline model
mae = pd.Series({'Baseline Model': metrics.mean_absolute_error(joined['total_orders_placed'],
                                                         m1_baseline.fittedvalues),
                 #taking the MAE between actual values and the values forecasted by m3 model
                 'Latest Model': metrics.mean_absolute_error(scaled_for_reg['total_orders_placed'],
                                                          m3_improved.fittedvalues)})
In [131]:
perf = pd.DataFrame({'MAE': mae,
                     'MSE': mse,
                     'R^2': r2})
perf
Out[131]:
MAE MSE R^2
Baseline Model 67.967766 30941.743636 0.432277
Latest Model 68.615029 30777.065205 0.435298

From the comparison of MAE, MSE and R squared, it is important to note that R squared and MSE improved slightly from the baseline model to the latest model, while the MAE is slightly higher for the latest model.

Since the aim of the regression model is to make predictions, it is appropriate to assess the validity of the model with cross-validation. Cross-validation is a validation technique for assessing how a model will perform on a different, independent data set. The first step is to split the data set into a train set and a test set. The forecasting model is then trained on the train set and subsequently validated on the test set (Seni and Elder, 2010).
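
Scikit-learn also offers k-fold cross-validation, which repeats this split several times and averages the scores; a sketch using the already imported model_selection module and LinearRegression on the baseline predictors:

# 5-fold cross-validated R squared for a linear model with the baseline predictors
cv_cols = ['total_cancelled_orders', 'average_price_of_items', 'total_quantity']
cv_scores = model_selection.cross_val_score(LinearRegression(),
                                            scaled_for_reg[cv_cols],
                                            joined['total_orders_placed'],
                                            cv=5, scoring='r2')
cv_scores.mean()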

4.4.1 Cross-Validation

4.4.2 Split the Data into Train and Test set

In [132]:
# Split dataset into train and test sets (test_size=0.8 holds out 80% of the observations for testing)
# names of predictive variables
cols = ['total_cancelled_orders', 'average_price_of_items',\
            'total_quantity'] 

# create test and train sets 
x_train, x_test, y_train, y_test =\
                    model_selection.train_test_split(scaled_for_reg[cols], 
                                                    joined['total_orders_placed'],
                                                    test_size=0.8)

4.4.3 Train Model on Train Set

In [133]:
# Train data on train set
# f is a variable that includes the regression equation
f = 'total_orders_placed ~ total_cancelled_orders + average_price_of_items + total_quantity'

# this is the regression model
m1_train = sm.ols(f, x_train.assign(total_orders_placed=y_train)).fit()

4.4.4 Compare Parameters and R Squared

In [134]:
# create a data frame containing the parameters of the full-dataset model and of the train-set model.
# Note: the full-dataset baseline uses unscaled predictors while the train-set model uses scaled ones,
# so the coefficient magnitudes are not directly comparable.
pd.DataFrame({'Full Dataset': m1_baseline.params,
              'Train Set': m1_train.params})
Out[134]:
Full Dataset Train Set
Intercept 46.781173 87.001987
total_cancelled_orders 16.197225 77.201597
average_price_of_items -0.014149 -184.700028
total_quantity 0.012851 121.809338
In [135]:
# compare the R squared of the full-dataset model and of the train-set model
pd.Series({'Full Dataset': m1_baseline.rsquared,
           'Train Set': m1_train.rsquared})
Out[135]:
Full Dataset    0.432277
Train Set       0.440381
dtype: float64

4.4.5 Predict Test Set Values and Explore R Squared

In [136]:
# make predictions on test set. 
test_prediction = m1_train.params['Intercept'] + \
            x_test.dot(m1_train.params.drop('Intercept'))

#create a series with the r squared of baseline, train set data and test set data
pd.Series({'0-Full Dataset': m1_baseline.rsquared,
           '1-Train Set': m1_train.rsquared,
           '2-Test Set': metrics.r2_score(y_test,
                                        test_prediction)})
Out[136]:
0-Full Dataset    0.432277
1-Train Set       0.440381
2-Test Set       -0.378059
dtype: float64

From the comparison of R squared values on the full data, the train set and the test set, it is possible to note that R squared increases slightly from the full data to the train data, but decreases, and becomes negative, on the test data.

This is an example of "overfitting". Overfitting happens when a model describes the random error in the data rather than the relationships between variables (Frost, 2017). This means that the model fits the specific dataset used for analysis reasonably well (R squared is 0.43), but it has little forecasting power over an independent data set (R squared is -0.38).

4.5 Improve the model: Random forest

Random Forest is a machine learning algorithm used for classification and regression problems. It builds an ensemble of decision trees on bootstrap samples of the data and averages their predictions. It is one of the most widely used algorithms because of its simplicity and effectiveness.

In [137]:
# run random forest regression. we can fit and predict in the same line
m4_RandomF = RandomForestRegressor().fit(joined[cols], joined['total_orders_placed'])\
                            .predict(joined[cols])

4.5.1 Assess Performance

In [138]:
# examine the MAE, MSE and R squared of the Random Forest model
rf = pd.Series({'R^2': metrics.r2_score(joined['total_orders_placed'], m4_RandomF), 
                'MSE': metrics.mean_squared_error(joined['total_orders_placed'], m4_RandomF),
                'MAE': metrics.mean_absolute_error(joined['total_orders_placed'], m4_RandomF)})

# add RF performance indicators to the table of baseline and m3 model
pd.concat([perf, 
           pd.DataFrame({'Random Forest': rf}).T])
Out[138]:
MAE MSE R^2
Baseline Model 67.967766 30941.743636 0.432277
Latest Model 68.615029 30777.065205 0.435298
Random Forest 22.028809 5153.101810 0.905450

From the table above, it is important to note that the Random Forest regressor performs considerably better than the baseline model and the latest model with scaled variables and a categorical predictor.

Its MAE went from 68 to 22, its MSE went from about 31,000 to about 5,000, and its R squared value is 0.91, meaning that the model captures about 90% of the variation in the predicted variable. Note, however, that these figures are computed on the same data the model was fitted on.

4.5.2 Random Forest Cross Validated

Just as I did for the previous model, I am going to assess the performance of the Random Forest model with cross-validation to see if it can be used to make predictions.

In [139]:
# previously the model was fitted on the same dataset where the predictions were made.
# in this case, we are making the predictions on the test set

# random forest cross validation. 
m4_cv = RandomForestRegressor().fit(x_train, y_train).predict(x_test)
In [140]:
# get the MSE, MAE and R2 for the cross-validated RF
rf_cv = pd.Series({'R^2': metrics.r2_score(y_test, m4_cv), 
                'MSE': metrics.mean_squared_error(y_test, m4_cv),
                'MAE': metrics.mean_absolute_error(y_test, m4_cv)})
pd.concat([perf, 
           pd.DataFrame({'RF': rf}).T,
           pd.DataFrame({'RF-CV': rf_cv}).T])
Out[140]:
MAE MSE R^2
Baseline Model 67.967766 30941.743636 0.432277
Latest Model 68.615029 30777.065205 0.435298
RF 22.028809 5153.101810 0.905450
RF-CV 58.062258 37447.041282 0.346153

As expected, the cross-validated Random Forest regressor's performance is much worse than the non-cross-validated performance measures. This, once again, is a sign of overfitting and suggests that the regression model has little predictive power over an independent dataset.

5.0 Conclusion and Recommendations

In conclusion, this project was carried out using a dataset containing 541,909 data points corresponding to e-commerce transactions of a UK online retailer. The aim was to segment customers and to create a regression model that could forecast the number of orders placed by each customer.

To carry out the project, I first pre-processed the data by inspecting it, transforming existing variables, creating new variables, dealing with missing values, grouping the data at customer level and finally performing descriptive statistics and data visualization. The second step was to segment customers into categories; to do so, I used k-means cluster analysis. From this, I was able to divide customers into three main classes:

  1. The Habituals
  2. The Newcomers
  3. The Big Spenders

After performing clustering analysis, I modelled a multiple linear regression with the aim of forecasting the number of orders placed by each customer. I first created a baseline model, which was then improved by scaling the predictive variables and by adding a categorical variable as a predictor. Model performance was assessed with and without cross-validation, and this showed strong signs of overfitting. To overcome this problem and to improve the predictive power of the model, I used the Random Forest algorithm and assessed its performance with and without cross-validation. Once again, the performance indicators suggested that the regression model has little predictive power over an independent dataset.

To solve this problem, for future research it is advisable to join the current dataset with information about the demographics and shopping habits of each customer. This would add valuable information that could be used by the clustering algorithm to further segment customers and by the regression model to find meaningful relationships among the variables in the data.

Bibliography.

Archive.ics.uci.edu. (2015). UCI Machine Learning Repository: Online Retail Data Set. [online] Available at: http://archive.ics.uci.edu/ml/datasets/online+retail# [Accessed 14 May 2018].

Chen, J., Ching, R. and Lin, Y. (2004). An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society, 55(9), pp.976-987.

Frost, J. (2017). Overfitting Regression Models: Problems, Detection, and Avoidance - Statistics By Jim. [online] Statistics By Jim. Available at: http://statisticsbyjim.com/regression/overfitting-regression-models/ [Accessed 14 May 2018].

Gove, A. (2017). Using the elbow method to determine the optimal number of clusters for k-means clustering. [online] Bl.ocks.org. Available at: https://bl.ocks.org/rpgove/0060ff3b656618e9136b [Accessed 14 May 2018].

Kaggle.com. (2018). E-Commerce Data | Kaggle. [online] Available at: https://www.kaggle.com/carrie1/ecommerce-data [Accessed 14 May 2018].

Trochim, W. (2006). Descriptive Statistics. [online] Research Methods Knowledge Base. Available at: https://www.socialresearchmethods.net/kb/statdesc.htm [Accessed 14 May 2018].

Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166 [Accessed 14 May 2018].

Raschka, S. (2014). About Feature Scaling and Normalization. [online] Dr. Sebastian Raschka. Available at: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html [Accessed 14 May 2018].

Seni, G. and Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), pp.1-126.

Swaminathan, S. (2018). Linear Regression — Detailed View – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86 [Accessed 14 May 2018].

UN (2018). UNSD — Methodology. [online] Unstats.un.org. Available at: https://unstats.un.org/unsd/methodology/m49/ [Accessed 14 May 2018].

VanderPlas, J. (2016). Python data science handbook. 1st ed.

Ye, L., Qiuru, C., Haixu, X., Yijun, L. and Guangping, Z. (2013). Customer Segmentation for Telecom with the k-means Clustering Method. Information Technology Journal, 12(3), pp.409-413.