Customer Segmentation¶

1.0 Introduction

The aim of this project is to perform customer segmentation on a dataset of e-commerce transactions. After dividing customers into segments, linear regression is performed to predict the number of orders placed by each customer depending on a variety of different variables.

The data in question is an open source dataset created by the UCI Machine Learning Repository containing 541,909 E-commerce transactions dated from 01/12/2010 until 09/12/2011 (UCI, 2018).

I have downloaded the dataset from Kaggle.com, a platform for predictive modelling and analytics competitions which stores open source datasets uploaded by companies or users (Kaggle, 2018).

The data set contains transactional data of a UK-based online retailer that sells unique all occasion gifts to wholesales and end users. The data set variabled are explained below:

InvoiceNo: Nominal. Transaction unique identifier. If this code starts with the letter 'C', the order was cancelled
StockCode: Nominal. Product unique identifier
Description: Nominal. Product name
Quantity: Numeric. The quantities of each product per transactio.
InvoiceDate: Numeric. Invice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Numeric. Unit price. Numeric, Product price per unit in sterling.
CustomerID: Nominal. Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Nominal. Country name. Nominal, the name of the country where each customer resides.

The project is divided into four main sections:

Introduction
Data pre-processing
Data visualization
Data analysis and discussion of results
Conclusion and recommendations for future work

1.1 Download Data¶

The data was downloaded and unzipped manually in .csv format in the working directory of the project.

The data set contained characters that were not encoded as UTF-8 therefore, it was opened in a text editor and encoded to UTF-8 before reading it into jupyter notebook.

1.2 Import Pyton Libraries¶

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import pysal as ps
import geopandas as gpd
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm

1.3 Read Data on Jupyter Notebook¶

#this line reads .csv files into the script
data = pd.read_csv('data.csv')

1.4 Inspect the data¶

Before pre-processing and performing analysis, I explored the data with descriptive statistics. Descriptive statistics is used to describe the basic features of the dataset we want to analyse and, together with data visualization, it allows us to understand the shape and general attributes of our data (Trochim, 2006).

#look at first 10 rows of the dataset 
data.head(10)

1.4.1 Summary Statistics¶

#describe() returs count, mean, standard deviation, minimun value and maximum value
# of the numerical variables of the dataset
data.describe()

From this first glimpse at the data it is important to note that:

roughly 14,000 unique customers IDs are missing from the dataset
there are negative values of unit prices and quantity of products purchased
on average, customer buy 9.5 products at each transaction and spend £ 4.6

2.0 Data Pre-Processing

2.1 Assign data to the right type¶

Assigning data to the right type makes sure that, when performing analysis, calculations are consistent, and data is processed in the right way.

To check to what type the variables are assigned to by default, I am using the .info() function.

#get information about the dataset and what type each variable is assigned to by default
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

From this, I am choosing to format the “InvoiceDate” column to a datetime variable, and the “Country” column to a categorical variable.

%%time
#make date format consistent 
data['InvoiceDate']=pd.to_datetime((data['InvoiceDate']))

CPU times: user 1min 40s, sys: 98 ms, total: 1min 40s
Wall time: 1min 40s

#make country a categorical variable
data["Country"] = data["Country"].astype('category')

2.2 Missing Values¶

As the summary statistics suggested, data is missing from the data set. For this reason, I decided to create a function that returns a table with the number of missing observations per variable.

## This function counts and creates a dataframe containing the nuber of missing values per column.

# Declare the name of the fuction 
def count_missing(data):
    # creates a dataframe containing the sum of all missing values
    missing_count = pd.DataFrame(data.isnull().sum(),\
                                 # names column "Count" and sorts values in descending order
                                 columns=['Count']).sort_values(by=['Count'],\
                                                                ascending=False)
    return missing_count

#apply the function to the data 
count_missing(data)

As expected, 13,5080 CustomerIDs are missing from the dataset as well as 1,454 product descriptions.

2.2.1 Dealing With Missing Values¶

To deal with missing values, I decided to fill each blank description cell with “Missing”, and to discard all the rows with a missing CustomerID. This decision was made because the aim of this project is to explore the data at customer level, therefore all observations that are not assigned to a particular customer cannot be used for analysis.

# replace missing descriptions with "Missing"
data["Description"] = data["Description"].fillna("Missing")

#delete all rows with a missing "CustomerID"
data.dropna(axis = 0, subset = ['CustomerID'], inplace = True)

#check that there aren't any missing values
count_missing(data)

2.3 Create New Variables¶

2.3.1 Canceled Orders¶

Create a new varialble 'Cancel'¶

As explained in the introduction of this report, cancelled orders are the ones whose “InvoiceNo” start with a “C”. This allowed me to create a binomial variable that indicates whether an order was cancelled or not. The new variable “Cancel” takes the value of 1 of the order was cancelled, or the value of 0 if the order was not cancelled.

To do so, I have created a function named cancel() which assigns the value 0 to all rows, however, if the first character of each row in the column "InvoiceNo" is "C", it assigns the value of 1

%%time
#create a function that will assign 0 to all variables unless the first character of 'InvoiceNo' in 'C'
def cancel(row):
    value = 0
    if row['InvoiceNo'][0] == 'C':
        value = 1
    return value

# Create a new column 'Cancel' to attach to 'data' and set it to the value returned 
  #by the function cancel().
    
# The code 'axis=1' makes the apply function process the dataset by row, 
  #as opposed to by column which is the default option.
data['Cancel'] = data.apply(cancel, axis=1)

# the %%time code returns how much time is needed to run this function.

CPU times: user 30.8 s, sys: 122 ms, total: 30.9 s
Wall time: 30.9 s

#check that there are some cancelled orders and that the new column is in the data
data[data['Cancel']== 1].head(3)

To better understand the frequency of cancelled orders, I calculated the probability of an order to be cancelled, and visualised how many orders were cancelled in the past year with a vertical bar graph.

#check how many cancelled VS non-cancelled orders there are. 
canceled = data['Cancel'].value_counts()

# rename the axis so that they show in the picture
canceled = canceled.rename({0: 'Non-canceled', 1: 'Canceled'})

# bar graph of canceled vs non-canceled orders 
canceled.plot.barh();

#calculate the probability of an order to be cancelled.
pco = (8905/(397924+8905))
pco

0.021888803403887137

During the past year, 8905 orders were cancelled, this are 2.2% of all orders placed.

2.3.2 Transaction Spending¶

To continue, I have decided to calculate the total spending of each transaction by multiplying “Quantity” by “UnitPrice”. I then attached the result to a new variable called “Tra_Spending”.

data['Tra_Spending']= (data['Quantity'])*(data['UnitPrice'])
data.head(3)

2.4 Grouping data by "CustomerID"¶

The next step is to group all observation by customerID. This will make it easier to explore customers characteristics. While doing so, I created a variable called "days_as_customer" by taking the difference in days between the date of a customer's last order, and the date of their first order.

2.4.1 Sum "Quantity" , "Tra_Spending" and "Cancel"¶

#group dataset by cystomerID to start creating customer profiles. 
#The following code groups the data by customerID and sum quantity, transaction spending, and number of cancelled orders
aggQ=data.groupby('CustomerID')['Quantity','Tra_Spending', 'Cancel'].sum()

aggQ.head(3)

2.4.2 Count "InvoiveNo"¶

#group by custmer and count unique invoiceNo to see how many oder the customer has placed in the time the data spans.
Inv = data.groupby('CustomerID')['InvoiceNo'].count()
Inv.head(3)

CustomerID
12346.0      2
12347.0    182
12348.0     31
Name: InvoiceNo, dtype: int64

2.4.3 Create "days_as_customer"¶

#### Create a varible that shows how much time a person has been a customer.

#group by customerID and select the mininum date (i.e. the date at which the first order was placed)
#as_index=false means that customer ID is a column and not the index
days= data.groupby('CustomerID', as_index=False)['InvoiceDate'].min()

#edit names of columns 
days.columns = ['CustomerID', 'Earliest_Invoice']

#create a column with the date of the latest order placed
days['Latest_Invoice'] = pd.to_datetime((data['InvoiceDate']).max())
days.set_index('CustomerID')
days.head(3)

#calculate how many days the person has been a customer by subtracting the date of the latest invoice
# to the date of the first invoice
days['days_as_customer'] = 1 + (days.Latest_Invoice-days.Earliest_Invoice).astype('timedelta64[D]')

#delete earliest invoice and laterst invoice columns
days.drop(['Earliest_Invoice', 'Latest_Invoice'], axis=1, inplace=True)
days.head()

2.4.4 Add "Country" of customer¶

#grouped data by customerID and pasted the value of 'Country' 
country = data.groupby('CustomerID')['Country'].first()
country.head(3)

CustomerID
12346.0    United Kingdom
12347.0           Iceland
12348.0           Finland
Name: Country, dtype: object

2.4.5 Average "UnitPrice"¶

#group by 'customerID' and average unitprice to see if the person buys expensive items or not
meanex = data.groupby('CustomerID')['UnitPrice'].mean()
meanex.head(3)

CustomerID
12346.0    1.040000
12347.0    2.644011
12348.0    5.764839
Name: UnitPrice, dtype: float64

2.4.6 Join data together to create new dataset¶

# create list of datafames to join that have the same index
to_join=[Inv, country, meanex]
# use join() to join aggQ with the three dataframes listed above
joined = aggQ.join(to_join)

joined = joined.join(days.set_index('CustomerID'))

# edit column names
joined.columns = ['total_quantity', 'total_spending',\
                    'total_cancelled_orders', 'total_orders_placed',\
                    'country_of_origin', 'average_price_of_items', 'days_as_customer']

#delete all negative values
joined = joined.drop(joined[(joined.total_quantity < 0)].index)
joined = joined.drop(joined[(joined.total_spending < 0)].index)

The new dataset contains the following variables:

CustomerID: customer unique identifier
total_quantity: total number of items purchased by customer
total_spending: total amount spent by the customer
total_cancelled_orders: total number of orders cancelled by the customer
total_orders_placed: total number of orders placed by the customer
country_of_origin: country of origin of the customer
average_price_of_items: average price of items purchased
days_as_customer: days between first and last orders placed by the customer

# look at first rows of data set
joined.head()

3.0 Data Visualization¶

I decided to explore the data visually before proceeding with analysis.

How many customers are in the dataset?¶

# look at the shape of the dataset joined. 
# this gives us the number of customers in the dataset because its rows represent a customer. 
joined.shape

(4328, 7)

Where are they from?¶

# return the top 5 countries from which customers are from by 
# counting the frequency of each country of origin and plotting a horizontal bar graph 
joined['country_of_origin'].value_counts()[:5].plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x121db0320>

In the period between 01/12/2010 and 09/12/2011 the website served 4,372 customers. Additionally, even though the website sells internationally, the majority of their customers placed orders from the United Kingdom. This could be due to high international shipping fees or poor marketing presence in foreign countries. International expansion could be an opportunity of profit for the company.

How many orders were placed?¶

#check how many unique order numbers are in the original dataset "data" 
data['InvoiceNo'].unique().shape

(22190,)

#check from how many countries orders are placed from 

# select country column unique values and return the shape of the table. 
joined['country_of_origin'].unique().shape

(37,)

Where are the orders placed from?¶

#code returns the first 5 coutries from which orders are placed from. 
data['Country'].value_counts()[:5].plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x121d9aeb8>

Similarly to the customer-level summaries, from these simple descriptive statistics we understand that:

22,190 orders were placed in the period between 01/12/2010 and 09/12/2011. This means that, on average, the website received 59.5 orders per day. This is a number to monitor and possibly improve.
Orders are placed from 37 counties, however the majority of orders, as expected, are placed from the UK. Again, growing its presence in foreign countries, could be an important opportunity for the business.

Which countries bring the most revenue?¶

%%time
# make a copy of the dataset
data_1 = data

# split date variable in "year", "month", "day", "hour", "minute", "week" to make it easier to plot
data_1['Date'] = data['InvoiceDate'].dt.date
data_1['Day'] = data['InvoiceDate'].dt.day
data_1['Month'] = data['InvoiceDate'].dt.month
data_1['Year'] = data['InvoiceDate'].dt.year
data_1['Hour'] = data['InvoiceDate'].dt.hour
data_1['Week'] = data['InvoiceDate'].dt.week
data_1['Minute'] = data['InvoiceDate'].dt.minute

#set up graph 
fig, ax = plt.subplots(2,1, figsize=(9, 9))

### create a dataframe "sales" which is grouped by date and shows the sum of all transaction spending in one month.

## subset data_1 to include year, month and transaction spending. 
## group the subset by year and month and sum transaction spending. 
sales = data_1[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()

#create a variable called 'Date' which contains year, month and day
#set day to be one to stand for the start of the month. 
sales['Day'] = 1
#convert 'Date' to pd datetime object
sales['Date'] = pd.to_datetime(sales[['Year', 'Month', 'Day']])

#set it as the index of the dataframe
sales = sales.set_index('Date')
#delete year, month and day columns
sales = sales.drop(['Day', 'Month', 'Year'], axis=1)

#create a plot of sales. 
ax[0].plot(sales)
#set the title to 'Monthly revenue'
ax[0].set_title('Monthly Revenue') 

### Create a dataframe "top_country"
##  subset data_1 to contain transaction spending and country
#   group by country and sum the transaction spending 
top_country = data_1[['Tra_Spending', 'Country']]\
                .groupby(['Country']).sum().reset_index().sort_values\
                (by='Tra_Spending', ascending=False)['Country'][1:5]

#crate a loop that itarates for each country in the dataframe top_country
for c in top_country:
    sales = data_1[data_1['Country'] == c]
    #subgroup sales and group by year and month. 
    sales = sales[['Year', 'Month', 'Tra_Spending']].groupby(['Year', 'Month']).sum().reset_index()
    #set day to 1 
    sales['Day'] = 1
    #create date with day, month and year
    sales['Date'] = pd.to_datetime(sales[['Day', 'Month', 'Year']])
    #set index of sales to be 'Date'
    sales = sales.set_index('Date')
    #delete columns year, month, and day
    sales = sales.drop(['Year', 'Month', 'Day'], axis=1)
    #plot
    ax[1].plot(sales.cumsum(), label=c)    
    #this line adds a legend to the graph
    ax[1].legend()
    #this line sets the title of the graph
    ax[1].set_title('Top countries by revenue')            
    
plt.show()

CPU times: user 2.31 s, sys: 161 ms, total: 2.47 s
Wall time: 2.19 s

The Monthly Revenue graph show an upward trend between December 2010 and November 2011. The graph then shows a drop in sales in December 2011 however this is given by the fact that the total revenue of the month of December 2011 includes only the revenue of the first 9 days of the month.

The second graph shows the revenue by countries. The graph shows that the country that brings the highest monthly revenue is the Netherlands, followed by the republic of Ireland, Germany, and France. This is interesting because even though the vast majority of orders are placed from the United Kingdom, the latter is missing from the list of top 4 countries that bring the most revenue.

4.0 Analysis

To explore the relationships betweeen the variables of the dataset I ran the sns.pairplot() function which plot pairwise relationships in a dataset.

By default, sns.pairplot() only plot numerical columns, however I have used the categorical variable "country_of_origin" to color code the plots. The fuction generates two basic figures, histograms and scatter plots. The histograms on the diagonal show us the distribution of a single variable, while the scatter plots show us the relationship between a pair of variables (Koehrsen, 2018).

In this case, the pairplot is showing us that the variable "days_as_customer" is the only containing varied values, and that the most visible positive relationship between a pair of variables is between "total_quantity" and "total_spending". In fact, as expected, as the quantity purchased increases, the total spending also increases. Moreover, the colors of the plot, show us that the majority of customers are from the UK.

#pairplot gives you a scattered plot of all pairs of variables. 
# hue = country of origin colors each observation differently depending 
  # on their country of origin 
sns.pairplot(joined, kind='scatter', hue= 'country_of_origin');

Since we are comparing variables measured with different units, for example, "total_quantity" is expressed in number of orders while "total_spending" is expressed in GBP, I have decided to scale the dataset, and run sns.pairplot() on the scaled dataset to see if any difference can be observed.

Scaling or standardising a dataset means subtracting the mean and dividing by the standard deviation each observation. This gives all variables the properties of a standard normal distribution with μ=0 and σ=1 (Raschka,2014).

# scale dataset

#select numerical variables to scale
num_cols= ['total_quantity', 'total_spending',\
           'total_cancelled_orders', 'total_orders_placed',\
           'average_price_of_items', 'days_as_customer']

#subset dataset to contain only numerical variables
joined_toscale = joined[num_cols]

#scale dataset
joined_scaled = pd.DataFrame(preprocessing.scale(joined_toscale),
                           index=joined_toscale.index,
                           columns=joined_toscale.columns)

#add country of origin variable for coloring of pairplot
joined_scaled["country_of_origin"]= joined["country_of_origin"]
joined_scaled.head()

# new pairplot
sns.pairplot(joined_scaled, kind='scatter', hue= 'country_of_origin');

No change can be noticed in the pairplots from scaling the data.

4.1 K-means Clustering Analysis¶

Since the aim of the project is to find segments of customers based on their shopping habits, k-means cluster analysis was chosen as a method of analysis.

Clustering is the process of grouping observations into classes so that objects within one class are highly similar and are highly dissimilar to objects in another class (Ye et. al, 2013). Cluster-based approaches have been widely used since the 1970s to perform market segmentation. Their popularity led to the development of noumerous clustering algorithms. Among them, the k-means algorithm is one of the most widely used in marketing research to divide objects into classes due to its ability of processing larger amounts of data (Chen et. al, 2004).

The k-means algorithm works as follows:

K number of clusters is determined
The algorithm randomly determines initial centroids of each cluster
Euclidean distance between remaining observations and cluster centroids is calculated.
Observations are assigned to the cluster with the closest centriod
Mean of each cluster is calculated
For each cluster, if the difference of the mean value and centroid surpasses a threshold, the centroid value is replaced with the cluster mean
Step 3, 4, 5 and 6 are repeated until no centroid value is replaced (Yogyakarta et. al, 2015).

Since observations are assigned to a cluster depending on their euclidean distance to a cluster centroid, categorcal values, such as "country_of_origin" cannot be processed with the k-means clustering algorithm. For this reason, I am creating a dataset that does not contain "country_of_origin" as a variable and on which I can perform k-means clustering. Moreove, as it is not ideal for the k-means algorithm to depend on an arbitrary variable unit, the data used for clustering will also be scaled following the same procedure explained above.

# create cluster data with that contains numeric value
cluster_data_scaled = joined_scaled.loc[:, joined_scaled.columns != 'country_of_origin']
cluster_data_scaled.head(3)

Scaling the data, makes sure that all variables have the same standard deviation.

# check the mean and standard deviation of all variables. 
cluster_data_scaled.describe().loc[['mean', 'std'], :]

#now all variables have the same standard deviation.

4.1.2 Determine the Value of K - The Elbow Method¶

First step of the k-means clustering algorithm is deciding the number of clusters we want the observations to be divided into. There are several methods that help in deciding which "k" values is most suited to the data.

One method that determines an appropriate "k" value, is the "Elbow Method". The elbow method works by running k-means clustering for a range of values of "k" (from 1 to 10), and for each value of "k", it calculates the sum of squared errors (SSE).

The method then plots SSEs for each value of "k". From the plot we want to select the smallest value of "k" that still has a small SSE (Gove, 2017).

# ELBOW METHOD # 

# Data to run in the loop 
X = cluster_data_scaled
# create empty list to store SSEs for different k means clustering
SSE = []

# for loop runs on range between 2 and 10
for k in range(2, 10):
    # set up k means for each k between 2 and 10 
    kmeans = cluster.KMeans(n_clusters=k)
    # fit k means in data
    kmeans.fit(X)
    # append SSE values into empty list 
    SSE.append(kmeans.inertia_)

# set up elbow graph 
fig = plt.figure(figsize=(7, 5))
# plot values
plt.plot(range(2, 10), SSE)
# add grid
plt.grid(True)
# add title
plt.title('Elbow curve')

Text(0.5,1,'Elbow curve')

In this case, I am selecting k = 6

4.1.3 Perform K-means clustering¶

# set up clustering analysis with k=6
k6 = cluster.KMeans(n_clusters=6)
k6

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=6, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

# fit clustering model on dataset created for clustering
np.random.seed(1234)
k6cl = k6.fit(cluster_data_scaled)

#check labels
labels=k6cl.labels_

labels

array([4, 4, 4, ..., 1, 4, 4], dtype=int32)

# count how many observations each cluster has. 
labels_tb = pd.DataFrame(joined.groupby(k6cl.labels_).size())

#rename column
labels_tb.rename(columns={0:'No Customers per Cluster'}, inplace=True)

labels_tb

#create table with mean values per cluster
table_of_means=joined.groupby(k6cl.labels_).mean()
#add average spend per order by dividind average total spending by average n of orders placed
table_of_means['average_spend_order']=\
                    table_of_means['total_spending']/table_of_means['total_orders_placed']

# look at mean of the clusters 

# .T transpose the table to make it easier to look at
table_of_means.T

4.1.4 Characterise the Clusters¶

K-means cluster analysis divided the 4,328 observations contained in the dataset into 6 clusters. From the labels table it is possible to observe that the majority of observations are contained in cluster 1 and 4, while the rest are mainly contained in cluster 0.

Looking at the mean of each cluster, I am dividing customers into three main segments.

"The Abituals": Cluster 4 customers. This is the largest cluster in size. Customers in this segment have been purchasing from the e-commerce for almost 1 year. On average, they have spent £1,572 and canceled orders rarely (they cancel on average 1.52 orders out of the 92 placed). Their average spend per order is £17.
"The New-comers": Cluster 1 customers. This is the second largest cluster in size. Customers in this segment have recently discovered the e-commerce. They cancel their orders rarely (0.5 canceled orders out of 45 placed) and have spent on average £660 in the last year , which amounts for £15 per order.
"The Big Spenders": Cluster 0 customers. This is the third largest cluster. Customers in this segment, buy in big quantities, place a lot of orders (more than one per day) and have been customers for 342 days on average. Their average spend per order is £30.

4.2 Regression Analysis¶

This section is dedicated to using multiple linear regression to predict the nubmer of orders placed by each customer from a variety of different predictors.

First, I created a baseline regression model. Second, I improved the the baseline model by scaling the predictive variables and by adding a categorical predictor. Third, I used the random forest algorithm to further improve the model. Lastly, I assess the peformance of each regression model with and without cross-validation.

Linear regression aims at finding linear relationships between two or more continuous variables. A multiple linear regression model can be summarised by the following equation:

**y = β0 + β1x1 + β2x2 + ··· βkxk + ε**

where:

y is the predicted variable
β0 is the y-intercept of the equation
β1, β2,..., βk are the coefficient of the predictive variables
x1, x2,...,xk are the predictive variables
ε is the random error factor

(Swaminathan, 2018)

# recall data for regression 
joined.head()

4.2.1 Baseline Model¶

My baseline model aims at predicting the number of total orders placed by a customer from:

their total numer of orders cancelled
the average price of products purchased
the total quantity of items purchased.

#regression model 
m1_baseline = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + total_quantity',\
            joined).fit()

m1_baseline.summary()

4.2.1.1 Model Explaination¶

From the summary of the baseline model I can infer that:

When total cancelled orders, average price of items, and total quantity is 0, total orders amount to 46.8.
Average price of items is negatively associated with number of orders. This means that, as the average price of items purchased increases, the number of orders placed decreases.
The model has a R squared value of 0.432. This means that the model explains 43.2% of the total variation in "number of orders". R squared is also used as a measure of performance of a regression and it will be used later to compare the baseline model to the other models created.

4.2.2 Scaled Baseline Model¶

The first improvement I want to implement to the baseline model, is to scale the predictive variables. This step is advisable becasue the three variables are expressed in three different units (i.e. number of orders, GBP and number of items). Scaling the variables makes sure that they are all centred around 0 and will have the same standard deviation.

# Scale the predicting variables because they are on different scales
# (the first one is in orders, and the second one is in pounds.)

# create a list that includes the columns to scale
cols = ['total_cancelled_orders', 'average_price_of_items', 'total_quantity']

#scale columns by subtracting mean and dividing by standard deviation
scaled_for_reg = pd.DataFrame(scale(joined[cols]),
                   index=joined.index,
                   columns=cols)

#add column with variable to predict 
scaled_for_reg['total_orders_placed'] = joined['total_orders_placed']

#this is the second model with the scaled columns.
m2_scaled = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + total_quantity',\
            scaled_for_reg).fit()

m2_scaled.summary()

4.2.2.1 Model Explaination¶

From the summary of the scaled, improved model I can infer that:

When total cancelled orders, average price of items, and total quantity is 0, total orders amount to 93.9.
Average price of items is still negatively associated with number of orders. In the scaled model, this association is slightly stronger, in fact, for a positive change of average price of items, the number of orders decreases by 1.78 orders as opposed to 0.01.
The model's R squared has not changed.

4.2.3 Predictive Checking¶

Since the aim of this regression model is to predict future number of orders places by a potential customer, it is important to look at the distribution of actual "total_orders_placed" VS the distribution of "total_orders_placed" predicted by the baseline model, and by the scaled baseline model.

To do so, I am using Seaborn's kdeplot() fuction which gives a smooth estimate of the distribution using kernel density estimation (VanderPlas, 2016). Using kdeplot() I plotted the distribution of actual values, the distribution of the values predicted by the baseline model, and the distribution of the values predicted by the scaled model. Finally, I have plotted the three curves in the same graph to assess if the regression models are "picking up" the overall shape of the data.

4.2.3.1 Distribution of Actual Values¶

#Distribution of actual number of oders
sns.kdeplot(joined['total_orders_placed'], shade=True, color = 'black')

<matplotlib.axes._subplots.AxesSubplot at 0x121ae3cc0>

4.2.3.2 Distribution of Baseline Model Fitted Values¶

#distribution of predicted numbers of orders
sns.kdeplot(m1_baseline.fittedvalues, shade=True, color='red')

<matplotlib.axes._subplots.AxesSubplot at 0x1228806a0>

4.2.3.3 Distribution of Scaled Model Fitted Values¶

#distribution of predicted numbers of orders second prediction 
sns.kdeplot(m2_scaled.fittedvalues, shade=True, color='green')

<matplotlib.axes._subplots.AxesSubplot at 0x121ca3710>

4.2.3.4 Comparison of Distributions¶

#make put all in the same graph
f, ax = plt.subplots(1, figsize=(10, 6))
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax, label='actual', color= 'black')
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, label='scaled', color = 'orange')
sns.kdeplot(m1_baseline.fittedvalues, shade=True, ax=ax, label='baseline', color = 'red')
plt.show()

From the graph above, it is possible to understand that scaled and baseline model's fitted values (red and orange graphs) have the same distribution and that they one captures the overall shape of the actual values (black graph), however the values are overestimated by the regression models.

To get a better distribution match, I am going to further improve the model by adding a categorical variable as a predictor in the new model.

4.3 Adding a Categorial Variable¶

Categorical variables can be added to regression models as dummy variables. Python's sm.ols() automatically recognise string variables and treats them as categorical.

For this project, I decided to create a variable "continent_of_origin" and use it as a predictor in the regression model. Since the majority of customers are based in europe, I have decided to sub-divide european countries according to european area (i.e. north, south, east and west europe). The classification of the countries was made according to the United Nation's official geographic regions composition (UN, n.d.).

# add "country_of_origin" variable to scaled dataset
scaled_for_reg['country_of_origin'] = joined['country_of_origin']

#create lists of continents. 
north_europe=['Iceland','Norway','Denmark','Sweden','Finland','Lithuania',\
             'EIRE','Channel Islands','United Kingdom']
east_europe=['Poland','Czech Republic']
south_europe=['Greece','Italy', 'Malta','Spain','Portugal',]
west_europe=['Switzerland', 'Austria','Belgium', 'Netherlands']
asia=['Cyprus','Bahrain','Israel','Saudi Arabia','United Arab Emirates','Singapore',\
      'Japan','Lebanon']
australia=['Australia']
africa=['RSA']
america=['USA','Brazil','Canada']

# Code country of origin by continent 

# Create a function that will assign "unspecified"to all variables unless
# The content of the row is in the list declared above. 

# Declare function
def continent(row):
    #assing all rows to unspecified 
    value = "unspecified"
    # but, if country is contained in the list north_europe, 
    if row['country_of_origin'] in north_europe:
        # give value europe and so on... 
        value = 'north europe'
    if row['country_of_origin'] in south_europe: 
        value = 'south europe'
    if row['country_of_origin'] in east_europe:
        value = 'east europe'
    if row['country_of_origin'] in west_europe:
        value = 'west europe'
    if row['country_of_origin'] in asia:
        value ='asia'
    if row['country_of_origin'] in australia:
        value='australia'
    if row['country_of_origin'] in africa:
        value = 'africa'
    if row['country_of_origin'] in america:
        value= 'america'
    #at the end, return value
    return value

# Create a new column 'continent_of_origin' and apply the function. 
scaled_for_reg['continent_of_origin'] = scaled_for_reg.apply(continent, axis=1)

scaled_for_reg.head()

# create new model
m3_improved = sm.ols('total_orders_placed ~ total_cancelled_orders\
            + average_price_of_items + continent_of_origin + total_quantity', scaled_for_reg ).fit()
# look at model summary
m3_improved.summary()

4.3.1 Model Explaination¶

From the summary of the model containing country of origin as a predictor, I can infer that:

The R squared value increases to 0.435
Coefficients suggest that being from the American continent, decreases the most the number of orders placed. While being from north europe decreases it the least, suggesting that the latter, brings the most customers (as observed above).

4.3.2 Predictive Checking¶

# graph baseline, scaled and categorical model
f, ax = plt.subplots(1, figsize=(15, 5))
# first plot 
sns.kdeplot(joined['total_orders_placed'], shade=True, ax=ax,\
            label='actual', color = 'black')
# second plot 
sns.kdeplot(m2_scaled.fittedvalues, shade=True, ax=ax, \
            label='scaled', color= 'orange')
# third plot 
sns.kdeplot(m3_improved.fittedvalues, shade=True, ax=ax,\
            label='with categorical', color='green')

plt.show()

The latest model, does not seem to improve the prediction, as the distribution curve has remained unchanged from the one of the scaled model.

Next, I will explore the models' performance, to have a better idea of their predictive power.

4.4 Model performance¶

To assess the model performance, I will compare the R squared, Mean Square Error (MSE), and Mean Absolute Error (MAE) of the baseline model with the measures of the latest model.

The MAE measures the average error in absolute terms of a set of forecasts. The MSE measures the average of the squares of the error in a set of forecasts. Along with R squared, MAE and MSE are generally used as a measure of model performance.

#R^2
r2 = pd.Series({'Baseline Model': metrics.r2_score(joined['total_orders_placed'],
                                              m1_baseline.fittedvalues),
                'Latest Model': metrics.r2_score(scaled_for_reg['total_orders_placed'],
                                              m3_improved.fittedvalues)})
r2

Baseline Model    0.432277
Latest Model      0.435298
dtype: float64

# MSE
mse = pd.Series({'Baseline Model': metrics.mean_squared_error(joined['total_orders_placed'],
                                                        m1_baseline.fittedvalues),
                 'Latest Model': metrics.mean_squared_error(scaled_for_reg['total_orders_placed'],
                                                         m3_improved.fittedvalues)})

# MAE
# taking the MAE between actual values and the values forecasted by the baseline model
mae = pd.Series({'Baseline Model': metrics.mean_absolute_error(joined['total_orders_placed'],
                                                         m1_baseline.fittedvalues),
                 #taking the MAE between actual values and the values forecasted by m3 model
                 'Latest Model': metrics.mean_absolute_error(scaled_for_reg['total_orders_placed'],
                                                          m3_improved.fittedvalues)})

perf = pd.DataFrame({'MAE': mae,
                     'MSE': mse,
                     'R^2': r2})
perf

From the comparison of MAE, MSE and R squared, It is important to note that R squared and MSE slightly improved from the baseline model and latest model, while MAE resulted higher in the latest model.

Since the aim of the regression model is to make predictions, it is appropriate to assess the validity of the model with cross-validation. Cross-validation is a validation technique for assessing how a model will perfom on a different and independent data set. The first step to cross-validation is to split the data set into train set and test set. The forecasting model is then trained on the train set and consequently validated on the test set. (Seni and Elder, 2010).

4.4.1 Cross-Validation¶

4.4.2 Split the Data into Train and Test set¶

# Split dataset into train (80%) and test set (20%)
# names of predictive variables
cols = ['total_cancelled_orders', 'average_price_of_items',\
            'total_quantity'] 

# create test and train sets 
x_train, x_test, y_train, y_test =\
                    model_selection.train_test_split(scaled_for_reg[cols], 
                                                    joined['total_orders_placed'],
                                                    test_size=0.8)

4.4.3 Train Model on Train Set¶

# Train data on train set
# f is a variable that includes the regression equation
f = 'total_orders_placed ~ total_cancelled_orders + average_price_of_items + total_quantity'

# this is the regression model
m1_train = sm.ols(f, x_train.assign(total_orders_placed=y_train)).fit()

4.4.4 Compare Parameters and R Squared¶

# create data frame containin the parameters of full set and of train set
pd.DataFrame({'Full Dataset': m1_baseline.params,
              'Train Set': m1_train.params})

# explore r quared of full data set and of train set 
pd.Series({'Full Dataset': m1_baseline.rsquared,
           'Train Set': m1_train.rsquared})

Full Dataset    0.432277
Train Set       0.440381
dtype: float64

4.4.5 Predict Test Set Values and Explore R Squared¶

# make predictions on test set. 
test_prediction = m1_train.params['Intercept'] + \
            x_test.dot(m1_train.params.drop('Intercept'))

#create a series with the r squared of baseline, train set data and test set data
pd.Series({'0-Full Dataset': m1_baseline.rsquared,
           '1-Train Set': m1_train.rsquared,
           '2-Test Set': metrics.r2_score(y_test,
                                        test_prediction)})

0-Full Dataset    0.432277
1-Train Set       0.440381
2-Test Set       -0.378059
dtype: float64

From the comparisons of R Squared values of the full data, train data set and test data, it is possible to note that R squared increases from the full data to the train data, but it decreases, and becomes negative, for the test data.

This is an example of "overfitting". Overfitting happens when a model describes the random error in the data rather than the relationships between variables (Frost, 2017). This means that the model created is a good fit to make preditions in the specific dataset used for analysis (R squared is 0.43), however it has no forecasting power over any other independent data set (R squared is -0.37).

4.5 Improve the model: Random forest¶

The Random Forest is a machine learning algorithm that is used on classification or regression problems. It is one of the most used algorithms because of its semplicity and effectiveness.

# run random forest regression. we can fit and predict in the same line
m4_RandomF = RandomForestRegressor().fit(joined[cols], joined['total_orders_placed'])\
                            .predict(joined[cols])

4.5.1 Assess Performance¶

# examine MAE MSE and R suqared of FR 
rf = pd.Series({'R^2': metrics.r2_score(joined['total_orders_placed'], m4_RandomF), 
                'MSE': metrics.mean_squared_error(joined['total_orders_placed'], m4_RandomF),
                'MAE': metrics.mean_absolute_error(joined['total_orders_placed'], m4_RandomF)})

# add RF performance indicators to the table of baseline and m3 model
pd.concat([perf, 
           pd.DataFrame({'Random Forest': rf}).T])

From the table above, it is important to note that the Random Forest regressor performs considerably better than the baseline model and the latest model with scaled variables and a categorical predictor.

Its MAE went from 68 to 22, MSE went from 30,000 to 6,000 and finally, its R squared value is 0.90, meaning that the model picks up almost 90% of the variation of the predicted variable.

4.5.2 Random Forest Cross Validated¶

Just as I did for the previous model, I am going to assess the performace of the Random Forest model with cross-validation to see if this model can be used to make predictions.

# previously the model was fitted on the same dataset where the predictions were made.
# in this case, we are making the predictions on the text set

# random forest cross validation. 
m4_cv = RandomForestRegressor().fit(x_train, y_train).predict(x_test)

# get MSE, MAE e R2 for cross validated RF
rf_cv = pd.Series({'R^2': metrics.r2_score(y_test, m4_cv), 
                'MSE': metrics.mean_squared_error(y_test, m4_cv),
                'MAE': metrics.mean_absolute_error(y_test, m4_cv)})
pd.concat([perf, 
           pd.DataFrame({'RF': rf}).T,
           pd.DataFrame({'RF-CV': rf_cv}).T])

As expected, the cross-validated Random Forest regressor performance are much worst compared to the non-cross-validated performance measures. This once again, is a sign of overfitting and suggests that the regression model has little predictive power over an independent dataset.

5.0 Conclusion and Reccomendations

In conlusion, this project was carried out using a dataset containing 541,909 data points corresponding to e-commerce transactions of an online UK retailer. The aim was to segment customers and to create a regression model that could forecast the number of orders placed by each customer.

To carry out the project, I first pre-processed the data by inspecting it, transforming existing variables, creating new variables, dealing with missing values, grouping the data at customer level and finally performing descriptive statistics and data visualization. The second step was then to segment customers into categories, to do so, I have used k-means cluster analysis. From this, I was able to divide customers into three main classes:

The Abituals
The New-comers
The Big Spenders

After performing clustering analysis, I have modeled a multiple linear regression with the aim of forecasting the number of orders placed by each customers. I first create a baseline model, which was then improved by scaling the predictive variables and by adding a categorical variable as predictor. Model performance was assessed with and without cross-validation and this showed strong signs of overfitting. To overcome this problem and to improve the predictive power of the model, I have used the Random Forest algorithm and assessed its performance with and without cross-validation. Once again, the performance indicators suggested that the regression model has little predictive power over an independent dataset.

To solve this problem, for future research it is advisable to join the current dataset with information about the demographics and shopping habits of each customer. This will add valuable information that can be used by the clustering algorithm to further segment customers and by the regression model to spot meaningful relationships within the variables in the data.

Bibliography.¶

Archive.ics.uci.edu. (2015). UCI Machine Learning Repository: Online Retail Data Set. [online] Available at: http://archive.ics.uci.edu/ml/datasets/online+retail# [Accessed 14 May 2018].

Chen, J., Ching, R. and Lin, Y. (2004). An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society, 55(9), pp.976-987.

Frost, J. (2017). Overfitting Regression Models: Problems, Detection, and Avoidance - Statistics By Jim. [online] Statistics By Jim. Available at: http://statisticsbyjim.com/regression/overfitting-regression-models/ [Accessed 14 May 2018].

Gove, A. (2017). Using the elbow method to determine the optimal number of clusters for k-means clustering. [online] Bl.ocks.org. Available at: https://bl.ocks.org/rpgove/0060ff3b656618e9136b [Accessed 14 May 2018].

Kaggle.com. (2018). E-Commerce Data | Kaggle. [online] Available at: https://www.kaggle.com/carrie1/ecommerce-data [Accessed 14 May 2018]. Trochim 2006 (https://www.socialresearchmethods.net/kb/statdesc.htm)

Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166 [Accessed 14 May 2018].

Raschka, S. (2014). About Feature Scaling and Normalization. [online] Dr. Sebastian Raschka. Available at: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html [Accessed 14 May 2018].

Seni, G. and Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), pp.1-126.

Swaminathan, S. (2018). Linear Regression — Detailed View – Towards Data Science. [online] Towards Data Science. Available at: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86 [Accessed 14 May 2018].

UN (2018). UNSD — Methodology. [online] Unstats.un.org. Available at: https://unstats.un.org/unsd/methodology/m49/ [Accessed 14 May 2018].

VanderPlas, J. (2016). Python data science handbook. 1st ed.

Ye, L., Qiuru, C., Haixu, X., Yijun, L. and Guangping, Z. (2013). Customer Segmentation for Telecom with the k-means Clustering Method. Information Technology Journal, 12(3), pp.409-413.

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	12/1/2010 8:26	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	12/1/2010 8:26	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
5	536365	22752	SET 7 BABUSHKA NESTING BOXES	2	12/1/2010 8:26	7.65	17850.0	United Kingdom
6	536365	21730	GLASS STAR FROSTED T-LIGHT HOLDER	6	12/1/2010 8:26	4.25	17850.0	United Kingdom
7	536366	22633	HAND WARMER UNION JACK	6	12/1/2010 8:28	1.85	17850.0	United Kingdom
8	536366	22632	HAND WARMER RED POLKA DOT	6	12/1/2010 8:28	1.85	17850.0	United Kingdom
9	536367	84879	ASSORTED COLOUR BIRD ORNAMENT	32	12/1/2010 8:34	1.69	13047.0	United Kingdom

	Quantity	UnitPrice	CustomerID
count	541909.000000	541909.000000	406829.000000
mean	9.552250	4.611114	15287.690570
std	218.081158	96.759853	1713.600303
min	-80995.000000	-11062.060000	12346.000000
25%	1.000000	1.250000	13953.000000
50%	3.000000	2.080000	15152.000000
75%	10.000000	4.130000	16791.000000
max	80995.000000	38970.000000	18287.000000

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	Cancel
141	C536379	D	Discount	-1	2010-12-01 09:41:00	27.50	14527.0	United Kingdom	1
154	C536383	35004C	SET OF 3 COLOURED FLYING DUCKS	-1	2010-12-01 09:49:00	4.65	15311.0	United Kingdom	1
235	C536391	22556	PLASTERS IN TIN CIRCUS PARADE	-12	2010-12-01 10:24:00	1.65	17548.0	United Kingdom	1

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	Tra_Spending
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom	15.30
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom	20.34
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom	22.00

	CustomerID	days_as_customer
0	12346.0	326.0
1	12347.0	367.0
2	12348.0	358.0
3	12349.0	19.0
4	12350.0	310.0

	Count
CustomerID	135080
Description	1454
InvoiceNo	0
StockCode	0
Quantity	0
InvoiceDate	0
UnitPrice	0
Country	0

	Quantity	Tra_Spending	Cancel
CustomerID
12346.0	0	0.00	1
12347.0	2458	4310.00	0
12348.0	2341	1797.24	0

	CustomerID	Earliest_Invoice	Latest_Invoice
0	12346.0	2011-01-18 10:01:00	2011-12-09 12:50:00
1	12347.0	2010-12-07 14:57:00	2011-12-09 12:50:00
2	12348.0	2010-12-16 19:09:00	2011-12-09 12:50:00

	total_quantity	total_spending	total_cancelled_orders	total_orders_placed	average_price_of_items	days_as_customer	country_of_origin
CustomerID
12346.0	-0.241593	-0.232636	-0.138209	-0.393813	-0.039807	0.852451	United Kingdom
12347.0	0.282003	0.289361	-0.274125	0.377212	-0.027119	1.198879	Iceland
12348.0	0.257080	-0.014967	-0.274125	-0.269592	-0.002433	1.122834	Finland
12349.0	-0.107179	-0.019774	-0.274125	-0.089687	0.017533	-1.741534	Italy
12350.0	-0.199628	-0.192136	-0.274125	-0.329561	-0.017650	0.717259	Norway

	total_quantity	total_spending	total_cancelled_orders	total_orders_placed	average_price_of_items	days_as_customer
mean	-6.566938e-18	2.462602e-17	-3.940163e-17	-4.104337e-18	8.208673e-19	-3.858076e-17
std	1.000116e+00	1.000116e+00	1.000116e+00	1.000116e+00	1.000116e+00	1.000116e+00

	0	1	2	3	4	5
total_quantity	8127.071429	426.318654	35328.571429	92827.4000	930.029528	60.00
total_spending	13288.101143	660.322741	59766.425714	192103.8540	1572.361954	649.50
total_cancelled_orders	22.414286	0.544343	104.571429	20.6000	1.521260	2.00
total_orders_placed	452.542857	44.839755	4286.428571	1013.4000	92.447638	5.00
average_price_of_items	4.020325	4.623149	3.841534	6.3574	3.954898	8055.78
days_as_customer	342.285714	90.743731	334.714286	360.0000	304.595669	182.00
average_spend_order	29.363188	14.726279	13.943175	189.5637	17.008136	129.90

Dep. Variable:	total_orders_placed	R-squared:	0.432
Model:	OLS	Adj. R-squared:	0.432
Method:	Least Squares	F-statistic:	1097.
Date:	Mon, 14 May 2018	Prob (F-statistic):	0.00
Time:	19:34:04	Log-Likelihood:	-28517.
No. Observations:	4328	AIC:	5.704e+04
Df Residuals:	4324	BIC:	5.707e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	46.7812	2.804	16.686	0.000	41.285	52.278
total_cancelled_orders	16.1972	0.396	40.880	0.000	15.420	16.974
average_price_of_items	-0.0141	0.021	-0.669	0.504	-0.056	0.027
total_quantity	0.0129	0.001	20.695	0.000	0.012	0.014

Omnibus:	7904.201	Durbin-Watson:	1.983
Prob(Omnibus):	0.000	Jarque-Bera (JB):	24238783.825
Skew:	12.984	Prob(JB):	0.00
Kurtosis:	368.700	Cond. No.	5.06e+03

Omnibus:	7913.861	Durbin-Watson:	1.985
Prob(Omnibus):	0.000	Jarque-Bera (JB):	24157473.530
Skew:	13.023	Prob(JB):	0.00
Kurtosis:	368.078	Cond. No.	271.

	No Customers per Cluster
0	140
1	1635
2	7
3	5
4	2540
5	1

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	93.9378	2.675	35.116	0.000	88.693	99.182
total_cancelled_orders	119.1708	2.915	40.880	0.000	113.456	124.886
average_price_of_items	-1.7888	2.675	-0.669	0.504	-7.033	3.456
total_quantity	60.3286	2.915	20.695	0.000	54.613	66.044

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	100.9845	175.679	0.575	0.565	-243.437	445.406
continent_of_origin[T.america]	-209.7612	185.246	-1.132	0.258	-572.938	153.416
continent_of_origin[T.asia]	-43.6005	179.159	-0.243	0.808	-394.844	307.644
continent_of_origin[T.australia]	-160.7937	185.255	-0.868	0.385	-523.989	202.402
continent_of_origin[T.east europe]	-45.5345	187.809	-0.242	0.808	-413.737	322.668
continent_of_origin[T.north europe]	-4.7807	175.701	-0.027	0.978	-349.246	339.684
continent_of_origin[T.south europe]	-15.3724	177.005	-0.087	0.931	-362.392	331.648
continent_of_origin[T.unspecified]	-23.6991	176.153	-0.135	0.893	-369.050	321.652
continent_of_origin[T.west europe]	-22.3453	177.103	-0.126	0.900	-369.557	324.867
total_cancelled_orders	120.0132	2.923	41.060	0.000	114.283	125.744
average_price_of_items	-1.8155	2.670	-0.680	0.497	-7.051	3.420
total_quantity	60.6905	2.931	20.704	0.000	54.944	66.437

	Full Dataset	Train Set
Intercept	46.781173	87.001987
total_cancelled_orders	16.197225	77.201597
average_price_of_items	-0.014149	-184.700028
total_quantity	0.012851	121.809338

	MAE	MSE	R^2
Baseline Model	67.967766	30941.743636	0.432277
Latest Model	68.615029	30777.065205	0.435298