In the U.K. in 2017, the food-to-go (FTG) sector was worth up to £17.4bn, and it is forecast to grow to £23.5bn by 2022 as consumers' appetite for eating on the go shows no signs of slowing down (IGD, 2017). This market growth represents an opportunity for FTG retailers and suppliers: as the market grows, they have a chance to grow their market share and increase profits.
To grow in this dynamic environment, retailers must manage their products as effectively as possible. In recent years, researchers have recommended category management (CM) as an effective way of managing thousands of retail items. CM is a management technique that aims to improve a retailer's overall performance by dividing stock into small categories and treating them as independent entities (Musalem and Aburto, 2017). Categories are created by grouping highly correlated items together, which are then considered substitutes or complements of each other (Gooner et al., 2011). This report aims to identify relationships among ranges of food-to-go items of a major U.K. supermarket and to assess whether these can be grouped into sub-categories.
The analysis is carried out on a weekly sales report for the week starting on the 23rd and ending on the 29th of April 2018. The research questions that will be investigated are:
1. Do the price changes of different product ranges move together?
2. Do the sales volume changes of different product ranges move together?
3. Can the change in sales volume be predicted from the price change and the product range?
4. Is the sales volume change of standard sandwiches related to the price changes of the other ranges?
The report is structured as follows. The first section covers the analysis and discussion: it explains how the data was pre-processed prior to analysis, how it is summarised with descriptive statistics, and how it is analysed to answer the research questions. The last section addresses the limitations of the research and gives the conclusion.
# import packages (only those actually used in the analysis)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as sm
%matplotlib inline
#read data
data = pd.read_csv('./tescoenco.csv')
data.head(3)
# keep columns of interest
data_sub = data.iloc[:,[0,1,3,4,10,11,13]]
# rename variables
data_sub.columns = ['analysis_level','product_group', 'item_name', 'price_change',\
                    'sales_vol_change', 'sales_volume', 'sales_value']
# check for missing data
def count_missing(data):
    missing_count = pd.DataFrame(data.isnull().sum(),
                                 columns=['Count']).sort_values(by=['Count'],
                                                                ascending=False)
    return missing_count
# apply the function to the data
count_missing(data_sub)
# replace missing sales volume change and price change with 0%
cols = ['price_change','sales_vol_change']
for i in cols:
    data_sub[i] = data_sub[i].fillna('0%')
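As a quick sanity check (an addition to the original notebook), re-running the missing-value count on the two imputed columns should now report zero:

# verify the imputation: both columns should report a count of 0
count_missing(data_sub[cols])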
# Replace cells containing 'Product Groups' with 'NaN'
data_sub['analysis_level']=data_sub['analysis_level'].replace({'Product Groups':np.nan})
# all rows containing '-100.0%' indicate that the product was not sold during the week
# therefore, they are excluded from the analysis by:
# turning all '-100.0%' values into 'NaN'
# deleting all rows containing 'NaN'
data_sub['price_change']=data_sub['price_change'].replace({'-100.0%':np.nan})
data_sub['sales_vol_change']=data_sub['sales_vol_change'].replace({'-100.0%':np.nan})
# Drop all rows containing 'NaN' values
data_sub = data_sub.dropna()
# Reset the index of the dataset
data_sub = data_sub.reset_index(drop=True)
# The percentage and currency columns are stored as strings. They are cleaned
# with vectorised string operations; note that assigning to rows inside
# iterrows() does not write back to the dataframe, so loops are avoided here.
# delete the '%' sign from the 'price_change' column
data_sub['price_change'] = data_sub['price_change'].str.rstrip('%')
# delete the ',' and '%' signs from the 'sales_vol_change' column
data_sub['sales_vol_change'] = data_sub['sales_vol_change'].str.replace(',', '').str.rstrip('%')
# delete the ',' sign from the 'sales_volume' column
data_sub['sales_volume'] = data_sub['sales_volume'].str.replace(',', '')
# delete the ',' sign and the leading currency symbol from the 'sales_value' column
data_sub['sales_value'] = data_sub['sales_value'].str.replace(',', '').str[1:]
# delete first column of dataset
data_sub = data_sub.iloc[:,1:]
# make 'price_change','sales_vol_change' and 'sales_volume' columns of type 'numeric'
cols = ['price_change','sales_vol_change', 'sales_volume', 'sales_value']
data_sub[cols] =data_sub[cols].apply(pd.to_numeric, errors='coerce')
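A quick check (added here) confirms that the four columns are now numeric and that no values were coerced to NaN in the conversion:

# confirm the dtypes and count any values lost to coercion
print(data_sub[cols].dtypes)
print(data_sub[cols].isnull().sum())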
# delete code from product group
data_sub['product_group'] =data_sub['product_group'].astype(str).str[:-6]
# show pre-processed dataset, ready for analysis
data_sub.head()
# create a copy of the processed dataset
analysis_ready = data_sub.copy()
items = analysis_ready['item_name'].unique().shape
print('There are '+str(items[0]) + ' items in the dataset')
cate = analysis_ready['product_group'].unique().shape
print('There are '+str(cate[0]) + ' categories of items in the dataset')
# create dataframe with the value counts
lis = pd.DataFrame(analysis_ready['product_group'].value_counts())
lis=lis.reset_index(drop = False)
# make a bar graph of the ten largest ranges by number of items
pic = analysis_ready['product_group'].value_counts()[:10].plot(kind='barh',
title= str(lis.iloc[0,0]).lower()+' is the top stocked range of the week')
pic.set_xlabel("Number of Items")
# total units sold per product group, top ten
top_units_category = data_sub[['sales_volume','product_group']]\
    .groupby(['product_group']).sum().reset_index().sort_values\
    (by='sales_volume', ascending=False)[:10]
plt.barh(top_units_category['product_group'], top_units_category['sales_volume'])
plt.xlabel('Units Sold')
plt.title(str(top_units_category.iloc[0,0]).lower()+' is the Best Selling Range of the Week')
plt.show()
# total revenue per product group, top ten
top_category = data_sub[['sales_value','product_group']]\
    .groupby(['product_group']).sum().reset_index().sort_values\
    (by='sales_value', ascending=False)[:10]
plt.barh(top_category['product_group'], top_category['sales_value'])
plt.xlabel('Revenue (£)')
plt.title(str(top_category.iloc[0,0]).lower()+' Generates the Most Revenue of the Week')
plt.show()
# create a dataframe grouped by product group by summing sales_value
top_category = data_sub[['sales_value','product_group']]\
    .groupby(['product_group']).sum().reset_index()
# standard sandwiches sales
SS = top_category.loc[top_category['product_group']=='STANDARD SANDWICHES -', 'sales_value'].iloc[0]
# total sales
total = top_category['sales_value'].sum()
# standard sandwiches as a percentage of total sales
print('Standard Sandwiches account for {:.2f}% of total sales'.format(SS / total * 100))
# total units sold per item, top ten
top_item_units = analysis_ready[['sales_volume','item_name']]\
    .groupby(['item_name']).sum().reset_index().sort_values\
    (by='sales_volume', ascending=False)[0:10]
pic1= top_item_units.set_index('item_name').plot(kind='barh',
title = str(top_item_units.iloc[0,0]).lower()+' is the Best Selling Item of the Week')
pic1.set_xlabel("Units Sold")
pic1.set_ylabel(" ")
# total revenue per item, top seven
top_item_revenue = analysis_ready[['sales_value','item_name']]\
    .groupby(['item_name']).sum().reset_index().sort_values\
    (by='sales_value', ascending=False)[0:7]
pic2 = top_item_revenue.set_index('item_name').plot(kind='barh',
    title = str(top_item_revenue.iloc[0,0]).lower()+' generates the most revenue of the week')
pic2.set_xlabel("Revenue (£)")
pic2.set_ylabel(" ")
To answer this question, I compute a correlation matrix of 'price_change' between product ranges.
# Subset dataset to contain 'product_group' and 'price_change'. Name the subset 'price'
price = analysis_ready.iloc[:,[0,2]]
# Create a dataframe with range as columns and the price change as rows.
#create empty list 'lis'.
lis = []
# for each unique product range
for i in price['product_group'].unique():
    # create a series containing the price changes of that range
    x = price.loc[price['product_group'] == str(i), 'price_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list 'lis'
    lis.append(x)
# concatenate the series in 'lis' column-wise to form a new dataframe: 'total_price'
total_price = pd.concat(lis, axis=1)
# rename columns of 'total_price'
cols = list(price['product_group'].unique())
total_price.columns = cols
# calculate correlation between price change
cor_price = total_price.corr(method = "pearson")
#show correlation matrix
cor_price
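Since the report aims to assess whether ranges can be grouped into sub-categories, one possible follow-up (a sketch added here, not part of the original analysis) is to feed the correlation matrix above into hierarchical clustering, treating 1 - correlation as a distance so that ranges whose prices co-move land in the same cluster. Pairs with no overlapping observations (NaN correlations) are treated as zero correlation, which is an assumption, as is the 0.5 cut-off.

# hierarchical clustering of ranges based on price-change correlation (sketch)
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

d = 1 - cor_price.fillna(0).values        # dissimilarity: small for co-moving ranges
np.fill_diagonal(d, 0)                    # self-distance must be exactly zero
condensed = squareform(d, checks=False)   # condensed distance vector for linkage
Z = linkage(condensed, method='average')  # average-linkage clustering
# cut the tree at a distance of 0.5 (i.e. average correlation above 0.5)
clusters = pd.Series(fcluster(Z, t=0.5, criterion='distance'), index=cor_price.columns)
clusters.sort_values()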
# Get the top 5 positive and top 5 negative correlations by:
# creating a function that collects the redundant pairs of the correlation matrix
def get_redundant_pairs(df):
    # create an empty set for redundant pairs
    pairs_to_drop = set()
    # use the same columns as the dataframe on which the function is run
    cols = df.columns
    # for each column index i
    for i in range(0, df.shape[1]):
        # for each column index j up to and including i
        for j in range(0, i+1):
            # add the pair (i, j) to the set; this covers self-pairs and the
            # lower triangle, which duplicate the upper triangle
            pairs_to_drop.add((cols[i], cols[j]))
    # return the set of pairs to drop
    return pairs_to_drop
# Create a function that gives the n top positive correlations in a dataframe 'df'
def get_top_pos_correlations(df, n=5):
    # calculate the correlation matrix of 'df' and unstack it into one column
    au_corr = df.corr(method='pearson').unstack()
    # use 'get_redundant_pairs' to find the redundant pairs of the dataframe
    labels_to_drop = get_redundant_pairs(df)
    # drop redundant pairs and sort the remaining pairs in descending order
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    # return the first 'n' sorted values
    return au_corr[0:n]
# Create a function that gives the n top negative correlations in a dataframe 'df';
# it is the same as 'get_top_pos_correlations' except that values are sorted in
# ascending order.
def get_top_neg_correlations(df, n=5):
    au_corr = df.corr(method='pearson').unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=True)
    return au_corr[0:n]
# Print results
print("Top positive correlations (price change)")
print(get_top_pos_correlations(total_price, 5))
print(' ')
print("Top negative correlations (price change)")
print(get_top_neg_correlations(total_price, 5))
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(cor_price, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Map of Price Changes")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(cor_price, mask=mask, cmap=cmap, vmax=.7, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .6})
# generate scatterplots for each pair of price_change
sns.pairplot(total_price, kind='scatter', diag_kind="kde", dropna = False);
To answer this question, I compute a correlation matrix of 'sales_vol_change' between product ranges.
analysis_ready.head(2)
# subset dataset to contain 'product_group' and 'sales_vol_change'
volume = analysis_ready.iloc[:,[0,3]]
#create a dataframe with range as columns and the sales volume change as rows
# create empty list 'lis'
lis = []
# for each product range
for i in volume['product_group'].unique():
    # create a series containing the sales volume changes of that range
    x = volume.loc[volume['product_group'] == str(i), 'sales_vol_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list
    lis.append(x)
# concatenate the series column-wise to create the dataframe 'total_vol'
total_vol = pd.concat(lis, axis=1)
# rename columns of 'total_vol'
cols = list(volume['product_group'].unique())
total_vol.columns = cols
#correlation matrix
cor_vol = total_vol.corr(method= 'pearson')
#show correlation matrix
cor_vol
# Get the top 5 positive and top 5 negative correlations, reusing the helper
# functions 'get_redundant_pairs', 'get_top_pos_correlations' and
# 'get_top_neg_correlations' defined above
print("Top positive correlations (sales volume change)")
print(get_top_pos_correlations(total_vol, 5))
print (' ')
print("Top negative correlations (sales volume change)")
print(get_top_neg_correlations(total_vol, 5))
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(cor_vol, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up picture
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Map of Sales Changes")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(cor_vol, mask=mask, cmap=cmap, vmax=.8, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
# scatter plots of sales volume change
sns.pairplot(total_vol, kind='scatter', diag_kind="kde", dropna=False);
To answer this question, I fitted a multiple regression model whose dependent variable is 'sales_vol_change' and whose independent variables are 'price_change', 'product_group' and the interaction between the two.
# fit the regression model with the sm.ols formula interface
m1 = sm.ols(formula = 'sales_vol_change ~ price_change*product_group',\
data = analysis_ready).fit()
# check results of regression
m1.summary()
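The interaction model produces one coefficient for every product range, so the summary table is long. A small filter (an added convenience, assuming the conventional 5% significance level) pulls out only the statistically significant terms:

# show only the coefficients whose p-value is below 0.05
significant = m1.pvalues[m1.pvalues < 0.05]
print(m1.params[significant.index])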
To do so, I have: subset the sales volume changes of the standard sandwiches range; built a dataframe containing the price changes of every other range; computed the correlation between the two; and fitted a simple regression of sandwich sales volume change on salad price change.
analysis_ready.head()
# Subset data to only include sales volume change for items in the standard sandwiches range
sales_volume_change = analysis_ready.loc[analysis_ready['product_group'] == 'STANDARD SANDWICHES -',\
    'sales_vol_change']
sales_volume_change = sales_volume_change.reset_index(drop=True)
# create a list containing all product ranges except standard sandwiches
prices = list(analysis_ready['product_group'].unique())
# remove standard sandwiches from the list
prices.remove('STANDARD SANDWICHES -')
# create the dataframe on which the correlation will be calculated
# create an empty list w
w = []
# for each remaining range
for i in prices:
    # create a series containing the price changes of that range
    x = analysis_ready.loc[analysis_ready['product_group'] == str(i), 'price_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list w
    w.append(x)
# concatenate the series together
total_w = pd.concat(w, axis=1)
# edit list to rename columns
string = '_price_change'
prices_edited = [x[:-4] + string for x in prices]
# rename columns of dataframe
total_w.columns = prices_edited
# show dataframe
total_w.head(10)
# add sales volume change data of range A to the dataframe
total_R = pd.concat([sales_volume_change, total_w], axis=1)
# correlation matrix showing how the sales volume change of standard sandwiches
# correlates with the price change of every other range
corr = pd.DataFrame(total_R.corr(method='pearson').iloc[:,0])
# show the correlation coefficients
corr
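The single-column matrix is easier to read when ranked (an added step); the first row is dropped because it is the correlation of sandwich sales change with itself:

# rank the price-change correlations against sandwich sales volume change
corr.iloc[1:, 0].sort_values(ascending=False)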
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up picture
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Between Standard Sandwiches Sales Volume Change and Price Change of all Other Ranges")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.35, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
The regression model shows that the price change of salads is statistically significant in predicting the sales change of standard sandwiches:
Intercept = -6.54 (when the price change of salads is 0, the sales change of standard sandwiches is -6.54)
Price change of salads coefficient = 1.95 (a one-point change in the price of salads is associated with a 1.95-point change in the sales of sandwiches)
# fit the simple linear regression model
m2 = sm.ols(formula = 'sales_vol_change ~ SALAD_price_change',\
data = total_R).fit()
m2.summary()
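As a worked example of reading these coefficients (using the estimates reported above and a hypothetical 2% rise in salad prices):

# predicted sandwich sales change = intercept + slope * salad price change
intercept, slope = -6.54, 1.95
salad_price_change = 2.0   # hypothetical: salads become 2% more expensive
print(intercept + slope * salad_price_change)   # -6.54 + 3.90 = -2.64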
To conclude, this report attempted to identify relationships among ranges of food-to-go products. The analysis was divided into three steps: data cleaning and pre-processing; descriptive statistics, where standard sandwiches was identified as the range generating the largest share of revenue (43.55%); and correlation analysis. The latter showed that certain ranges experience co-movements in prices and sales, and that the sales of standard sandwiches are related to the change in price of salads.