In the U.K. in 2017, the food-to-go (FTG) sector was worth up to £17.4bn, and it is forecast to grow to £23.5bn by 2022 as consumers' appetite for eating on the go shows no signs of slowing down (IGD, 2017). This market growth represents an opportunity for FTG retailers and suppliers: as the market grows, they have a chance to grow their market share and increase profits.
To grow in this dynamic environment, retailers must manage their products as effectively as possible. In recent years, researchers have recommended category management (CM) as an effective way of managing thousands of retail items. CM is a management technique that aims to improve a retailer's overall performance by dividing stock into small categories and treating them as independent entities (Musalem and Aburto, 2017). Categories are created by grouping highly correlated items together, which are then considered substitutes or complements of each other (Gooner et al., 2011). This report aims to identify relationships among ranges of food-to-go items of a major U.K. supermarket and to assess whether these can be grouped into sub-categories.
The analysis is carried out on a weekly sales report for the week starting on the 23rd and ending on the 29th of April 2018. The research questions that will be investigated are:
1. Do the price changes of different product ranges move together?
2. Do the sales volume changes of different product ranges move together?
3. Can the change in sales volume be predicted from the price change and the product range?
4. Is the sales volume change of standard sandwiches related to the price changes of the other ranges?
The report is structured as follows. The first section covers the analysis and discussion: it explains how the data was pre-processed prior to analysis, how it is summarised with descriptive statistics, and how it is analysed to answer the research questions. The last section addresses the limitations of the research and gives the conclusion.
# import packages (only those actually used in the analysis)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as sm
%matplotlib inline
#read data
data = pd.read_csv('./tescoenco.csv')
data.head(3)
# keep columns of interest
data_sub = data.iloc[:,[0,1,3,4,10,11,13]]
# rename variables
data_sub.columns = ['analysis_level','product_group', 'item_name', 'price_change',\
                    'sales_vol_change', 'sales_volume', 'sales_value']
# check for missing data
def count_missing(data):
    missing_count = pd.DataFrame(data.isnull().sum(),
                                 columns=['Count']).sort_values(by=['Count'],
                                                                ascending=False)
    return missing_count
# apply the function to the data
count_missing(data_sub)
# replace missing sales volume change and price change with 0%
cols = ['price_change','sales_vol_change']
for i in cols:
    data_sub[i] = data_sub[i].fillna('0%')
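As a quick sanity check (an addition to the original notebook), re-running the missing-value count on the two imputed columns should now report zero:

# verify the imputation: both columns should report a count of 0
count_missing(data_sub[cols])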
# Replace cells containing 'Product Groups' with 'NaN'
data_sub['analysis_level']=data_sub['analysis_level'].replace({'Product Groups':np.nan})
# all rows containing '-100.0%' indicate that the product was not sold during the week
# therefore, they are excluded from the analysis by:
# turning all '-100.0%' values into 'NaN'
# deleting all rows containing 'NaN'
data_sub['price_change']=data_sub['price_change'].replace({'-100.0%':np.nan})
data_sub['sales_vol_change']=data_sub['sales_vol_change'].replace({'-100.0%':np.nan})
# Drop all rows containing 'NaN' values
data_sub = data_sub.dropna()
# Reset the index of the dataset
data_sub = data_sub.reset_index(drop=True)
# The percentage and currency columns are stored as strings. They are cleaned
# with vectorised string operations; note that assigning to rows inside
# iterrows() does not write back to the dataframe, so loops are avoided here.
# delete the '%' sign from the 'price_change' column
data_sub['price_change'] = data_sub['price_change'].str.rstrip('%')
# delete the ',' and '%' signs from the 'sales_vol_change' column
data_sub['sales_vol_change'] = data_sub['sales_vol_change'].str.replace(',', '').str.rstrip('%')
# delete the ',' sign from the 'sales_volume' column
data_sub['sales_volume'] = data_sub['sales_volume'].str.replace(',', '')
# delete the ',' sign and the leading currency symbol from the 'sales_value' column
data_sub['sales_value'] = data_sub['sales_value'].str.replace(',', '').str[1:]
# delete first column of dataset
data_sub = data_sub.iloc[:,1:]
# make 'price_change','sales_vol_change' and 'sales_volume' columns of type 'numeric'
cols = ['price_change','sales_vol_change', 'sales_volume', 'sales_value']
data_sub[cols] =data_sub[cols].apply(pd.to_numeric, errors='coerce')
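A quick check (added here) confirms that the four columns are now numeric and that no values were coerced to NaN in the conversion:

# confirm the dtypes and count any values lost to coercion
print(data_sub[cols].dtypes)
print(data_sub[cols].isnull().sum())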
# delete code from product group
data_sub['product_group'] =data_sub['product_group'].astype(str).str[:-6]
# show pre-processed dataset, ready for analysis
data_sub.head()
# create a copy of the processed dataset
analysis_ready = data_sub.copy()
items = analysis_ready['item_name'].unique().shape
print('There are '+str(items[0]) + ' items in the dataset')
cate = analysis_ready['product_group'].unique().shape
print('There are '+str(cate[0]) + ' categories of items in the dataset')
# create dataframe with the value counts
lis = pd.DataFrame(analysis_ready['product_group'].value_counts())
lis=lis.reset_index(drop = False)
# make a bar graph of the ten largest ranges by number of items
pic = analysis_ready['product_group'].value_counts()[:10].plot(kind='barh',
title= str(lis.iloc[0,0]).lower()+' is the top stocked range of the week')
pic.set_xlabel("Number of Items")
# total units sold per product group, top ten
top_units_category = data_sub[['sales_volume','product_group']]\
    .groupby(['product_group']).sum().reset_index().sort_values\
    (by='sales_volume', ascending=False)[:10]
plt.barh(top_units_category['product_group'], top_units_category['sales_volume'])
plt.xlabel('Units Sold')
plt.title(str(top_units_category.iloc[0,0]).lower()+' is the Best Selling Range of the Week')
plt.show()
# total revenue per product group, top ten
top_category = data_sub[['sales_value','product_group']]\
    .groupby(['product_group']).sum().reset_index().sort_values\
    (by='sales_value', ascending=False)[:10]
plt.barh(top_category['product_group'], top_category['sales_value'])
plt.xlabel('Revenue (£)')
plt.title(str(top_category.iloc[0,0]).lower()+' Generates the Most Revenue of the Week')
plt.show()
# create a dataframe grouped by product group by summing sales_value
top_category = data_sub[['sales_value','product_group']]\
    .groupby(['product_group']).sum().reset_index()
# standard sandwiches sales
SS = top_category.loc[top_category['product_group']=='STANDARD SANDWICHES -', 'sales_value'].iloc[0]
# total sales
total = top_category['sales_value'].sum()
# standard sandwiches as a percentage of total sales
print('Standard Sandwiches account for {:.2f}% of total sales'.format(SS / total * 100))
# total units sold per item, top ten
top_item_units = analysis_ready[['sales_volume','item_name']]\
    .groupby(['item_name']).sum().reset_index().sort_values\
    (by='sales_volume', ascending=False)[0:10]
pic1= top_item_units.set_index('item_name').plot(kind='barh',
title = str(top_item_units.iloc[0,0]).lower()+' is the Best Selling Item of the Week')
pic1.set_xlabel("Units Sold")
pic1.set_ylabel(" ")
# total revenue per item, top seven
top_item_revenue = analysis_ready[['sales_value','item_name']]\
    .groupby(['item_name']).sum().reset_index().sort_values\
    (by='sales_value', ascending=False)[0:7]
pic2 = top_item_revenue.set_index('item_name').plot(kind='barh',
    title = str(top_item_revenue.iloc[0,0]).lower()+' generates the most revenue of the week')
pic2.set_xlabel("Revenue (£)")
pic2.set_ylabel(" ")
To answer this question, I compute a correlation matrix of 'price_change' between product ranges.
# Subset dataset to contain 'product_group' and 'price_change'. Name the subset 'price'
price = analysis_ready.iloc[:,[0,2]]
# Create a dataframe with range as columns and the price change as rows.
#create empty list 'lis'.
lis = []
# for each unique product range
for i in price['product_group'].unique():
    # create a series containing the price changes of that range
    x = price.loc[price['product_group'] == str(i), 'price_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list 'lis'
    lis.append(x)
# concatenate the series in 'lis' column-wise to form a new dataframe: 'total_price'
total_price = pd.concat(lis, axis=1)
# rename columns of 'total_price'
cols = list(price['product_group'].unique())
total_price.columns = cols
# calculate correlation between price change
cor_price = total_price.corr(method = "pearson")
#show correlation matrix
cor_price
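Since the report aims to assess whether ranges can be grouped into sub-categories, one possible follow-up (a sketch added here, not part of the original analysis) is to feed the correlation matrix above into hierarchical clustering, treating 1 - correlation as a distance so that ranges whose prices co-move land in the same cluster. Pairs with no overlapping observations (NaN correlations) are treated as zero correlation, which is an assumption, as is the 0.5 cut-off.

# hierarchical clustering of ranges based on price-change correlation (sketch)
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

d = 1 - cor_price.fillna(0).values        # dissimilarity: small for co-moving ranges
np.fill_diagonal(d, 0)                    # self-distance must be exactly zero
condensed = squareform(d, checks=False)   # condensed distance vector for linkage
Z = linkage(condensed, method='average')  # average-linkage clustering
# cut the tree at a distance of 0.5 (i.e. average correlation above 0.5)
clusters = pd.Series(fcluster(Z, t=0.5, criterion='distance'), index=cor_price.columns)
clusters.sort_values()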
# Get the top 5 positive and top 5 negative correlations by:
# creating a function that collects the redundant pairs of the correlation matrix
def get_redundant_pairs(df):
    # create an empty set for redundant pairs
    pairs_to_drop = set()
    # use the same columns as the dataframe on which the function is run
    cols = df.columns
    # for each column index i
    for i in range(0, df.shape[1]):
        # for each column index j up to and including i
        for j in range(0, i+1):
            # add the pair (i, j) to the set; this covers self-pairs and the
            # lower triangle, which duplicate the upper triangle
            pairs_to_drop.add((cols[i], cols[j]))
    # return the set of pairs to drop
    return pairs_to_drop
# Create a function that gives the n top positive correlations in a dataframe 'df'
def get_top_pos_correlations(df, n=5):
    # calculate the correlation matrix of 'df' and unstack it into one column
    au_corr = df.corr(method='pearson').unstack()
    # use 'get_redundant_pairs' to find the redundant pairs of the dataframe
    labels_to_drop = get_redundant_pairs(df)
    # drop redundant pairs and sort the remaining pairs in descending order
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    # return the first 'n' sorted values
    return au_corr[0:n]
# Create a function that gives the n top negative correlations in a dataframe 'df';
# it is the same as 'get_top_pos_correlations' except that values are sorted in
# ascending order.
def get_top_neg_correlations(df, n=5):
    au_corr = df.corr(method='pearson').unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=True)
    return au_corr[0:n]
# Print results
print("Top positive correlations (price change)")
print(get_top_pos_correlations(total_price, 5))
print(' ')
print("Top negative correlations (price change)")
print(get_top_neg_correlations(total_price, 5))
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(cor_price, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Map of Price Changes")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(cor_price, mask=mask, cmap=cmap, vmax=.7, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .6})
# generate scatterplots for each pair of price_change
sns.pairplot(total_price, kind='scatter', diag_kind="kde", dropna = False);
To answer this question, I compute a correlation matrix of 'sales_vol_change' between product ranges.
analysis_ready.head(2)
# subset dataset to contain 'product_group' and 'sales_vol_change'
volume = analysis_ready.iloc[:,[0,3]]
#create a dataframe with range as columns and the sales volume change as rows
# create empty list 'lis'
lis = []
# for each product range
for i in volume['product_group'].unique():
    # create a series containing the sales volume changes of that range
    x = volume.loc[volume['product_group'] == str(i), 'sales_vol_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list
    lis.append(x)
# concatenate the series column-wise to create the dataframe 'total_vol'
total_vol = pd.concat(lis, axis=1)
# rename columns of 'total_vol'
cols = list(volume['product_group'].unique())
total_vol.columns = cols
#correlation matrix
cor_vol = total_vol.corr(method= 'pearson')
#show correlation matrix
cor_vol
# Get the top 5 positive and top 5 negative correlations, reusing the helper
# functions 'get_redundant_pairs', 'get_top_pos_correlations' and
# 'get_top_neg_correlations' defined above
print("Top positive correlations (sales volume change)")
print(get_top_pos_correlations(total_vol, 5))
print (' ')
print("Top negative correlations (sales volume change)")
print(get_top_neg_correlations(total_vol, 5))
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(cor_vol, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up picture
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Map of Sales Changes")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(cor_vol, mask=mask, cmap=cmap, vmax=.8, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
# scatter plots of sales volume change
sns.pairplot(total_vol, kind='scatter', diag_kind="kde", dropna=False);
To answer this question, I fitted a multiple regression model whose dependent variable is 'sales_vol_change' and whose independent variables are 'price_change', 'product_group' and the interaction between the two.
# fit the regression model with the sm.ols formula interface
m1 = sm.ols(formula = 'sales_vol_change ~ price_change*product_group',\
data = analysis_ready).fit()
# check results of regression
m1.summary()
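The interaction model produces one coefficient for every product range, so the summary table is long. A small filter (an added convenience, assuming the conventional 5% significance level) pulls out only the statistically significant terms:

# show only the coefficients whose p-value is below 0.05
significant = m1.pvalues[m1.pvalues < 0.05]
print(m1.params[significant.index])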
To do so, I have: subset the sales volume changes of the standard sandwiches range; built a dataframe containing the price changes of every other range; computed the correlation between the two; and fitted a simple regression of sandwich sales volume change on salad price change.
analysis_ready.head()
# Subset data to only include sales volume change for items in the standard sandwiches range
sales_volume_change = analysis_ready.loc[analysis_ready['product_group'] == 'STANDARD SANDWICHES -',\
    'sales_vol_change']
sales_volume_change = sales_volume_change.reset_index(drop=True)
# create a list containing all product ranges except standard sandwiches
prices = list(analysis_ready['product_group'].unique())
# remove standard sandwiches from the list
prices.remove('STANDARD SANDWICHES -')
# create the dataframe on which the correlation will be calculated
# create an empty list w
w = []
# for each remaining range
for i in prices:
    # create a series containing the price changes of that range
    x = analysis_ready.loc[analysis_ready['product_group'] == str(i), 'price_change']
    # reset the index of the series
    x = x.reset_index(drop=True)
    # append the series to the list w
    w.append(x)
# concatenate the series together
total_w = pd.concat(w, axis=1)
# edit list to rename columns
string = '_price_change'
prices_edited = [x[:-4] + string for x in prices]
# rename columns of dataframe
total_w.columns = prices_edited
# show dataframe
total_w.head(10)
# add sales volume change data of range A to the dataframe
total_R = pd.concat([sales_volume_change, total_w], axis=1)
# correlation matrix showing how the sales volume change of standard sandwiches
# correlates with the price change of every other range
corr = pd.DataFrame(total_R.corr(method='pearson').iloc[:,0])
# show the correlation coefficients
corr
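The single-column matrix is easier to read when ranked (an added step); the first row is dropped because it is the correlation of sandwich sales change with itself:

# rank the price-change correlations against sandwich sales volume change
corr.iloc[1:, 0].sort_values(ascending=False)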
#visualization of correlation
sns.set(style="white")
# create mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up picture
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Correlation Between Standard Sandwiches Sales Volume Change and Price Change of all Other Ranges")
# create a colormap
cmap = sns.diverging_palette(220, 4, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.35, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
The regression model shows that the price change of salads is statistically significant in predicting the sales change of standard sandwiches:
Intercept = -6.54 (when the price change of salads is 0, the sales change of standard sandwiches is -6.54)
Price change of salads coefficient = 1.95 (a one-point change in the price of salads is associated with a 1.95-point change in the sales of sandwiches)
# fit the simple linear regression model
m2 = sm.ols(formula = 'sales_vol_change ~ SALAD_price_change',\
data = total_R).fit()
m2.summary()
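As a worked example of reading these coefficients (using the estimates reported above and a hypothetical 2% rise in salad prices):

# predicted sandwich sales change = intercept + slope * salad price change
intercept, slope = -6.54, 1.95
salad_price_change = 2.0   # hypothetical: salads become 2% more expensive
print(intercept + slope * salad_price_change)   # -6.54 + 3.90 = -2.64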
To conclude, this report attempted to identify relationships among ranges of food-to-go products. The analysis was divided into three steps: data cleaning and pre-processing; descriptive statistics, where standard sandwiches was identified as the range generating the largest share of revenue (43.55%); and correlation analysis. The latter showed that certain ranges experience co-movements in prices and sales, and that the sales of standard sandwiches are related to the change in price of salads.