Prediction of Grocery Store Sales Using Spatial Interaction Modelling (SIM)


This project is inspired by the Spatial Modelling for Retail Analytics course offered by the Consumer Data Research Center (CDRC) I recently attended. The course was held by Dr. Andy Newing and Dr. Nick Hood. Andy and Nick did a great job at introducing the class to sophisticated analysis and spatial modelling approaches that can be used to understand retail sector dynamics. Please check out all the CDRC courses and events here.

The technique that stood out the most to me was Spatial Interaction Modelling. This technique is used in social sciences to predict and describe behaviours that mimic gravitation interactions as described by Isaac Newton's law of gravity.

Three independent conditions are necessary for spatial interaction to occur:

  • Complementarity: There must be supply and demand between the interacting locations. In the specific case of retailing, the supply side are the retail points, while the deman side is the available expenditure of customers.
  • Intervening opportunity (lack of): Refers to a location that may offer a better alternative as a point of origin or as a point of destination. In retail for example, a customer visits a store becasue he/she must not have a closer store that offers similar array of goods.
  • Transferability: Persons being transferred must be supported by transport infrastructures implying that origin and destination must be linked. the cost to overcome distance must not be higher than the benefits of the interaction.

Spatial interaction models seek to explain spatial flows between origins and destinations. In this example, the aim is to predict the flow of money into Sheffield grocery stores, given their location, and the available expenditure of the consumers in the modelled area.

The relationship between money flow, destinations and origins is given by the equation below:

$$ S_{ij} = A_iO_iW_j exp(-\beta C_{ij}) $$

Where,

  • $S_{ij}$ = flow of expected flow of supermarket $j$ from orgin $i$

  • $A_i$ = balancing factor that takes into account competition and ensures that demand is allocated to stores within the modelled region. it is calculated as:

$$ A_i = \frac{1}{\sum W_j exp (-\beta C_{ij})}$$

  • $W_j$ = represents the overall attractiveness of store j. In this case size is used as a proxy of attractiveness.
  • $exp(-\beta C_{ij})$ = represents the distance deterrence term. $\beta$ is between 0 and 1. $\beta$ = 0 means that the consumer does not mind travelling. $\beta$ = 1 means that the consumer does not want to travel. $C_ij$ represents the distance between origin and supermarket.

Important:

This analysis is an adaptation of Dr. Andy Newing's and Dr. Nick Hood's example projects at the CDRC course on Spacial Modelling for Retail Analytics. During the course the computations were made in Excel, I just applied the same models and ideas to a different dataset and modelled the analysis in Python.

In [744]:
# Import packages

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import numpy as np
from collections import Counter
from math import log, sqrt
import warnings
from mpl_toolkits.basemap import Basemap
from random import randint
from random import seed
from random import random
from random import uniform
from scipy.spatial import distance_matrix
from sklearn.metrics.pairwise import euclidean_distances
import scipy as sp
from geopy import distance
import itertools

import plotly
import plotly.plotly as py
import plotly.graph_objs as go

warnings.filterwarnings('ignore')
%matplotlib inline
In [745]:
seed(1234)
In [746]:
# function to count missing data
def count_missing(data):
    # creates a dataframe containing the sum of all missing values
    missing_count = pd.DataFrame(data.isnull().sum(),\
                                 # names column "Count" and sorts values in descending order
                                 columns=['Count']).sort_values(by=['Count'],\
                                                                ascending=False)
    return missing_count

The Data


The data used for this project was downloaded from Geolytix. The dataset contains over 10,000 UK supermarket roof-top locations size information. Retailers include Aldi, Asda, Booths, Budgens, Co-op, Costco, Lidl, Marks and Spencer, Morrisons, Sainsbury’s, Tesco, Waitrose, Wholefoods.

I then integrated this data with information regarding the origins, or demand zoens. The demand zones are the wards of the city of Sheffield. Ward data is available from the Office for National Statistics.

In [747]:
# read geolytics data 
data_points = pd.read_csv('./retailpoints.csv')
In [748]:
# copy data and see how dataset looks 
retail_points = data_points
retail_points.head()
Out[748]:
id retailer fascia store_name add_one add_two town locality postcode long_wgs lat_wgs bng_e bng_n pqi open_date close_date size_band size_sqft size_sqm
0 1010000001 Aldi Aldi Aldi Aberdeen Unit 9-10 Boulevard Retail Park NaN Aberdeen Central Aberdeen AB11 5EJ -2.081728 57.148626 395153.8100 806418.6600 Rooftop geocoded by Geolytix 0 0 3,013 < 15,069 ft2 (280 < 1,400 m2) Between 3,013 and 15,069 Between 280 and 1,400
1 1010000002 Aldi Aldi Aldi Westhill Westhill Drive NaN Westhill NaN AB32 6FY -2.278341 57.154810 383260.1400 807138.5400 Rooftop geocoded by Geolytix 0 0 3,013 < 15,069 ft2 (280 < 1,400 m2) Between 3,013 and 15,069 Between 280 and 1,400
2 1010000003 Aldi Aldi Aldi Ellon Ythan Terrace NaN Ellon NaN AB41 9LJ -2.075009 57.363314 395586.8100 830317.9900 Rooftop geocoded by Geolytix 0 0 3,013 < 15,069 ft2 (280 < 1,400 m2) Between 3,013 and 15,069 Between 280 and 1,400
3 1010000004 Aldi Aldi Aldi Inverurie Loco Works Road NaN Inverurie NaN AB51 4FY -2.380156 57.288771 377183.9088 822081.1258 Rooftop geocoded by Geolytix 0 0 3,013 < 15,069 ft2 (280 < 1,400 m2) Between 3,013 and 15,069 Between 280 and 1,400
4 1010000005 Aldi Aldi Aldi Hatfield 11 Parkhouse Court Comet Way Hatfield Hatfield AL10 9RQ -0.240950 51.763152 521491.8500 208626.7200 Rooftop geocoded by Geolytix 0 0 3,013 < 15,069 ft2 (280 < 1,400 m2) Between 3,013 and 15,069 Between 280 and 1,400
In [749]:
print('The dataset contains ', str(retail_points.shape[0]), ' rows and ', str(retail_points.shape[1]), ' columns')
The dataset contains  13859  rows and  19  columns
In [750]:
column_names = retail_points.columns.values.tolist()
print('The dataset contains the following columns')
print(column_names)
The dataset contains the following columns
['id', 'retailer', 'fascia', 'store_name', 'add_one', 'add_two', 'town', 'locality', 'postcode', 'long_wgs', 'lat_wgs', 'bng_e', 'bng_n', 'pqi', 'open_date', 'close_date', 'size_band', 'size_sqft', 'size_sqm']
In [751]:
# examine the type of data in the set
for i in range(retail_points.shape[1]):
    print('the type of items in column ', str(i), ' is ', str(type(retail_points.iloc[0][i])))
the type of items in column  0  is  <class 'numpy.int64'>
the type of items in column  1  is  <class 'str'>
the type of items in column  2  is  <class 'str'>
the type of items in column  3  is  <class 'str'>
the type of items in column  4  is  <class 'str'>
the type of items in column  5  is  <class 'float'>
the type of items in column  6  is  <class 'str'>
the type of items in column  7  is  <class 'str'>
the type of items in column  8  is  <class 'str'>
the type of items in column  9  is  <class 'numpy.float64'>
the type of items in column  10  is  <class 'numpy.float64'>
the type of items in column  11  is  <class 'numpy.float64'>
the type of items in column  12  is  <class 'numpy.float64'>
the type of items in column  13  is  <class 'str'>
the type of items in column  14  is  <class 'numpy.int64'>
the type of items in column  15  is  <class 'numpy.int64'>
the type of items in column  16  is  <class 'str'>
the type of items in column  17  is  <class 'str'>
the type of items in column  18  is  <class 'str'>
In [752]:
# check for missing values
count_missing(retail_points)
Out[752]:
Count
add_two 9717
locality 3955
id 0
bng_e 0
size_sqft 0
size_band 0
close_date 0
open_date 0
pqi 0
bng_n 0
long_wgs 0
lat_wgs 0
retailer 0
postcode 0
town 0
add_one 0
store_name 0
fascia 0
size_sqm 0

Filter


For this project I decided to use supermarkets located in the city of Sheffield.

In [753]:
# subset data to contain grocery stores in sheffield 
city = 'Sheffield'
retail_points_shef = retail_points.loc[retail_points['town'] == city]
retail_points_shef = retail_points_shef.reset_index(drop = True)
In [754]:
print('There are ', str(retail_points_shef.shape[0]), ' supermarket stores in ', city)
There are  125  supermarket stores in  Sheffield
In [755]:
# only keep the columns we are interested in 
retail_points_shef = retail_points_shef.iloc[:, [0,1,8,9,10,11,12,18]]

Show Sheffield Retail Points in Map

In [756]:
# map retail points in sheffield
# Extract the data we're interested in
lat = retail_points_shef['lat_wgs'].values
lon = retail_points_shef['long_wgs'].values

# How much to zoom from coordinates (in degrees)
zoom_scale = 0.02

# Setup the bounding box for the zoom and bounds of the map
bbox = [np.min(lat)-zoom_scale,np.max(lat)+zoom_scale,\
        np.min(lon)-zoom_scale,np.max(lon)+zoom_scale]

plt.figure(figsize=(12,6))

# Define the projection, scale, the corners of the map, and the resolution.
m = Basemap(projection='merc',llcrnrlat=bbox[0],urcrnrlat=bbox[1],\
            llcrnrlon=bbox[2],urcrnrlon=bbox[3],lat_ts=10,resolution='h')

m.drawcoastlines()

# draw parallels, meridians, and color boundaries
m.drawparallels(np.arange(bbox[0],bbox[1],(bbox[1]-bbox[0])/5),labels=[1,0,0,0])
m.drawmeridians(np.arange(bbox[2],bbox[3],(bbox[3]-bbox[2])/5),labels=[0,0,0,1],rotation=45)
m.drawmapboundary()

# build and plot coordinates onto map
x,y = m(lon,lat)
m.plot(x,y,'r*',markersize=5)
plt.title("Retail points in Sheffield")
plt.show()

Data Linkage


It is now time to obtain data regarding the demand side of the analysis. The information we need to obtain is:

  • How many demand zones are there and what are their locations?
  • What are the available expenditure of the demand zones?
In [757]:
# read postcode data
# This data will be used to assign postcodes to wardcodes. 
postcodes = pd.read_csv('postcodes.csv')
In [758]:
# make a copy of the dataset 
pcodes = postcodes
# select variables of interest (postcode and wardcode)
pcodes = pcodes.iloc[:,[2,6]]
# rename columns
pcodes = pcodes.rename(index=str, columns={"pcds": "postcode", "wd11cd": "ward_code"})
pcodes.head()
Out[758]:
postcode ward_code
0 AL1 1AG E05004802
1 AL1 1AJ E05004802
2 AL1 1AR E05004789
3 AL1 1AS E05004799
4 AL1 1BH E05004802
In [759]:
# read sheffield postcode data 
sheffield_codes = pd.read_csv('./Sheffield_postcodes.csv')
In [760]:
# keep only postcodes that are still in use. 
sheffield_codes = sheffield_codes.loc[sheffield_codes['In Use?']=='Yes',:]
# keep only columns that we need 
sheffield_codes = sheffield_codes.iloc[:,[0,4,5]]
# rename dataset columns
sheffield_codes = sheffield_codes.rename(index=str, columns={"Postcode": "postcode", "Easting": "easting",
                                                            'Northing':'northing'})
In [761]:
# merge sheffield postcodes with ward dataset to get all wards in the city of sheffield 
sheffield_codes = pd.merge(sheffield_codes, pcodes, on='postcode', how='inner')
# keep useful columns 
sheffield_codes = sheffield_codes.iloc[:, [3,1,2]]
# more that one postcode is assigned to the same ward. group data by ward and take the mean of the
# location to get a location coordinate for the ward.
sheffield_codes = sheffield_codes.groupby('ward_code').mean().reset_index()
sheffield_codes.head(2)
Out[761]:
ward_code easting northing
0 E05001020 441185.00000 389923.00000
1 E05001039 437160.90379 384844.64723
In [762]:
print('There are ', str(sheffield_codes.shape[0]), ' demand zones in ', city)
There are  30  demand zones in  Sheffield
In [763]:
# Read population data 
# this gives data about the economically active population in the demand zone. 
pop = pd.read_csv('./unemploy.csv')
In [764]:
population = pop
population = population.iloc[:, [1,5]]
population = population.iloc[1:]
population = population.rename(index=str, columns={"GEO_CODE": "ward_code", 
                                     "F244": "economically active"})
population.head()
Out[764]:
ward_code economically active
1 95AA01 2473
2 95AA02 892
3 95AA03 840
4 95AA04 1660
5 95AA05 1362

Create a "Demand" Dataset

At this point I need to generate data regarding the demand zones. Through the census data I was able to know how many people are economically active in each ward. An economically active individual is either employed or is looking for employment.

I am assuming that each economically active individual of each ward will spend at least 10 pounds in supermarkets each week. After generating this assumption I can calculate the expenditure of each demand zone, and the total expenditure that the model will allocate to the supermarkets.

In [765]:
# merge sheffield wardcodes with population data to get information about population in each ward.
demand = pd.merge(sheffield_codes, population, on='ward_code', how='inner')

# fucntion that returns demand
def get_demand_DF(data):
    data['economically active'] = data['economically active'].astype(float)
    data['expenditure'] = data['economically active']*10
    data = data.loc[:,['ward_code','easting','northing','expenditure']]
    data = data.set_index('ward_code')
    return data

demand_d = get_demand_DF(demand)

# I am assuming that each economically active individual in the deman zone will spend 10 pounds per week on food. 
demand_d.head()
Out[765]:
easting northing expenditure
ward_code
E05001020 441185.000000 389923.000000 62920.0
E05001039 437160.903790 384844.647230 89230.0
E05001040 434464.363344 381142.086817 87710.0
E05001041 443061.252308 383108.584615 98910.0
E05001042 439911.310954 383325.293286 83920.0
In [766]:
print('The total expenditure is ', str(demand_d['expenditure'].sum()), ' per week')
The total expenditure is  2782600.0  per week

Create a "Supply" Dataset

The supply data needs to have information about the location and size of the retailer. It would be ideal to have exact information about the size of the retail point, however, the Geolytics data only gives a range of values. For example, store X is smaller than 280 sqm, or between 280 and 1400. To solve this problem, I decided to generate a random integer withing those bounds and to assume that to be the store size.

In [767]:
# merge retail points data with postcode data to attach ward code to each store. 
supply = pd.merge(retail_points_shef, pcodes, on='postcode', how='inner')
In [768]:
def size(row):
    #assing all rows to unspecified 
    value = 0
    # but, if  
    if row['size'] == 'Between 280 and 1,400':
        # give value 
        value = randint(280, 1400)
    if row['size'] == 'Over 2,800': 
        value = randint(2800, 3500)
    if row['size'] == 'Between 1,400 and 2,800':
        value = randint(1400, 2800)
    if row['size'] == 'Less than 280':
        value = randint(100,280)
    #at the end, return value
    return value

# function that creates supply dataset 
def get_supply_DF(data):
    # get columns of interest
    data = data.loc[:,['id','retailer','bng_e','bng_n','size_sqm']]
    # rename columns 
    data = data.rename(index=str, columns={"id": "id", "retailer": "retailer",
                                            'bng_e':'easting',
                                            'bng_n': 'northing',
                                            'size_sqm':'size'})
    # add size with function above
    data['size'] = data.apply(size, axis=1)
    # set column 'id' as index of dataframe
    data = data.set_index('id')
    # reutn dataframe
    return data

# apply function 
supply_d = get_supply_DF(supply)
supply_d.head()  
Out[768]:
retailer easting northing size
id
1010000414 Aldi 435703.41 394469.17 1182
1010000423 Aldi 433918.19 383226.01 519
1010001067 Asda 438184.00 384928.50 295
1010001068 Asda 435749.93 385867.44 465
1010001070 Asda 433051.56 390683.61 351

Now that the supply dataset is complete, let's look at how many retail points each retailer has in Sheffield.

The graph below shows that the retailer witht he highest number of retail points in The Co-operative Group with 25 stores withing the modelled area.

In [769]:
# bar graph showing which supermarkets have more retail points. 
d = supply_d.groupby('retailer', as_index=False).count()
d = d.sort_values(by=['size'], ascending = False)
plotly.tools.set_credentials_file(username='giacopini', api_key='sAkPM0WuDb2Ah30ixDFx')

trace0 = go.Bar(
    x= d['retailer'],
    y= d['size'],
    marker=dict(
        color=['rgba(204,204,204,1)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)']),
)



data = [trace0]
layout = go.Layout(
    title= str(d[0:1]['retailer']).split('\n')[0].split('   ')[1]+\
    ' is the retailer with the highest number of retail points in Sheffield',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='text-hover-bar')
Out[769]:

Spatial Interaction Modelling

To recap, the expected flow of money between a demand zone and a destination is calculated by $$S_{ij} = A_iO_iW_j exp(-\beta C_{ij})$$

Where,

  • $S_{ij}$ = flow of expected revenue of the supermarket

  • $A_i$ = balancing factor that takes into account competition and ensures that demand is allocated to stores within the modelled region. it is calculated as:

$$ A_i = \frac{1}{\sum W_j exp (-\beta C_{ij})}$$

  • $W_j$ = represents the overall attractiveness of store j. In this case size is used as a proxy of attractiveness.
  • $exp(-\beta C_{ij})$ = represents the distance deterrence term. $\beta$ is between 0 and 1. $\beta$ = 0 means that the consumer does not mind travelling. $\beta$ = 1 means that the consumer does not want to travel. $C_ij$ represents the distance between demand zone and supermarket.

I will complete this analysis in 5 easy steps

1. Calculate distance matrix $C_{ij}$

This matrix contains the distance from demand zones to supermarkets. In this case we will calculate the euclidean distance.

$${\displaystyle d(\mathbf {p} ,\mathbf {q} )={\sqrt {(q_{1}-p_{1})^{2}+(q_{2}-p_{2})^{2}}}.}$$

In [770]:
def get_euclidean_disctance(demand_d, supply_d):
    # get easting and northing of demand zones
    coords_d = demand_d.loc[:,['easting','northing']]
    # make them a tuple 
    coords_d['coords'] = coords_d[['easting', 'northing']].apply(tuple, axis=1)
    # add tuple as a col of df 
    coords_d = coords_d.loc[:,'coords']
    
    # do the same for supply coordinates
    coords_s = supply_d.loc[:,['easting','northing']]
    coords_s['coords'] = coords_s[['easting', 'northing']].apply(tuple, axis=1)
    coords_s = coords_s.loc[:,'coords']
    
    # calculate distances
    distances = []
    for j in range(demand_d.shape[0]):
        dist = []
        for i in range(supply_d.shape[0]):
            # euclidean distance
            d = distance.vincenty(coords_d.iloc[j],coords_s.iloc[i]).km
            dist.append(d)
        distances.append(dist) 
    # create a df of distances
    df = pd.DataFrame(distances)
    df = df.transpose()
    # convert to matrix 
    Cij = df.as_matrix()
    # this is the matrix of distances. 
    return Cij

# call function
Cij = get_euclidean_disctance(demand_d, supply_d)
Cij
Out[770]:
array([[11579.27108757,  3860.08601596,  2354.13545013, ...,
        15058.76513885, 14423.12474376, 10801.25890367],
       [13002.83972318,  6594.27706577,  4201.59370656, ...,
        12198.04459469, 11134.06820425,  8623.7485972 ],
       [ 7485.44912617, 15358.11609151, 19007.36529306, ...,
         3920.74525381,  5248.2822813 ,  9967.13700189],
       ...,
       [ 9891.47674492,  3270.82567274,  4617.65623753, ...,
        13228.72711313, 13533.67819989, 12072.78118189],
       [ 7012.16926539, 14166.74766633, 17899.46487534, ...,
         1808.7926919 ,  4282.09260944, 10353.35099072],
       [ 5277.39209246, 12923.88767578, 17655.99278047, ...,
         6011.22017955,  7895.39624805, 12433.00585932]])

2. Get $W_j$

$W_j$ is the attractiveness of each retail point. In this project we are assuming that the attractiveness of a store is given by its size.

In [771]:
Wj = np.array(supply_d.iloc[:,3])
Wj
Out[771]:
array([1182,  519,  295,  465,  351,  451,  481, 1006,  764, 2817, 1463,
       2816,  188,  265, 1271,  257,  218,  138,  123,  146,  129,  103,
        228,  224,  163,  116,  270,  238,  219,  117,  253,  272,  462,
        231,  374,  832,  409,  265,  177,  223,  190,  168,  163,  267,
        151, 1208, 1365,  270,  319,  230,  424,  101,  185,  136,  221,
       1052,  622,  390,  119, 1296,  615,  434,  720,  575,  335,  989,
        128,  235,  569,  271,  905,  216,  361, 1132])

3. Get $O_i$

$O_i$ is the expenditure available to consumers.

In [772]:
Oi = np.array(demand_d.loc[:,'expenditure'])
Oi
Out[772]:
array([ 62920.,  89230.,  87710.,  98910.,  83920.,  79470., 104930.,
       171180.,  91600.,  98530.,  81090.,  96990.,  97170.,  86730.,
        80690., 109310.,  93370., 104160.,  97100.,  93620., 108050.,
        89900., 100280.,  89000.,  94300.,  94040., 104530.,  91290.,
        85290.,  17290.])

4. Calculate $A_i$

$A_i$ is given by:

$$ A_i = \frac{1}{\sum W_j exp (-\beta C_{ij})}$$

I am choosing a values of $\beta$ = 0.1

In [773]:
def get_Ai(Wj, Cij, beta = 0.001):
    # take exp of distance matrix and mulptiply by negative beta
    exp = np.exp(-beta*Cij)
    # multiply exponential by w
    w = exp.T * Wj
    # sum across
    s = w.sum(axis=1)
    # take inverse of sum 
    Ai = 1/s
    return Ai

Ai = get_Ai(Wj, Cij)
Ai
Out[773]:
array([0.0045832 , 0.00091437, 0.00355984, 0.00039778, 0.0082383 ,
       0.00044019, 0.00296396, 0.00048976, 0.00087273, 0.00817824,
       0.00645156, 0.00521571, 0.00127692, 0.00703391, 0.00896696,
       0.00044667, 0.00474796, 0.00127867, 0.01203539, 0.00703242,
       0.01566707, 0.00048271, 0.00632488, 0.00174476, 0.00592595,
       0.00712245, 0.00061569, 0.00352848, 0.00889034, 0.00885076])

5. Calculate flow of money $S_{ij}$

In [774]:
# calculate Sij which is the stream of money 
def get_Sij(Cij, Wj, Ai, Oi, beta = 0.001):
    exp = np.exp(-beta*Cij)
    w = exp.T * Wj
    Sij = w.T*(Ai*Oi)
    Sij = Sij.sum(axis=1)
    return Sij

Sij = get_Sij(Cij, Wj, Ai, Oi)
In [775]:
# add expected expenditure to dataframe
supply['expected revenue'] = Sij
In [776]:
# has all money been allocated? 
total_revenue = Sij.sum()
total_demand = Oi.sum()
print('the total revenue is ', total_revenue, ' the total money household have to spend on food is ', total_demand )
the total revenue is  2782600.0  the total money household have to spend on food is  2782600.0

Findings

In [777]:
# group data by retailer to get total revenue per retailer
total_revs = supply.groupby('retailer', as_index=False)['expected revenue'].sum()
# sort from higher expected revenue to lower
total_revs = total_revs.sort_values(by=['expected revenue'], ascending = False).reset_index()
In [778]:
total_revs['retailer'][0]
Out[778]:
'The Co-operative Group'
In [779]:
# bar graph to show which retailer is expected the higher weekly revenue

trace0 = go.Bar(
    x= total_revs['retailer'],
    y= total_revs['expected revenue'],
    marker=dict(
        color=['rgba(222,45,38,0.8)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)','rgba(204,204,204,1)',
               'rgba(204,204,204,1)']),
)

data = [trace0]
layout = go.Layout(
    title= str(total_revs['retailer'][0]) + ' can expect the highest weekly revenue',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='color-bar')
Out[779]:

We can see that The Co-operatvie Group is the retailer that can expect the highest weekly revenue. This can be due to the fact that they have 25 retail points in Sheffield. The picture below shows that The Co-operative Group can expect to receive the 24.5% of the expenditure available in the modelled area (Sheffield).

It is also important to point out how Morrisons, with just 2 stores in the city was able to attract 10% of total demand. This is possibly due to the fact that the retailer has attractive stores in attractive locations.

In [780]:
# pie chart
labels = total_revs['retailer']
values = total_revs['expected revenue']

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')
Out[780]:

Location Planning Simulation

Spatial interaction modelling is often times used for location planning. Let's assume that Morrisons wants to introduce a new store in sheffield and has a budget that will allow to create a 2,000 square meters store. The location planning team is undecided between possible locations:

  • Location 1
  • Location 2
  • Location 3

Let's see which location is the most profitable accoring to the Spatial Interatction Model

In [781]:
# randomly generate 3 locations withing the modelled area 
# get min and max easting and northing
min_easting = min(supply_d['easting'])
max_easting = max(supply_d['easting'])
min_northing= min(supply_d['northing'])
max_northing= max(supply_d['northing'])
size_supermarket = 2000

# generate n random location withing the modelled area
locations=[]
how_many_random_loc = 3
for i in range(how_many_random_loc):
    locations.append([uniform(min_easting,max_easting),\
                   uniform(min_northing, max_northing)])

# create rows to add to supply dataset
rows = []    
for i in locations: 
    rows.append(['Morrisons', i[0], i[1], size_supermarket])
In [782]:
# return list of expected revenue per location
def exp_revenue(location_rows, demand_d, supply_d):
    # initialise empty list 
    exp_revs = []
    total_revs_DF = []
    # for each location 
    for row in location_rows:
        # make row a series
        new = pd.Series(row, index=supply_d.columns)
        # create a copy of supply dataset
        supply_copy = supply_d.copy(deep=True)
        # append row to suppy dataset 
        supply_copy = supply_copy.append(new, ignore_index=True)
        # recalculate Cij, Wj, Oi, Ai, and Sij
        Cij = get_euclidean_disctance(demand_d, supply_copy)
        Wj = np.array(supply_copy.iloc[:,3])
        Oi = np.array(demand_d.loc[:,'expenditure'])
        Ai = get_Ai(Wj, Cij)
        Sij = get_Sij(Cij, Wj, Ai, Oi, beta = 0.001)
        # get expected revenue of new supermarket
        location_expected_rev = Sij[-1]
        # add expected revenue to dataset
        supply_copy['expected revenue'] = Sij
        # group by retailer 
        total_revs = supply_copy.groupby('retailer', as_index=False)['expected revenue'].sum()
        # append expected revenue to list 
        exp_revs.append(location_expected_rev)
        # expect grouped data to list of datasets
        total_revs_DF.append(total_revs)
    # return list of DF and list of expected revenues per location
    return total_revs_DF, exp_revs

# call function
dfs, exp_revs = exp_revenue(rows, demand_d, supply_d)
In [783]:
# create labels for locations 
labels = []
# print expected revenue per location.
for i in range(len(exp_revs)):
    print('The weekly revenue expected from location', i+1, 'is £', str(exp_revs[i])[:10])  
    labels.append('Location ' + str(i+1))
The weekly revenue expected from location 1 is £ 21355.2229
The weekly revenue expected from location 2 is £ 211068.481
The weekly revenue expected from location 3 is £ 117489.584
In [784]:
# bar graph all expected revenues per locations 
def plot_bar_x():
    # this is for plotting purpose
    index = np.arange(len(labels))
    plt.bar(index, exp_revs)
    plt.xlabel('locations', fontsize=5)
    plt.ylabel('expected weekly revenue', fontsize=12)
    plt.xticks(index, labels, fontsize=12, rotation=30)
    plt.title('')
    plt.show()
    
plot_bar_x()
In [785]:
# recalculate pie chart to see market share.
labels = dfs[np.argmax(exp_revs)]['retailer']
values = dfs[np.argmax(exp_revs)]['expected revenue']
trace = go.Pie(labels=labels, values=values)
py.iplot([trace], filename='basic_pie_chart')
Out[785]:

Conclusion


This project is an attempt at using spatial interaction modelling to predict revenue of Sheffield supermarkets and as a location planning tool. According to the data and the Spatial Interaction Model used, The Co-operative Group is the retialer that can expect the higherst weekly revenue in Sheffield with a market share of 24.5%.

For the location planning tool, I assumed that Morrison's supermarkets is considering to add another store in Sheffield. The retailer is condiring 3 locations and has a budjet to open a 2000 square meters store. The SIM used forecasted Location 2 to be the best option for Morrison's. Location 2 is in fact the one that can expect the highest weekly revenue moreover, a store in location 2 will increase Morrison's market share from 10% to 16.7%, making it the second supermarket in Sheffield in terms of expected weekly revenue.

This model can be improved with information regarding store sizes, actual available expenditure, actual data regarding deman zones.

Thank you to the CDRC and the lecturers for introducting me to SIM, a technique I will definitely take into account when working on my own retailing research.