The technique that stood out the most to me was Spatial Interaction Modelling. This technique is used in the social sciences to predict and describe behaviours that mimic gravitational interactions, as described by Isaac Newton's law of gravity.
Three independent conditions are necessary for spatial interaction to occur: complementarity (a demand at one location and a matching supply at another), transferability (the interaction must be feasible given the cost of overcoming distance), and the absence of intervening opportunities (no closer alternative already satisfies the demand).
Spatial interaction models seek to explain spatial flows between origins and destinations. In this example, the aim is to predict the flow of money into Sheffield grocery stores, given their locations and the available expenditure of consumers in the modelled area.
The relationship between money flow, destinations and origins is given by the equation below:
$$ S_{ij} = A_i O_i W_j \exp(-\beta C_{ij}) $$
Where,
$S_{ij}$ = expected flow of expenditure from origin $i$ to supermarket $j$
$O_i$ = expenditure available to consumers in origin (demand zone) $i$
$W_j$ = attractiveness of supermarket $j$, here measured by its size
$C_{ij}$ = distance between origin $i$ and supermarket $j$
$\beta$ = distance-decay parameter
$A_i$ = balancing factor that takes into account competition and ensures that demand is allocated to stores within the modelled region. It is calculated as:
$$ A_i = \frac{1}{\sum_j W_j \exp(-\beta C_{ij})} $$
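To make the mechanics concrete, here is a minimal sketch of the model in Python with made-up numbers (two demand zones, two stores); all values below are purely illustrative and are not part of the Sheffield data.
import numpy as np
# toy inputs (illustrative values only)
Oi = np.array([1000.0, 500.0])      # expenditure available in each origin (demand zone)
Wj = np.array([1200.0, 300.0])      # attractiveness of each store (floorspace in sqm)
Cij = np.array([[2.0, 5.0],         # distance from each origin to each store (km)
                [4.0, 1.0]])
beta = 0.1                          # distance-decay parameter (illustrative)
# balancing factor A_i = 1 / sum_j W_j exp(-beta * C_ij)
Ai = 1.0 / (Wj * np.exp(-beta * Cij)).sum(axis=1)
# expected flows S_ij = A_i * O_i * W_j * exp(-beta * C_ij)
Sij = (Ai * Oi)[:, None] * Wj * np.exp(-beta * Cij)
# the balancing factor guarantees that each origin's expenditure is fully allocated
print(Sij.sum(axis=1))   # equals Oi
print(Sij.sum())         # equals Oi.sum()
The row sums confirm that the model reallocates, rather than creates, expenditure: whatever the origins have to spend is shared out among the competing destinations.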
This analysis is an adaptation of Dr. Andy Newing's and Dr. Nick Hood's example projects from the CDRC course on Spatial Modelling for Retail Analytics. During the course the computations were done in Excel; here I apply the same models and ideas to a different dataset and implement the analysis in Python.
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import numpy as np
from collections import Counter
from math import log, sqrt
import warnings
from mpl_toolkits.basemap import Basemap
from random import randint
from scipy.spatial import distance_matrix
from sklearn.metrics.pairwise import euclidean_distances
import scipy as sp
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
warnings.filterwarnings('ignore')
%matplotlib inline
# function to count missing data
def count_missing(data):
    # create a dataframe containing the number of missing values in each column,
    # name the column "Count" and sort the values in descending order
    missing_count = pd.DataFrame(data.isnull().sum(), columns=['Count'])\
                      .sort_values(by=['Count'], ascending=False)
    return missing_count
I then integrated this data with information regarding the origins, or demand zones. The demand zones are the wards of the city of Sheffield. Ward data is available from the Office for National Statistics.
# read geolytics data
data_points = pd.read_csv('./retailpoints.csv')
# copy data and see how dataset looks
retail_points = data_points.copy()
retail_points.head()
print('The dataset contains ', str(retail_points.shape[0]), ' rows and ', str(retail_points.shape[1]), ' columns')
column_names = retail_points.columns.values.tolist()
print('The dataset contains the following columns')
print(column_names)
# examine the type of data in the set
for i in range(retail_points.shape[1]):
print('the type of items in column ', str(i), ' is ', str(type(retail_points.iloc[0][i])))
# check for missing values
count_missing(retail_points)
# subset data to contain grocery stores in sheffield
city = 'Sheffield'
retail_points_shef = retail_points.loc[retail_points['town'] == city]
retail_points_shef = retail_points_shef.reset_index(drop = True)
print('There are ', str(retail_points_shef.shape[0]), ' supermarket stores in ', city)
# only keep the columns we are interested in
retail_points_shef = retail_points_shef.iloc[:, [0,1,8,9,10,11,12,18]]
retail_points_shef.head()
# map retail points in sheffield
# Extract the data we're interested in
lat = retail_points_shef['lat_wgs'].values
lon = retail_points_shef['long_wgs'].values
# How much to zoom from coordinates (in degrees)
zoom_scale = 0.02
# Setup the bounding box for the zoom and bounds of the map
bbox = [np.min(lat)-zoom_scale,np.max(lat)+zoom_scale,\
np.min(lon)-zoom_scale,np.max(lon)+zoom_scale]
plt.figure(figsize=(12,6))
# Define the projection, scale, the corners of the map, and the resolution.
m = Basemap(projection='merc',llcrnrlat=bbox[0],urcrnrlat=bbox[1],\
llcrnrlon=bbox[2],urcrnrlon=bbox[3],lat_ts=10,resolution='h')
m.drawcoastlines()
# draw parallels, meridians, and color boundaries
m.drawparallels(np.arange(bbox[0],bbox[1],(bbox[1]-bbox[0])/5),labels=[1,0,0,0])
m.drawmeridians(np.arange(bbox[2],bbox[3],(bbox[3]-bbox[2])/5),labels=[0,0,0,1],rotation=45)
m.drawmapboundary()
# build and plot coordinates onto map
x,y = m(lon,lat)
m.plot(x,y,'r*',markersize=5)
plt.title("Retail points in Sheffield")
plt.show()
# read postcode data
# This data will be used to assign postcodes to wardcodes.
postcodes = pd.read_csv('postcodes.csv')
# make a copy of the dataset
pcodes = postcodes.copy()
# select variables of interest (postcode and wardcode)
pcodes = pcodes.iloc[:,[2,6]]
# rename columns
pcodes = pcodes.rename(index=str, columns={"pcds": "postcode", "wd11cd": "ward_code"})
pcodes.head()
# read sheffield postcode data
sheffield_codes = pd.read_csv('./Sheffield_postcodes.csv')
# keep only postcodes that are still in use.
sheffield_codes = sheffield_codes.loc[sheffield_codes['In Use?']=='Yes',:]
# keep only columns that we need
sheffield_codes = sheffield_codes.iloc[:,[0,4,5]]
# rename dataset columns
sheffield_codes = sheffield_codes.rename(index=str, columns={"Postcode": "postcode", "Easting": "easting",
'Northing':'northing'})
# merge sheffield postcodes with ward dataset to get all wards in the city of sheffield
sheffield_codes = pd.merge(sheffield_codes, pcodes, on='postcode', how='inner')
# keep useful columns
sheffield_codes = sheffield_codes.iloc[:, [3,1,2]]
# more than one postcode is assigned to the same ward. group data by ward and take the mean of the
# coordinates to get a single location for each ward.
sheffield_codes = sheffield_codes.groupby('ward_code').mean().reset_index()
sheffield_codes.head(2)
print('There are ', str(sheffield_codes.shape[0]), ' demand zones in ', city)
# Read population data
# this gives data about the economically active population in the demand zone.
pop = pd.read_csv('./unemploy.csv')
population = pop
population = population.iloc[:, [1,5]]
population = population.iloc[1:]
population = population.rename(index=str, columns={"GEO_CODE": "ward_code",
"F244": "economically active"})
population.head()
At this point I need to generate data regarding the demand zones. The census data tells me how many people are economically active in each ward; an economically active individual is either employed or looking for employment.
I am assuming that each economically active individual in a ward spends 10 pounds in supermarkets each week. For example, a ward with 5,000 economically active residents is assumed to generate 50,000 pounds of weekly grocery expenditure. With this assumption I can calculate the expenditure of each demand zone, and the total expenditure that the model will allocate to the supermarkets.
# merge sheffield wardcodes with population data to get information about population in each ward.
demand = pd.merge(sheffield_codes, population, on='ward_code', how='inner')
# convert population column to float
demand['economically active'] = demand['economically active'].astype(float)
# I am assuming that each economically active individual in the demand zone will spend 10 pounds per week on food.
demand['expenditure'] = demand['economically active']*10
demand = demand.iloc[:,[0,1,2,4]]
demand = demand.set_index('ward_code')
demand.head()
print('The total expenditure is ', str(demand['expenditure'].sum()), ' per week')
The supply data needs information about the location and size of each retail point. Ideally we would have the exact size of each store; however, the Geolytics data only gives a range of values. For example, store X is smaller than 280 sqm, or between 280 and 1,400 sqm. To work around this, I generate a random integer within those bounds and take that to be the store size.
# merge retail points data with postcode data to attach ward code to each store.
supply = pd.merge(retail_points_shef, pcodes, on='postcode', how='inner')
# keep only useful columns
supply_d = supply.iloc[:,[0,1,5,6,7]]
# rename columns
supply_d = supply_d.rename(index=str, columns={"id": "id", "retailer": "retailer",
'bng_e':'easting',
'bng_n': 'northing',
'size_sqm':'size'})
# what are the ranges of sizes we have?
supply_d['size'].unique()
# generate random store sizes within the bounds of the ranges given.
def size(row):
    # default value if the size range is unspecified
    value = 0
    # otherwise, draw a random integer within the stated range
    if row['size'] == 'Between 280 and 1,400':
        value = randint(280, 1400)
    if row['size'] == 'Over 2,800':
        value = randint(2800, 3500)
    if row['size'] == 'Between 1,400 and 2,800':
        value = randint(1400, 2800)
    if row['size'] == 'Less than 280':
        value = randint(100, 280)
    return value
# apply function
supply_d['size'] = supply_d.apply(size, axis=1)
supply_d = supply_d.set_index('id')
supply_d.head()
Now that the supply dataset is complete, let's look at how many retail points each retailer has in Sheffield.
The graph below shows that the retailer with the highest number of retail points is The Co-operative Group, with 25 stores within the modelled area.
# bar graph showing which supermarkets have more retail points.
d = supply.groupby('retailer', as_index=False).count()
d = d.sort_values(by=['id'], ascending = False)
plotly.tools.set_credentials_file(username='giacopini', api_key='sAkPM0WuDb2Ah30ixDFx')
trace0 = go.Bar(
    x=d['retailer'],
    y=d['id'],
    marker=dict(
        color=['rgba(204,204,204,1)'] * 13),
)
data = [trace0]
layout = go.Layout(
    title='Retail Points by Retailer',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='text-hover-bar')
To recap, the expected flow of money between a demand zone and a destination is calculated by $$S_{ij} = A_i O_i W_j \exp(-\beta C_{ij})$$
Where,
$S_{ij}$ = expected flow of expenditure from origin $i$ to supermarket $j$
$A_i$ = balancing factor that takes into account competition and ensures that demand is allocated to stores within the modelled region. It is calculated as:
$$ A_i = \frac{1}{\sum_j W_j \exp(-\beta C_{ij})} $$
I will complete this analysis in five steps: (1) build the distance matrix $C_{ij}$ between demand zones and stores, (2) define the attractiveness $W_j$ of each store, (3) define the expenditure $O_i$ available in each demand zone, (4) compute the balancing factor $A_i$, and (5) compute the expected flows $S_{ij}$ and aggregate them into expected store revenues.
The cost matrix $C_{ij}$ contains the distance from each demand zone to each supermarket. In this case we calculate the Euclidean distance between the easting/northing coordinates.
$$ d(\mathbf{p}, \mathbf{q}) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} $$
# get coordinates of the demand zones
coords_demand = demand.iloc[:,[0,1]]
# create a column where both coordinates are combined into a tuple
coords_demand['coords'] = coords_demand[['easting', 'northing']].apply(tuple, axis=1)
# discard the columns where the coordinates were kept separately
coords_demand = coords_demand.iloc[:,2]
# the same process is done for the supply dataset
coords_supply = supply_d.iloc[:,[1,2]]
coords_supply['coords'] = coords_supply[['easting', 'northing']].apply(tuple, axis=1)
coords_supply = coords_supply.iloc[:,2]
# calculate the euclidean distance from every demand zone to every store and append to a list
distances = []
for j in range(demand.shape[0]):
    dist = []
    for i in range(supply_d.shape[0]):
        p = coords_demand.iloc[j]
        q = coords_supply.iloc[i]
        # easting/northing are in metres, so divide by 1000 to get kilometres
        d = sqrt((q[0] - p[0])**2 + (q[1] - p[1])**2) / 1000
        dist.append(d)
    distances.append(dist)
# distances is now a list of lists. convert it to a dataframe and then to a matrix
df = pd.DataFrame(distances)
# transpose so that rows are stores and columns are demand zones
df = df.transpose()
# convert to a numpy array: this is the matrix of distances
Cij = df.values
Cij
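As an aside, the same matrix can be built without the explicit double loop: scipy's distance_matrix (already imported above) computes all pairwise Euclidean distances in one call. A minimal sketch, assuming the easting/northing columns are British National Grid coordinates in metres; supply_xy, demand_xy and Cij_alt are just illustrative names.
# vectorised alternative: pairwise euclidean distances between stores and demand zones
supply_xy = supply_d[['easting', 'northing']].values   # stores
demand_xy = demand[['easting', 'northing']].values     # demand zones (wards)
# rows are stores, columns are demand zones, converted from metres to kilometres
Cij_alt = distance_matrix(supply_xy, demand_xy) / 1000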
$W_j$ is the attractiveness of each retail point. In this project we are assuming that the attractiveness of a store is given by its size.
Wj = np.array(supply_d.iloc[:,3])
Wj
$O_i$ is the expenditure available to consumers in each demand zone (origin).
Oi = np.array(demand.iloc[:,2])
Oi
$A_i$ is given by:
$$ A_i = \frac{1}{\sum_j W_j \exp(-\beta C_{ij})} $$
I am choosing a value of $\beta$ = 0.001.
beta = 0.001
# multiply the distance matrix by negative beta and take the exponential
exp = np.exp(-beta*Cij)
# multiply the exponential term by the attractiveness Wj
w = exp.T * Wj
# sum across stores for each demand zone
s = w.sum(axis=1)
# take the inverse of the sum to get the balancing factor
Ai = 1/s
Ai
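As a quick sanity check on the chosen $\beta$, the decay term $\exp(-\beta C_{ij})$ can be evaluated at a few representative distances to see how strongly the model penalises travel; the distances below are arbitrary illustrations, not values from the dataset.
# inspect the distance-decay factor at a few illustrative distances (km)
for c in [1, 2, 5, 10, 20]:
    print('distance', c, 'km -> decay factor', round(np.exp(-beta * c), 3))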
# expected flow matrix: rows are stores, columns are demand zones
Sij = w.T*(Ai*Oi)
# sum over demand zones to get the expected weekly revenue of each store
Sij = Sij.sum(axis=1)
# attach the expected revenue to the supply dataset
supply['expected revenue'] = Sij
supply.head()
# has all money been allocated?
total_revenue = Sij.sum()
total_demand = Oi.sum()
print('the total revenue is ', total_revenue, ' and the total money households have to spend on food is ', total_demand)
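The totals should match because of the balancing factor $A_i$. A finer check (a sketch using the intermediate variables computed above; Sij_full is just an illustrative name) is to confirm that every individual demand zone's expenditure is fully allocated, by looking at the flow matrix before it was summed over origins.
# full flow matrix (rows are stores, columns are demand zones), before summing over origins
Sij_full = w.T * (Ai * Oi)
# the money flowing out of each demand zone should equal that zone's available expenditure
print(np.allclose(Sij_full.sum(axis=0), Oi))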
# bar graph showing which retailer expects the highest weekly revenue
total_revs = supply.groupby('retailer', as_index=False).sum()
# sort from higher expected revenue to lower
total_revs = total_revs.sort_values(by=['expected revenue'], ascending = False)
trace0 = go.Bar(
    x=total_revs['retailer'],
    y=total_revs['expected revenue'],
    marker=dict(
        color=['rgba(222,45,38,0.8)'] + ['rgba(204,204,204,1)'] * 12),
)
data = [trace0]
layout = go.Layout(
    title='Retailer Weekly Expected Revenue',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='color-bar')
We can see that The Co-operative Group is the retailer that can expect the highest weekly revenue. This is likely because it has 25 retail points in Sheffield. The chart below shows that The Co-operative Group can expect to receive 24.2% of the expenditure available in the modelled area (Sheffield).
It is also important to point out that Morrisons, with just two stores in the city, is expected to attract 12% of total demand. This is possibly because the retailer has attractive stores in attractive locations.
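These shares can also be read directly from the expected revenue column; a small sketch (the share name is just illustrative) computing each retailer's percentage of the total allocated expenditure:
# expected market share of each retailer, as a percentage of total allocated expenditure
share = supply.groupby('retailer')['expected revenue'].sum()
share = (share / share.sum() * 100).sort_values(ascending=False).round(1)
print(share)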
# pie chart
labels = total_revs['retailer']
values = total_revs['expected revenue']
trace = go.Pie(labels=labels, values=values)
py.iplot([trace], filename='basic_pie_chart')
Thank you to the CDRC and the lecturers for introducing me to SIM, a technique I will definitely take into account when working on my own retailing research.