Setting up a new business is an arduous task. Entrepreneurs often embark on new business ventures with high hopes of success; however, research shows that a considerable portion of new firms exit the market soon after entering it (Fritsch et al., 2006). Understanding why this happens is crucial to the economy's stability and health, since business bankruptcy is costly not only to business owners and investors but also to the community as a whole (Pompe and Bilderbeek, 2005).
Previous success/failure prediction studies have focused on firm-specific characteristics to predict whether a business fails or succeeds (Lussier and Halabi, 2010). These characteristics include starting capital, the founder's industry and management experience, and the founder's education and age, among others. Although these factors are important in determining the success or failure of a business, Fritsch et al. (2006) argue that a limitation of such studies is that they do not account for the regional dimension. In fact, Fritsch et al. (2006) show that regional factors play an important role in predicting the survival of new businesses. This is in line with the concept of agglomeration economies, according to which cities and clusters of activity boost the productivity of firms located within them (Duranton and Kerr, 2015). Examples of industry clusters include Silicon Valley in the USA for tech firms, or the Sheffield (UK) area for cutlery manufacturing (Duranton and Overman, 2004).
The tendency of industries to agglomerate has been studied extensively by economists and geographers, who consider a firm's localisation to be a proxy for innovation and high performance (Claver-Cortés et al., 2015). Claver-Cortés et al. (2015) explain that a firm's high performance and innovation rate depend on two factors. The first is the firm's dynamic capabilities: the organisational skills that allow the business to grow, adapt, create internal and external resources, and maintain a competitive advantage despite a changing business environment. The second is the firm's absorptive capacity: the ability of the firm to use external knowledge to create a competitive advantage over other firms. From this definition it follows that absorptive capacity benefits from a firm's localisation in dynamic clusters.
The literature on agglomeration suggests that the concentration of economic activity generates different outcomes for firms. Appold (1995) assumes that a firm's profits are positively correlated with the number of firms located near it. In support of this, Fritsch et al. (2006) note that agglomeration can be beneficial thanks to a firm's proximity to research institutions such as universities, to large pools of customers, and to other companies in the same industry, which facilitates knowledge spillovers. More recent studies find, however, that agglomeration can also have negative effects on profits because of higher competition (Arikan and Schilling, 2010). Regarding competition, Fritsch et al. (2006) argue that the intensity of competition affects a firm's survival chances: competition is beneficial up to a certain threshold, after which it becomes a negative factor because firms compete too strongly for locations, employees, resources, customers and more. The authors also mention the unemployment rate as relevant in predicting firms' survival. First, high unemployment can signal a low growth rate in a region, which might affect a firm's performance. Second, high unemployment can mean a larger pool of available labour, which should contribute to a firm's success. Last, high unemployment can push unemployed people to start new businesses, which also affects the probability of success or failure of a firm.
Given the background illustrated above, this report aims to create a machine learning model able to predict the success or failure of new tech businesses in England. The factors taken into consideration are: population density, the share of the economically active population with a level 4 qualification, the economic activity and unemployment rates, the number of tech companies in the same ward and post town, the number of universities in the same post town, the company's mortgage/charge counts, and the company category.
The methods used to answer this question are logistic regression and the random forest algorithm.
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import pysal as ps
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
import statsmodels.api as sm
from collections import Counter
import collections
import itertools
from string import ascii_letters
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud
from math import log, sqrt
from sklearn.model_selection import train_test_split
import json
from pandas.io.json import json_normalize
import ast
import altair as alt
from sklearn.linear_model import LogisticRegression
%matplotlib inline
# read business data
business_data = pd.read_csv('./business_census2017.csv')
# make a working copy of the data
bus_data = business_data.copy()
# explore dataset
bus_data.head(3)
# explore variables
bus_data.info()
# how many observations and how many variables?
bus_data.shape
# subset data to keep only variables of interest
bus_data = bus_data.iloc[:, [1,2,3,4,14,17,18,19,22,23,24,25,26,27]]
# check variables left
bus_data.info()
# keep only tech companies in the dataset by filtering on SIC code
sic_codes = [62020, 62012, 62090]
# .copy() avoids SettingWithCopyWarning when new columns are added later
data_tech = bus_data.loc[bus_data['siccode'].isin(sic_codes)].copy()
62020 = Information technology consultancy activities
62012 = Business and domestic software development
62090 = Other information technology service activities
data_tech.shape
# select all companies opened within the last 15 years
# convert the date columns to datetime
columns = ['dissolutiondate', 'incorporationdate']
for col in columns:
    data_tech[col] = pd.to_datetime(data_tech[col], format='%Y-%m-%d')
# select all companies incorporated on or after 1 January 2002
data_tech_15 = data_tech.loc[data_tech['incorporationdate'] >= '2002-01-01'].copy()
# make city lower case
data_tech_15['posttown'] = data_tech_15['posttown'].str.lower()
# reset index
data_tech_15 = data_tech_15.reset_index(drop=True)
data_tech_15.head()
# subtract incorporation date from dissolution date to see how many days the company was open
data_tech_15['companylife'] = data_tech_15['dissolutiondate'] - data_tech_15['incorporationdate']
# convert the timedelta to a number of days (NaT, i.e. companies still open, becomes NaN)
data_tech_15['companylife'] = data_tech_15['companylife'].dt.days
# divide by 365 to express company life in years
data_tech_15['companylife'] = data_tech_15['companylife'] / 365
data_tech_15.head(2)
'closed' is a binary variable:
closed = 0: the company is still open
closed = 1: the company is closed
This will be used as the response variable for the machine learning models (an equivalent vectorized construction is sketched after the next cell).
%%time
# assign 0 (open) to every company unless it has a positive company life, i.e. a dissolution date, in which case assign 1 (closed)
def closed(row):
    value = 0
    # companylife is NaN for companies without a dissolution date, so they keep value 0
    if row['companylife'] > 0:
        value = 1
    return value

data_tech_15['closed'] = data_tech_15.apply(closed, axis=1)
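The same variable can also be built without the row-wise apply. A minimal vectorized sketch (closed_vectorized is just an illustrative name); companylife is greater than zero exactly when a dissolution date exists:
# vectorized equivalent of the apply above (NaN > 0 evaluates to False, so still-open companies get 0)
closed_vectorized = (data_tech_15['companylife'] > 0).astype(int)
# sanity check that both constructions agree
print((closed_vectorized == data_tech_15['closed']).all())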
closed = data_tech_15['closed'].value_counts()
# rename the index so the category labels show in the plot
closed = closed.rename({0: 'open', 1: 'closed'})
# plot graph
closed.plot.barh();
percentage = closed['closed'] / (closed['open'] + closed['closed'])
print('The percentage of closed companies in the dataset is: ' + str(percentage * 100) + ' %')
# create column containing year only
data_tech_15['year'] = data_tech_15['incorporationdate'].dt.year
# create list with unique years
years = sorted(data_tech_15['year'].unique(), key=int)
# create empty list
percentage_closed = []
# for each year
for i in years:
    # filter data by year
    datasplit = data_tech_15[data_tech_15['year'] == i]
    # count frequency of closed and open companies by year
    freq = datasplit['closed'].value_counts()
    closed = freq[1]
    opened = freq[0]
    # calculate the percentage of closed companies
    percentage = (closed / (opened + closed)) * 100
    # append the percentage to the list
    percentage_closed.append(percentage)
# create a dataframe containing years and percentage of closed companies for that year
percentage_cl = pd.DataFrame({
'incorporation year': years,
'% of closed companies': percentage_closed
})
# Plot the data
percentage_cl.plot.line(x='incorporation year', y='% of closed companies', title='Percentage of closed companies by incorporation year (2002-2017)')
# get unique country of origin values
origin = data_tech_15['countryoforigin'].value_counts()
origin.plot.barh()
All companies are from the United Kingdom.
typ = data_tech_15['companycategory'].value_counts()
typ
# simplify the categories
privat = ['Private Limited Company', 'PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)',
          "PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption)",
          'Private Unlimited Company', 'Private Unlimited']
for i in privat:
    data_tech_15['companycategory'] = np.where(data_tech_15['companycategory'] == i, 'Private', data_tech_15['companycategory'])
# recount the categories after simplification so the plot reflects the new labels
typ = data_tech_15['companycategory'].value_counts()
typ.plot.barh()
Most companies are private limited companies.
Let's look at whether open and closed companies are distributed similarly across company categories.
closed_type = data_tech_15[data_tech_15['closed']==1]
closed_count = closed_type['companycategory'].value_counts()
closed_count
closed_count.plot.barh()
open_type = data_tech_15[data_tech_15['closed']==0]
open_count = open_type['companycategory'].value_counts()
open_count
open_count.plot.barh()
At this point I attach regional characteristics to the dataset.
This information is taken from the 2011 census.
# read data
postcodes = pd.read_csv('./postcodes.csv')
# make a copy of the dataset
pcodes = postcodes.copy()
# select variables of interest (postcode and ward code)
pcodes = pcodes.iloc[:, [2, 6]]
# rename columns
pcodes = pcodes.rename(index=str, columns={"pcds": "postcode", "wd11cd": "ward_code"})
pcodes.head()
Qualifications metadata, taken from the UK Data Service website:
No qualifications: no academic or professional qualifications.
Level 1 qualifications: 1-4 O Levels/CSE/GCSEs (any grades), Entry Level, Foundation Diploma, NVQ level 1, Foundation GNVQ, Basic/Essential Skills.
Level 2 qualifications: 5+ O Levels (Passes)/CSEs (Grade 1)/GCSEs (Grades A*-C), School Certificate, 1 A Level/2-3 AS Levels/VCEs, Intermediate/Higher Diploma, Welsh Baccalaureate Intermediate Diploma, NVQ level 2, Intermediate GNVQ, City and Guilds Craft, BTEC First/General Diploma, RSA Diploma, Apprenticeship.
Level 3 qualifications: 2+ A Levels/VCEs, 4+ AS Levels, Higher School Certificate, Progression/Advanced Diploma, Welsh Baccalaureate Advanced Diploma, NVQ Level 3; Advanced GNVQ, City and Guilds Advanced Craft, ONC, OND, BTEC National, RSA Advanced Diploma.
Level 4+ qualifications: Degree (for example BA, BSc), Higher Degree (for example MA, PhD, PGCE), NVQ Level 4-5, HNC, HND, RSA Higher Diploma, BTEC Higher level, Foundation degree (NI), Professional qualifications (for example teaching, nursing, accountancy).
Other qualifications: Vocational/Work-related Qualifications, Foreign Qualifications (Not stated/ level unknown).
# read data
qualifications = pd.read_csv('./qualifications.csv')
# copy dataset
qualif = qualifications.copy()
# subset
qualif = qualif.iloc[:, [1, 9]]
# rename
qualif = qualif.rename(index=str, columns={"GEO_CODE": "ward_code", "F192": "Economically Active - level_4 qualification"})
# delete first row
qualif = qualif.iloc[1:]
qualif.head()
# read data
population = pd.read_csv('./residents.csv')
# copy data
pop = population.copy()
pop = pop.iloc[:, [1,5]]
pop = pop.iloc[1:]
pop = pop.rename(index=str, columns={"GEO_CODE": "ward_code", "F2384": "population density"})
pop.head()
Economic activity relates to whether or not a person who was aged 16 and over was working or looking for work in the week before census. Rather than a simple indicator of whether or not someone was currently in employment, it provides a measure of whether or not a person was an active participant in the labour market. A person’s economic activity is derived from their ‘activity last week’. This is an indicator of their status or availability for employment - whether employed, unemployed, or their status if not employed and not seeking employment. Additional information included in the economic activity classification is also derived from information about the number of hours a person works and their type of employment - whether employed or self-employed. The census concept of economic activity is compatible with the standard for economic status defined by the International Labour Organisation (ILO). It is one of a number of definitions used internationally to produce accurate and comparable statistics on employment, unemployment and economic status.
# read data
unemployment = pd.read_csv('./unemploy.csv')
# copy data
unemploy = unemployment.copy()
unemploy = unemploy.iloc[:, [1,5,6]]
unemploy = unemploy.iloc[1:]
unemploy = unemploy.rename(index=str, columns={"GEO_CODE": "ward_code",
"F244": "Economically Active",
'F248': 'Economically Active - Unemployed'})
unemploy.head()
# read data
universities = pd.read_csv('./Harris.csv')
# copy data
uni = universities.copy()
uni = uni.iloc[:, [4,6,13,15]]
uni = uni.rename(index=str, columns={"Unnamed: 4": "University Name",
'Unnamed: 6': 'University type group',
'Unnamed: 13': 'posttown',
'Unnamed: 15': 'postcode'})
uni = uni.iloc[1:]
uni.head()
# check the values of the university type group
uni['University type group'].unique()
# filter universities
uni = uni.loc[uni['University type group']== 'Universities']
# make posttown lowercase
uni['posttown'] = uni['posttown'].str.lower()
uni = uni.reset_index(drop = True)
uni.head()
# recall business dataset
data_tech_15.head()
# delete more columns
data_tech_15 = data_tech_15.iloc[:, [2,3,4,5,6,9,10,11,12,15]]
data_tech_15.head()
# merge postcode dataset to get ward code
merge1 = pd.merge(data_tech_15, pcodes, on='postcode', how='inner')
merge2 = pd.merge(merge1, pop, on='ward_code', how='left')
merge3 = pd.merge(merge2, qualif, on='ward_code', how='left')
merge4 = pd.merge(merge3, unemploy, on='ward_code', how='left')
merge4.head()
# check type of variables
merge4.info()
# make the census columns floats
cols = ['population density', 'Economically Active - level_4 qualification', 'Economically Active',
        'Economically Active - Unemployed']
for col in cols:
    merge4[col] = merge4[col].astype(float)
# create new variables
merge4['% economically active with level 4 qualification'] = merge4['Economically Active - level_4 qualification']/\
merge4['Economically Active']
merge4['% economically active'] = merge4['Economically Active']/\
merge4['population density']
merge4['% economically active unemployed'] = merge4['Economically Active - Unemployed']/\
merge4['Economically Active']
# attach number of companies in the same town, and number of companies in the same ward
companies_in_ward = pd.DataFrame(merge4['ward_code'].value_counts())
companies_in_ward = companies_in_ward.reset_index(drop=False)
companies_in_ward = companies_in_ward.rename(index=str, columns={"index": "ward_code",
'ward_code':'n. of companies in ward'})
companies_in_city = pd.DataFrame(merge4['posttown'].value_counts())
companies_in_city = companies_in_city.reset_index(drop=False)
companies_in_city = companies_in_city.rename(index=str, columns={"index": "posttown",
'posttown':'n. of companies in city'})
merge5 = pd.merge(merge4, companies_in_city, on='posttown', how='left')
merge6 = pd.merge(merge5, companies_in_ward, on='ward_code', how='left')
merge6.head()
# delete columns I won't use
merge6 = merge6.iloc[:, [0,1,2,3,4,5,6,7,8,9,11,15,16,17,18,19]]
# add number of universities in the same city
universities_in_city = pd.DataFrame(uni['posttown'].value_counts())
universities_in_city = universities_in_city.reset_index(drop=False)
universities_in_city = universities_in_city.rename(index=str, columns={"index": "posttown",
'posttown':'n. of universities in city'})
universities_in_city.head()
# merge with data
merge7 = pd.merge(merge6, universities_in_city, on='posttown', how='left')
total = merge7
total = total.fillna(0)
total.head()
# scatter plot
total_num = total.iloc[:,9:]
total_num = total_num.dropna()
total_num.head()
sns.pairplot(total_num, kind='scatter', hue= 'closed');
# functions to encode the 'closed' variable as labels ('open'/'closed') and decode it back to 0/1
def code(row):
    value = 'closed'
    if row['closed'] == 0:
        value = 'open'
    return value

def code_reverse(row):
    value = 1
    if row['closed'] == 'open':
        value = 0
    return value
%%time
total['closed'] = total.apply(code, axis = 1)
# values for which to subset
closed = ['closed', 'open']
# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    # Draw the density plot
    sns.distplot(subset['% economically active unemployed'], hist=False, kde=True,
                 kde_kws={'linewidth': 3},
                 label='areas where companies are ' + i)
# Plot formatting
plt.legend(prop={'size': 12}, title = '')
plt.title('Lower Unemployment Areas tend to host companies that are still open')
plt.xlabel('unemployment')
plt.ylabel('Density')
The plot shows that areas where companies are still open have a lower percentage of unemployment.
# values for which to subset
closed = ['closed', 'open']
# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    # Draw the density plot
    sns.distplot(subset['% economically active'], hist=False, kde=True,
                 kde_kws={'linewidth': 3},
                 label='areas where companies are ' + i)
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('Areas where there is a greater % of economically active population tend to host open companies')
plt.xlabel('Economically active')
plt.ylabel('Density')
# values for which to subset
closed = ['closed', 'open']
# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    # Draw the density plot
    sns.distplot(subset['% economically active with level 4 qualification'], hist=False, kde=True,
                 kde_kws={'linewidth': 3},
                 label='areas where companies are ' + i)
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('Areas where there is a lower percentage of population with university degrees tend to host closed companies')
plt.xlabel('Higher level education')
plt.ylabel('Density')
I am going to scale the data because the variables in the dataset are on different scales. For example, some are percentages and some are counts of mortgage charges.
# scale data frame
# columns to scale
toscale = ['nummortcharges', 'nummortoutstanding', \
'nummortpartsatisfied', 'nummortsatisfied', \
'population density','% economically active with level 4 qualification',\
'% economically active', '% economically active unemployed',\
'n. of companies in city', 'n. of companies in ward',\
'n. of universities in city']
#scale
scaled = pd.DataFrame(scale(total[toscale]),
index=total.index,
columns=toscale)
# check that standard deviation is all the same
scaled.describe().head(3)
# add variables to scaled dataset
scaled['closed'] = total['closed']
scaled['companycategory'] = total['companycategory']
# create dummy variables
cat_vars=['companycategory']
for var in cat_vars:
    cat_list = pd.get_dummies(scaled[var], prefix=var)
    data1 = scaled.join(cat_list)
    data1 = data1.drop(var, axis=1)
    scaled = data1
data_vars=scaled.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
# code the response variable back to 0 and 1 so it can be used in the regression
scaled['closed'] = scaled.apply(code_reverse, axis = 1)
cols = ['nummortcharges', 'nummortoutstanding', 'nummortpartsatisfied',
'nummortsatisfied', 'population density',
'% economically active with level 4 qualification',
'% economically active', '% economically active unemployed',
'n. of companies in city', 'n. of companies in ward',
'n. of universities in city',
'companycategory_Community Interest Company', 'companycategory_Private',
'companycategory_Public Limited Company']
X = scaled[cols]
y = scaled['closed']
logit_model=sm.Logit(y,X.astype(float))
result=logit_model.fit()
result.summary2()
Only a subset of the variables is statistically significant; these are kept in the next cell.
# keep only significant variables
sig_cols = ['nummortoutstanding', 'population density',
'% economically active with level 4 qualification',
'% economically active', '% economically active unemployed',
'n. of companies in city',
'n. of universities in city',]
X = scaled[sig_cols]
y = scaled['closed']
logit_model=sm.Logit(y,X.astype(float))
result1=logit_model.fit()
result1.summary2()
# split data into training set and testing set
x_train, x_test, y_train, y_test = train_test_split(scaled[sig_cols],
                                                     scaled['closed'],
                                                     test_size=0.20, random_state=0)
# baseline model
# class_weight='balanced' is added because the dataset is imbalanced
logisticRegr = LogisticRegression(random_state=0, class_weight='balanced')
m1 = logisticRegr.fit(x_train, y_train)
# make prediction
predictions_log = logisticRegr.predict(x_test)
Accuracy is defined as:
correct predictions / total number of data points
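As a sanity check, the same ratio can be computed by hand from the predictions. This is a minimal sketch assuming predictions_log and y_test from the cells above; manual_accuracy is just an illustrative name.
# manual accuracy: share of test points where the predicted label matches the true label
# (np.ravel guards against y_test being a one-column DataFrame instead of a Series)
manual_accuracy = np.mean(predictions_log == np.ravel(y_test))
print('Manually computed accuracy: ' + str(manual_accuracy * 100) + ' %')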
# accuracy
score = logisticRegr.score(x_test, y_test)
print('The accuracy of the model is ' + str(score*100)+ ' %')
A confusion matrix shows the number of observations assigned to the correct class and the number assigned to the wrong class.
# confusion matrix
cm = metrics.confusion_matrix(y_test, predictions_log)
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
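To make the matrix easier to read, its four cells can be unpacked explicitly. A minimal sketch assuming the cm computed above, where rows are actual labels and columns are predicted labels (0 = open, 1 = closed):
# unpack the confusion matrix: cm[actual, predicted], with labels ordered [0 (open), 1 (closed)]
tn, fp, fn, tp = cm.ravel()
print('open correctly predicted as open (true negatives): ' + str(tn))
print('open wrongly predicted as closed (false positives): ' + str(fp))
print('closed wrongly predicted as open (false negatives): ' + str(fn))
print('closed correctly predicted as closed (true positives): ' + str(tp))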
# random forest: a classifier is used, since 'closed' is a binary outcome and the comparison is on accuracy
RF = RandomForestClassifier(random_state=0).fit(x_train, y_train)
predictions_RF = RF.predict(x_test)
score_RF = RF.score(x_test, y_test)
print('The accuracy of the random forest model is ' + str(score_RF * 100) + ' %')
The random forest model achieves an accuracy comparable to that of the logistic regression model.
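Because the classes are imbalanced, accuracy alone can hide differences between the two models; per-class precision and recall give a fuller picture. A minimal sketch assuming predictions_log, predictions_RF and y_test from the cells above:
# per-class precision and recall for both models on the same test split (label 0 = open, 1 = closed)
print('Logistic regression:')
print(metrics.classification_report(y_test, predictions_log, target_names=['open', 'closed']))
print('Random forest:')
print(metrics.classification_report(y_test, predictions_RF, target_names=['open', 'closed']))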