Prediction of Business Success/Failure From Business Regional Characteristics (Supervised Learning)

Setting up a new business is an arduous task. Entrepreneurs often embark on new ventures with high hopes of success; however, research shows that a considerable portion of new firms exit the market soon after entering it (Fritsch et al., 2006). Understanding why this happens is crucial to the economy's stability and health, since business bankruptcy is costly not only to business owners and investors but also to the community as a whole (Pompe and Bilderbeek, 2005).

Previous success/failure modelling studies have focused on using firm-specific characteristics to predict whether a business fails or succeeds (Lussier and Halabi, 2010). These characteristics include starting capital and the founder's industry and management experience, education and age, among many others. Although these factors are important in determining the success or failure of a business, Fritsch et al. (2006) argue that a limitation of such studies is that they do not account for the regional dimension. In fact, Fritsch et al. (2006) show that regional factors play an important role in predicting the survival of new businesses. This is in line with the concept of agglomeration economies, according to which cities and clusters of activity boost the productivity of firms located within them (Duranton and Kerr, 2015). Examples of industry clusters include Silicon Valley in the USA for tech firms, or the Sheffield (U.K.) area for cutlery manufacturing (Duranton and Overman, 2004).

The tendency of industries to agglomerate has been studied extensively by economists and geographers, who believe a firm's localisation to be a proxy for innovation and high performance (Claver-Cortès et al., 2015). Claver-Cortès et al. (2015) explain that a firm's high performance and innovation rate stem from two factors. The first is the firm's dynamic capabilities: the organisational skills that allow the business to grow, adapt, create internal and external resources, and maintain a competitive advantage despite a changing business environment. The second is the firm's absorptive capacity: the ability of the firm to utilise external knowledge to create a competitive advantage over other firms. From this definition it follows that absorptive capacity benefits from a firm's localisation in dynamic clusters.

The agglomeration literature suggests that the concentration of economic activity generates different outcomes for firms. Appold (1995) assumes that a firm's profits are positively correlated with the number of firms located near it. In support of this, Fritsch et al. (2006) affirm that agglomeration can be beneficial due to a firm's proximity to research institutions such as universities, to large pools of customers, and to other companies in the same industry, which facilitates knowledge spillovers. More recent studies found, however, that agglomeration can also have negative effects on profits because of higher competition (Arikan and Schilling, 2010). Regarding competition, Fritsch et al. (2006) argue that the intensity of competition affects a firm's survival chances: competition is beneficial until it reaches a certain threshold, after which it becomes a negative factor because of excessive competition for locations, employees, resources, customers and more. The authors also find the unemployment rate relevant in predicting firms' survival, for three reasons. First, high unemployment can indicate a region's low growth rate, which might affect a firm's performance negatively or positively. Second, high unemployment can also mean greater availability of labour, which should contribute to a firm's success. Last, high unemployment can lead unemployed people to create new businesses, which in turn affects a firm's probability of success or failure.

Given the background illustrated above, this report aims to create a machine learning model able to predict the success or failure of new tech businesses in England. The factors taken into consideration are:

  • Regional characteristics, including:
      • Population density
      • Density of firms in the same sector
      • Unemployment rate
      • Availability of research institutions
  • Firm-specific characteristics (limited by data availability)

The methods used to answer this question are logistic regression and a random forest algorithm.

In [1]:
# Import packages

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import pysal as ps
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
import statsmodels.api as sm
from collections import Counter
import collections
import itertools
from string import ascii_letters
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud
from math import log, sqrt
from sklearn.model_selection import train_test_split
import json
from pandas.io.json import json_normalize
import ast
import altair as alt
from sklearn.linear_model import LogisticRegression

%matplotlib inline
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [2]:
# read business data
business_data = pd.read_csv('./business_census2017.csv', low_memory=False)
In [3]:
# copy data so the original frame stays untouched
bus_data = business_data.copy()
In [4]:
# explore dataset
bus_data.head(3)
Out[4]:
id companynumber countryoforigin dissolutiondate incorporationdate accountingrefday accountingrefmonth account_nextduedate account_lastmadeupdate accountscategory ... nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode sicdescription return_nextduedate return_lastmadeupdate confirmation_nextduedate confirmation_lastmadeupdate
0 1 07382019 UNITED KINGDOM 2018-03-01 2010-09-21 30.0 9.0 2017-06-30 2015-09-30 DORMANT ... 0.0 0.0 0.0 0.0 74100.0 specialised design activities 2016-10-19 2015-09-21 2017-10-05 2016-09-21
1 2 04753368 UNITED KINGDOM 2017-04-01 2003-05-03 30.0 4.0 2017-01-31 2014-10-31 TOTAL EXEMPTION FULL ... 0.0 0.0 0.0 0.0 82990.0 Other business support service activities n.e.c. 2017-05-31 2016-05-03 2018-05-17 NaN
2 3 SC421617 UNITED KINGDOM NaN 2012-04-11 30.0 3.0 2017-12-30 2016-03-30 TOTAL EXEMPTION SMALL ... 0.0 0.0 0.0 0.0 70229.0 Management consultancy activities other than f... 2017-05-09 2016-04-11 2018-04-25 NaN

3 rows × 33 columns

In [5]:
# explore variables
bus_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4077038 entries, 0 to 4077037
Data columns (total 33 columns):
id                             int64
companynumber                  object
countryoforigin                object
dissolutiondate                object
incorporationdate              object
accountingrefday               float64
accountingrefmonth             float64
account_nextduedate            object
account_lastmadeupdate         object
accountscategory               object
careof                         object
pobox                          object
addressline1                   object
addressline2                   object
posttown                       object
county                         object
country                        object
postcode                       object
companycategory                object
companyname                    object
numgenpartners                 float64
numlimpartners                 float64
companystatus                  object
nummortcharges                 float64
nummortoutstanding             float64
nummortpartsatisfied           float64
nummortsatisfied               float64
siccode                        float64
sicdescription                 object
return_nextduedate             object
return_lastmadeupdate          object
confirmation_nextduedate       object
confirmation_lastmadeupdate    object
dtypes: float64(9), int64(1), object(23)
memory usage: 1.0+ GB
In [6]:
# how many observations and how many variables? 
bus_data.shape
Out[6]:
(4077038, 33)

Data Pre-Processing

In [7]:
# subset data to keep only variables of interest
bus_data = bus_data.iloc[:, [1,2,3,4,14,17,18,19,22,23,24,25,26,27]]
In [8]:
# check variables left
bus_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4077038 entries, 0 to 4077037
Data columns (total 14 columns):
companynumber           object
countryoforigin         object
dissolutiondate         object
incorporationdate       object
posttown                object
postcode                object
companycategory         object
companyname             object
companystatus           object
nummortcharges          float64
nummortoutstanding      float64
nummortpartsatisfied    float64
nummortsatisfied        float64
siccode                 float64
dtypes: float64(5), object(9)
memory usage: 435.5+ MB

Filter the data set to include tech companies opened within the last 15 years

In [9]:
# keep only tech companies in the dataset by filtering on SIC code
sic_codes = [62020, 62012, 62090]
data_tech = bus_data.loc[bus_data['siccode'].isin(sic_codes)].copy()

SIC codes I am keeping:

62020 = Information technology consultancy activities

62012 = Business and domestic software development

62090 = Other information technology service activities

In [10]:
data_tech.shape
Out[10]:
(208455, 14)
In [11]:
# select all companies opened within the last 15 years. 

# convert date columns to date 
columns = ['dissolutiondate', 'incorporationdate']

for col in columns:
    data_tech[col] = pd.to_datetime(data_tech[col], format='%Y-%m-%d')

# select all companies opened after 1 January 2002
data_tech_15 = data_tech.loc[data_tech['incorporationdate'] >= '2002-01-01'].copy()
In [12]:
# make city lower case
data_tech_15['posttown'] = data_tech_15['posttown'].str.lower()
In [13]:
# reset index
data_tech_15 = data_tech_15.reset_index(drop=True)
In [14]:
data_tech_15.head()
Out[14]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private Limited Company 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private Limited Company (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0
2 07649435 UNITED KINGDOM NaT 2011-05-26 london EC4N 6NP Private Limited Company JUJU MEDIA LIMITED Active 0.0 0.0 0.0 0.0 62090.0
3 06534666 UNITED KINGDOM NaT 2008-03-14 bristol BS1 2NT Private Limited Company (MID) DOTCOM LIMITED Active 0.0 0.0 0.0 0.0 62020.0
4 05153997 UNITED KINGDOM NaT 2004-06-15 bracknell RG12 1QS Private Limited Company *+ LTD Active 0.0 0.0 0.0 0.0 62090.0

Create new variables

  • Company Life
In [15]:
# subtract incorporation date from dissolution date to see how many days the company was open
data_tech_15['companylife'] = data_tech_15['dissolutiondate'] - data_tech_15['incorporationdate']
In [16]:
# convert the timedelta to days, then divide by 365 to express company life in years
data_tech_15['companylife'] = data_tech_15['companylife'].dt.days / 365
In [17]:
data_tech_15.head(2)
Out[17]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode companylife
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private Limited Company 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private Limited Company (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0 10.843836
  • Closed: a binary variable

    closed = 0: the company is still open
    closed = 1: the company is closed

This will be used as the response variable for machine learning.

In [18]:
%%time
# create a function that assigns 0 to each row unless the company is closed
def closed(row):
    value = 0
    if row['companylife'] > 0:
        value = 1
    return value

data_tech_15['closed'] = data_tech_15.apply(closed, axis=1)
CPU times: user 19.5 s, sys: 480 ms, total: 20 s
Wall time: 24.7 s
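
The row-wise apply above takes about 20 seconds; a vectorized equivalent (a sketch, assuming a recorded dissolution date always means the company has closed) produces the same column in milliseconds:

# vectorized alternative: companies with a dissolution date are coded 1 (closed)
data_tech_15['closed'] = data_tech_15['dissolutiondate'].notna().astype(int)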

Data Exploration

  • How many companies are still open, and how many are closed?
In [19]:
closed = data_tech_15['closed'].value_counts()

# rename the index values so the categories show in the plot
closed = closed.rename({0: 'open', 1: 'closed'})
# plot graph 

closed.plot.barh();
  • What is the percentage of open companies and closed companies?
In [20]:
percentage = closed[1] / (closed[0] + closed[1])

print('The % of closed companies in the dataset is: ' + str(percentage * 100) + ' %')
The % of closed companies in the dataset is: 11.763522258839908 %
  • Are companies founded in specific years more likely to be closed?
In [21]:
# create column containing year only 
data_tech_15['year'] = data_tech_15['incorporationdate'].dt.year

# create list with unique years
years = sorted((data_tech_15['year'].unique()) , key= int) 

# create empty list 
percentage_closed = []

# for each year 
for i in years:
    # filter data by year 
    datasplit = data_tech_15[data_tech_15['year'] == i]
    # count frequency of closed and open companies by year 
    freq = datasplit['closed'].value_counts()
    closed = freq[1]
    opened = freq[0]
    # calculate the percentage of closed companies 
    percentage = (closed / (opened + closed))*100
    # append percentages on empty list 
    percentage_closed.append(percentage)
    
# create a dataframe containing years and percentage of closed companies for that year 
percentage_cl = pd.DataFrame({
    'incorporation year': years,
    '% of closed companies': percentage_closed
    })

# Plot the data
percentage_cl.plot.line(x = 'incorporation year', y = '% of closed companies', title = 'Percentage of closed companies by incorporation year, 2002-2017')
         
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c26fce630>
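
The same per-year percentages can be computed without an explicit loop (a sketch over the same frame; since 'closed' is 0/1 at this point, its mean is the closed share):

# groupby equivalent of the loop above
percentage_cl = (data_tech_15.groupby('year')['closed']
                 .mean().mul(100)
                 .rename('% of closed companies')
                 .reset_index()
                 .rename(columns={'year': 'incorporation year'}))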
  • What are the companies' countries of origin?
In [22]:
# get unique country of origin values
origin = data_tech_15['countryoforigin'].value_counts()
In [23]:
origin.plot.barh()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c215cdac8>

All companies are from the United Kingdom.

  • What types of companies are in the data?
In [24]:
typ = data_tech_15['companycategory'].value_counts()
typ
Out[24]:
Private Limited Company                                                                      183428
PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)                           188
Community Interest Company                                                                      102
PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption)        87
Public Limited Company                                                                           65
Private Unlimited Company                                                                        11
Private Unlimited                                                                                 1
Name: companycategory, dtype: int64
In [25]:
# simplify the categories: collapse all private company types into 'Private'
privat = ['Private Limited Company','PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)',
         "PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption)",
         'Private Unlimited Company','Private Unlimited']

data_tech_15.loc[data_tech_15['companycategory'].isin(privat), 'companycategory'] = 'Private'
In [26]:
# recompute the counts after simplification, then plot
typ = data_tech_15['companycategory'].value_counts()
typ.plot.barh()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2549a7b8>

Most companies are Private Limited Companies.

Let's look at whether open and closed companies are distributed similarly across company categories.

In [27]:
closed_type = data_tech_15[data_tech_15['closed']==1]
closed_count = closed_type['companycategory'].value_counts()

closed_count
Out[27]:
Private                       21608
Community Interest Company       18
Public Limited Company            5
Name: companycategory, dtype: int64
In [28]:
closed_count.plot.barh()
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c253d9f28>
In [29]:
open_type = data_tech_15[data_tech_15['closed']==0]
open_count = open_type['companycategory'].value_counts()

open_count
Out[29]:
Private                       162107
Community Interest Company        84
Public Limited Company            60
Name: companycategory, dtype: int64
In [30]:
open_count.plot.barh()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1c863710>

Data Linkage

  • At this point I attach regional characteristics to my dataset.

  • This information is taken from the 2011 census.

  • Postcode Data
In [31]:
# read data
postcodes = pd.read_csv('./postcodes.csv', low_memory=False)
In [32]:
# make a copy of the dataset 
pcodes = postcodes
# select variables of interest (postcode and ward code)
pcodes = pcodes.iloc[:,[2,6]]
# rename columns
pcodes = pcodes.rename(index=str, columns={"pcds": "postcode", "wd11cd": "ward_code"})
pcodes.head()
Out[32]:
postcode ward_code
0 AL1 1AG E05004802
1 AL1 1AJ E05004802
2 AL1 1AR E05004789
3 AL1 1AS E05004799
4 AL1 1BH E05004802
  • Qualifications Data

Qualifications metadata taken from the UK Data Service website.

  • No qualifications: No academic or professional qualifications.

  • Level 1 qualifications: 1-4 O Levels/CSE/GCSEs (any grades), Entry Level, Foundation Diploma, NVQ level 1, Foundation GNVQ, Basic/Essential Skills.

  • Level 2 qualifications: 5+ O Level (Passes)/CSEs (Grade 1)/GCSEs (Grades A*-C), School Certificate, 1 A Level/ 2-3 AS Levels/VCEs, Intermediate/Higher Diploma, Welsh Baccalaureate Intermediate Diploma, NVQ level 2, Intermediate GNVQ, City and Guilds Craft, BTEC First/General Diploma, RSA Diploma Apprenticeship.

  • Level 3 qualifications: 2+ A Levels/VCEs, 4+ AS Levels, Higher School Certificate, Progression/Advanced Diploma, Welsh Baccalaureate Advanced Diploma, NVQ Level 3; Advanced GNVQ, City and Guilds Advanced Craft, ONC, OND, BTEC National, RSA Advanced Diploma.

  • Level 4+ qualifications: Degree (for example BA, BSc), Higher Degree (for example MA, PhD, PGCE), NVQ Level 4-5, HNC, HND, RSA Higher Diploma, BTEC Higher level, Foundation degree (NI), Professional qualifications (for example teaching, nursing, accountancy).

  • Other qualifications: Vocational/Work-related Qualifications, Foreign Qualifications (Not stated/ level unknown).

In [33]:
# read data
qualifications = pd.read_csv('./qualifications.csv')
In [34]:
# copy dataset
qualif = qualifications

# subset 
qualif = qualif.iloc[:, [1,9]]

# rename
qualif = qualif.rename(index=str, columns={"GEO_CODE": "ward_code", "F192": "Economically Active - level_4 qualification"})

# delete first row
qualif = qualif.iloc[1:]

qualif.head()
Out[34]:
ward_code Economically Active - level_4 qualification
1 95AA01 977
2 95AA02 230
3 95AA03 167
4 95AA04 634
5 95AA05 546
  • Population Density Data
In [35]:
# read data
population = pd.read_csv('./residents.csv')
In [36]:
# copy data 
pop = population
In [37]:
pop = pop.iloc[:, [1,5]]
pop = pop.iloc[1:]
pop = pop.rename(index=str, columns={"GEO_CODE": "ward_code", "F2384": "population density"})
pop.head()
Out[37]:
ward_code population density
1 95AA01 4444
2 95AA02 1734
3 95AA03 1822
4 95AA04 3013
5 95AA05 2701
  • Unemployment Data

Economic activity relates to whether or not a person aged 16 and over was working or looking for work in the week before the census. Rather than a simple indicator of whether or not someone was currently in employment, it provides a measure of whether or not a person was an active participant in the labour market. A person's economic activity is derived from their 'activity last week', an indicator of their status or availability for employment - whether employed, unemployed, or their status if not employed and not seeking employment.

Additional information in the economic activity classification is derived from the number of hours a person works and their type of employment - whether employed or self-employed. The census concept of economic activity is compatible with the standard for economic status defined by the International Labour Organisation (ILO); it is one of a number of definitions used internationally to produce accurate and comparable statistics on employment, unemployment and economic status.

In [38]:
# read data
unemployment = pd.read_csv('./unemploy.csv')
In [39]:
# copy data 
unemploy = unemployment
In [40]:
unemploy = unemploy.iloc[:, [1,5,6]]
unemploy = unemploy.iloc[1:]
unemploy = unemploy.rename(index=str, columns={"GEO_CODE": "ward_code", 
                                     "F244": "Economically Active",
                                     'F248': 'Economically Active - Unemployed'})
unemploy.head()
Out[40]:
ward_code Economically Active Economically Active - Unemployed
1 95AA01 2473 107
2 95AA02 892 57
3 95AA03 840 57
4 95AA04 1660 72
5 95AA05 1362 77
  • Universities Data
In [41]:
# read data 
universities = pd.read_csv('./Harris.csv')
In [42]:
# copy 
uni = universities
In [43]:
uni = uni.iloc[:, [4,6,13,15]]
uni = uni.rename(index=str, columns={"Unnamed: 4": "University Name", 
                                     'Unnamed: 6': 'University type group',
                                     'Unnamed: 13': 'posttown',
                                     'Unnamed: 15': 'postcode'})
uni = uni.iloc[1:]


uni.head()
Out[43]:
University Name University type group posttown postcode
1 Sir John Cass's Foundation Primary School LA maintained schools London EC3A 5DE
2 City of London School for Girls Independent schools London EC2Y 8BB
3 St Paul's Cathedral School Independent schools London EC4M 9AD
4 City of London School Independent schools London EC4V 3AL
5 The Charterhouse Square School Independent schools London EC1M 6EA
In [44]:
# check the value of type group. 
uni['University type group'].unique()
Out[44]:
array(['LA maintained schools', 'Independent schools', 'Universities',
       'Free Schools', 'Academies', 'Colleges', 'Special schools',
       'Other types'], dtype=object)
In [47]:
# filter universities
uni = uni.loc[uni['University type group']== 'Universities']
In [48]:
# make posttown lowercase 
uni['posttown'] = uni['posttown'].str.lower()
uni = uni.reset_index(drop = True)
In [49]:
uni.head()
Out[49]:
University Name University type group posttown postcode
0 London Metropolitan University Universities london N7 8DB
1 Guildhall School of Music and Drama Universities london EC2Y 8DT
2 The Royal Veterinary College Universities london NW1 0TU
3 Central School of Speech and Drama Universities london NW3 3HY
4 University College London Universities london WC1E 6BT
In [50]:
# recall business dataset
data_tech_15.head()
Out[50]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode companylife closed year
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2011
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0 10.843836 1 2007
2 07649435 UNITED KINGDOM NaT 2011-05-26 london EC4N 6NP Private JUJU MEDIA LIMITED Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2011
3 06534666 UNITED KINGDOM NaT 2008-03-14 bristol BS1 2NT Private (MID) DOTCOM LIMITED Active 0.0 0.0 0.0 0.0 62020.0 NaN 0 2008
4 05153997 UNITED KINGDOM NaT 2004-06-15 bracknell RG12 1QS Private *+ LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2004
In [51]:
# delete more columns
data_tech_15 = data_tech_15.iloc[:, [2,3,4,5,6,9,10,11,12,15]]
data_tech_15.head()
Out[51]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed
0 NaT 2011-03-01 aberdeen AB25 2QT Private 0.0 0.0 0.0 0.0 0
1 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1
2 NaT 2011-05-26 london EC4N 6NP Private 0.0 0.0 0.0 0.0 0
3 NaT 2008-03-14 bristol BS1 2NT Private 0.0 0.0 0.0 0.0 0
4 NaT 2004-06-15 bracknell RG12 1QS Private 0.0 0.0 0.0 0.0 0
In [52]:
# merge postcode dataset to get ward code
merge1 = pd.merge(data_tech_15, pcodes, on='postcode', how='inner')
merge2 = pd.merge(merge1, pop, on='ward_code', how='left')
merge3 = pd.merge(merge2, qualif, on='ward_code', how='left')
merge4 = pd.merge(merge3, unemploy, on='ward_code', how='left')
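
Note that the inner join on postcode silently drops companies whose postcode is absent from the lookup table. A quick sanity check (a sketch; it assumes each postcode appears only once in pcodes, so the row counts are directly comparable):

# how many companies were lost in the postcode join?
dropped = len(data_tech_15) - len(merge1)
print(str(dropped) + ' companies had no matching postcode and were dropped')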
In [53]:
merge4.head()
Out[53]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed ward_code population density Economically Active - level_4 qualification Economically Active Economically Active - Unemployed
0 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904 5351 8294 736
1 NaT 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
2 2017-12-01 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904 5351 8294 736
3 NaT 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
4 NaT 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
In [54]:
# check type of variables
merge4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129112 entries, 0 to 129111
Data columns (total 15 columns):
dissolutiondate                                14794 non-null datetime64[ns]
incorporationdate                              129112 non-null datetime64[ns]
posttown                                       128057 non-null object
postcode                                       129112 non-null object
companycategory                                129112 non-null object
nummortcharges                                 129112 non-null float64
nummortoutstanding                             129112 non-null float64
nummortpartsatisfied                           129112 non-null float64
nummortsatisfied                               129112 non-null float64
closed                                         129112 non-null int64
ward_code                                      129112 non-null object
population density                             128677 non-null object
Economically Active - level_4 qualification    128677 non-null object
Economically Active                            128677 non-null object
Economically Active - Unemployed               128677 non-null object
dtypes: datetime64[ns](2), float64(4), int64(1), object(8)
memory usage: 15.8+ MB
In [55]:
# make last columns floats
cols = ['population density', 'Economically Active - level_4 qualification', 'Economically Active',\
        'Economically Active - Unemployed']
for col in cols:
    merge4[col] = merge4[col].astype(float)
In [56]:
# create new variables 
merge4['% economically active with level 4 qualification'] = merge4['Economically Active - level_4 qualification']/\
                                                            merge4['Economically Active']

merge4['% economically active'] = merge4['Economically Active']/\
                                  merge4['population density']
merge4['% economically active unemployed'] = merge4['Economically Active - Unemployed']/\
                                             merge4['Economically Active']
In [57]:
# attach number of companies in the same town, and number of companies in the same ward
companies_in_ward = pd.DataFrame(merge4['ward_code'].value_counts())
companies_in_ward = companies_in_ward.reset_index(drop=False)
companies_in_ward = companies_in_ward.rename(index=str, columns={"index": "ward_code",
                                                                 'ward_code':'n. of companies in ward'})



companies_in_city = pd.DataFrame(merge4['posttown'].value_counts())
companies_in_city = companies_in_city.reset_index(drop=False)
companies_in_city = companies_in_city.rename(index=str, columns={"index": "posttown",
                                                                 'posttown':'n. of companies in city'})
In [58]:
merge5 = pd.merge(merge4, companies_in_city, on='posttown', how='left')
merge6 = pd.merge(merge5, companies_in_ward, on='ward_code', how='left')
In [59]:
merge6.head()
Out[59]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed ward_code population density Economically Active - level_4 qualification Economically Active Economically Active - Unemployed % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward
0 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
1 NaT 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
2 2017-12-01 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
3 NaT 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
4 NaT 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
In [60]:
# delete columns I won't use
merge6 = merge6.iloc[:, [0,1,2,3,4,5,6,7,8,9,11,15,16,17,18,19]]
In [61]:
# add number of universities in the same city
universities_in_city = pd.DataFrame(uni['posttown'].value_counts())
universities_in_city = universities_in_city.reset_index(drop=False)
universities_in_city = universities_in_city.rename(index=str, columns={"index": "posttown",
                                                                 'posttown':'n. of universities in city'})

universities_in_city.head()
Out[61]:
posttown n. of universities in city
0 london 37
1 leeds 6
2 manchester 5
3 birmingham 5
4 liverpool 4
In [62]:
# merge with data 
merge7 = pd.merge(merge6, universities_in_city, on='posttown', how='left')
In [63]:
total = merge7
total = total.fillna(0)
total.head()
Out[63]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
0 2018-01-01 00:00:00 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
1 0 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
2 2017-12-01 00:00:00 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
3 0 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
4 0 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0

More Data Visualisation

  • I will first plot a pairplot to see the relationships among each pair of variables.
In [64]:
# keep the response and the numeric features for the pairplot
total_num = total.iloc[:,9:]
total_num = total_num.dropna()
total_num.head()
Out[64]:
closed population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
1 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
2 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
3 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
4 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
In [65]:
sns.pairplot(total_num, kind='scatter', hue= 'closed');
  • Next I will plot density plots to compare the distribution of each variable for companies that are still open vs. companies that are closed. This will make immediate differences in the distributions visible.
In [66]:
# create functions to recode 'closed' between 0/1 and 'open'/'closed' labels
def code(row):
    value = 'closed'
    if row['closed'] == 0:
        value = 'open'
    return value

def code_reverse(row):
    value = 1
    if row['closed'] == 'open':
        value = 0
    return value
In [68]:
%%time
total['closed'] = total.apply(code, axis = 1)
CPU times: user 12.7 s, sys: 270 ms, total: 12.9 s
Wall time: 14.7 s
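
The row-wise apply above takes over ten seconds; a vectorized map (a sketch producing the same labels from the original 0/1 column) is near-instant:

# map 0/1 to labels in one vectorized step
total['closed'] = total['closed'].map({0: 'open', 1: 'closed'})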
In [70]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active unemployed'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 12}, title = '')
plt.title('Lower Unemployment Areas tend to host companies that are still open')
plt.xlabel('unemployment')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[70]:
Text(0,0.5,'Density')

The plot shows that areas where companies are still open have a lower percentage of unemployment.

In [71]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('Areas where there is a greater % of economically active population tend to host open companies')
plt.xlabel('Economically active')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[71]:
Text(0,0.5,'Density')
  • The distributions are very similar.
  • The spike at around 0.6 suggests that areas where companies are still open also have a higher percentage of economically active people contributing to the workforce.
In [72]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active with level 4 qualification'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('areas where there is a lower percentage of population with university degrees tend to host closed companies')
plt.xlabel('Higher level education')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[72]:
Text(0,0.5,'Density')
  • The distribution shows that areas with a lower percentage of people holding a higher education degree (around 0.25) tend to host companies that have closed.
  • Conversely, in areas where about 75% of the economically active population holds a higher education degree, the density of open companies is higher than that of closed ones.

Machine Learning

Logistic Regression

Step 1: scale data

I am going to scale the data because the variables in the dataset are on different scales. For example, some are percentages, while others are counts (e.g. number of mortgages).

In [75]:
# scale data frame 
# columns to scale
toscale = ['nummortcharges', 'nummortoutstanding', \
'nummortpartsatisfied', 'nummortsatisfied', \
'population density','% economically active with level 4 qualification',\
'% economically active', '% economically active unemployed',\
'n. of companies in city', 'n. of companies in ward',\
'n. of universities in city']

#scale
scaled = pd.DataFrame(scale(total[toscale]),
                   index=total.index,
                   columns=toscale)
In [76]:
# check that each column now has (approximately) zero mean and unit standard deviation
scaled.describe().head(3)
Out[76]:
nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
count 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05
mean 3.874327e-17 9.685817e-18 -6.879131e-19 1.144687e-17 -9.179513e-17 2.888135e-16 5.608969e-16 -3.486894e-16 1.375826e-17 -6.339808e-17 5.354716e-17
std 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00
In [77]:
# add variables to scaled dataset 
scaled['closed'] = total['closed']
scaled['companycategory'] = total['companycategory']
In [78]:
# create dummy variables 
cat_vars=['companycategory']

for var in cat_vars:
    cat_list='var'+'_'+var
    cat_list = pd.get_dummies(scaled[var], prefix=var)
    data1=scaled.join(cat_list)
    data1 = data1.drop(var,1)
    scaled=data1

data_vars=scaled.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
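
For reference, the dummy-variable loop above collapses into a single pandas call (a sketch; get_dummies drops the original column and appends prefixed indicator columns, giving the same companycategory_* names):

# one-line equivalent of the dummy-variable loop
scaled = pd.get_dummies(scaled, columns=['companycategory'], prefix='companycategory')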
In [79]:
# code the response variable back to 0 and 1 for the regression
scaled['closed'] = scaled.apply(code_reverse, axis = 1)

Step 2: Implement model and see which variables are significant and which ones aren't

In [80]:
cols = ['nummortcharges', 'nummortoutstanding', 'nummortpartsatisfied',
       'nummortsatisfied', 'population density',
       '% economically active with level 4 qualification',
       '% economically active', '% economically active unemployed',
       'n. of companies in city', 'n. of companies in ward',
       'n. of universities in city',
       'companycategory_Community Interest Company', 'companycategory_Private',
       'companycategory_Public Limited Company']

X = scaled[cols]
y = scaled['closed']

logit_model=sm.Logit(y,X.astype(float))
result=logit_model.fit()
result.summary2()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.354193
         Iterations: 35
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Out[80]:
Model: Logit No. Iterations: 35.0000
Dependent Variable: closed Pseudo R-squared: 0.005
Date: 2018-09-25 18:25 AIC: 91489.1146
No. Observations: 129112 BIC: 91625.8727
Df Model: 13 Log-Likelihood: -45731.
Df Residuals: 129098 LL-Null: -45963.
Converged: 0.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
nummortcharges 0.6645 0.3749 1.7728 0.0763 -0.0702 1.3992
nummortoutstanding -0.6325 0.2761 -2.2908 0.0220 -1.1736 -0.0913
nummortpartsatisfied -0.0744 1485.9425 -0.0001 1.0000 -2912.4681 2912.3193
nummortsatisfied -0.3745 0.1952 -1.9186 0.0550 -0.7570 0.0081
population density -0.0274 0.0110 -2.4858 0.0129 -0.0489 -0.0058
% economically active with level 4 qualification -0.0528 0.0116 -4.5402 0.0000 -0.0756 -0.0300
% economically active 0.0549 0.0102 5.3803 0.0000 0.0349 0.0749
% economically active unemployed 0.1267 0.0117 10.8227 0.0000 0.1038 0.1497
n. of companies in city -0.4640 0.1370 -3.3862 0.0007 -0.7325 -0.1954
n. of companies in ward -0.0152 0.0113 -1.3377 0.1810 -0.0374 0.0071
n. of universities in city 0.4980 0.1363 3.6538 0.0003 0.2309 0.7652
companycategory_Community Interest Company -1.4441 4.1463 -0.3483 0.7276 -9.5707 6.6824
companycategory_Private -2.0613 4.1354 -0.4984 0.6182 -10.1666 6.0440
companycategory_Public Limited Company -2.4950 4.2638 -0.5852 0.5584 -10.8519 5.8619

The only significant variables are:

  • Number of mortgages outstanding
  • population density
  • % of economically active people with higher education degrees
  • % of economically active people over the whole population
  • % of economically active people that are unemployed
  • number of other tech companies in the same city
  • number of universities in the city
In [81]:
# keep only significant variables 
sig_cols = ['nummortoutstanding', 'population density',
       '% economically active with level 4 qualification',
       '% economically active', '% economically active unemployed',
       'n. of companies in city',
       'n. of universities in city',]

X = scaled[sig_cols]
y = scaled['closed']

logit_model=sm.Logit(y,X.astype(float))
result1=logit_model.fit()
result1.summary2()
Optimization terminated successfully.
         Current function value: 0.692488
         Iterations 4
Out[81]:
Model: Logit No. Iterations: 4.0000
Dependent Variable: closed Pseudo R-squared: -0.945
Date: 2018-09-25 18:25 AIC: 178831.0761
No. Observations: 129112 BIC: 178899.4552
Df Model: 6 Log-Likelihood: -89409.
Df Residuals: 129105 LL-Null: -45963.
Converged: 1.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
nummortoutstanding -0.0330 0.0058 -5.6821 0.0000 -0.0444 -0.0216
population density -0.0126 0.0069 -1.8199 0.0688 -0.0261 0.0010
% economically active with level 4 qualification -0.0210 0.0071 -2.9643 0.0030 -0.0350 -0.0071
% economically active 0.0196 0.0061 3.1922 0.0014 0.0076 0.0316
% economically active unemployed 0.0548 0.0077 7.1405 0.0000 0.0398 0.0699
n. of companies in city -0.2110 0.0749 -2.8186 0.0048 -0.3577 -0.0643
n. of universities in city 0.2240 0.0745 3.0047 0.0027 0.0779 0.3701
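
One caveat: sm.Logit(y, X) fits without an intercept unless a constant column is added explicitly, which is a likely cause of the negative pseudo R-squared above (the model fits worse than the intercept-only null). A refit sketch with a constant added:

# refit the reduced model with an intercept term
X_const = sm.add_constant(scaled[sig_cols].astype(float))
result1c = sm.Logit(y, X_const).fit()
print(result1c.summary2())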

Step 3: Train the regression

In [82]:
# split data into training set and testing set 
target = ['closed']
x_train, x_test, y_train, y_test = train_test_split(scaled[sig_cols],
                                                    scaled[target], 
                                                    test_size=0.20, random_state=0)
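
Note that the features were scaled on the full dataset before splitting, which lets test-set statistics leak into the scaler. A leakage-free pattern (a sketch, assuming the unscaled feature columns were kept around) fits the scaler on the training rows only:

from sklearn.preprocessing import StandardScaler

# learn mean/std from the training rows only, then reuse them on the test rows
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)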
In [83]:
# baseline model
# I am adding class_weight='balanced' because my dataset is not balanced.
logisticRegr = LogisticRegression(random_state=0, class_weight='balanced')
m1 = logisticRegr.fit(x_train, y_train.values.ravel())
In [84]:
# make prediction 
predictions_log = logisticRegr.predict(x_test)

Step 4: Test Accuracy of baseline model

Accuracy is defined as:

correct predictions / total number of data points

In [91]:
# accuracy 
score = logisticRegr.score(x_test, y_test)
print('The accuracy of the model is ' + str(score*100)+ ' %')
The accuracy of the model is 54.536653371025835 %

Step 5: Show confusion matrix

A confusion matrix shows the number of observations assigned to the correct class and the number assigned to the wrong class.

In [86]:
# confusion matrix
cm = metrics.confusion_matrix(y_test, predictions_log)

plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
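
Given the class imbalance (roughly 12% closed), overall accuracy hides per-class behaviour. A per-class breakdown (a sketch using sklearn's classification_report) is more informative:

from sklearn.metrics import classification_report

# precision, recall and F1 for each class separately
print(classification_report(y_test.values.ravel(), predictions_log,
                            target_names=['open', 'closed']))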

Step 6: Improve the model with random forest

In [87]:
RF = RandomForestRegressor().fit(x_train, y_train.values.ravel())
In [88]:
predictions_RF = RF.predict(x_test)
In [89]:
score_RF = RF.score(x_test, y_test)
print(score_RF)

Note that the score of a RandomForestRegressor is R-squared rather than classification accuracy, so it is not directly comparable with the logistic regression accuracy above. A like-for-like comparison needs a classifier, sketched below.
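
A minimal sketch of the classification counterpart (hyperparameters left at their defaults; class_weight='balanced' mirrors the logistic regression setup above):

from sklearn.ensemble import RandomForestClassifier

# random forest classifier with balanced class weights, matching the logistic setup
rf_clf = RandomForestClassifier(random_state=0, class_weight='balanced')
rf_clf.fit(x_train, y_train.values.ravel())
print('Random forest accuracy: {:.2%}'.format(rf_clf.score(x_test, y_test.values.ravel())))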