Prediction of Business Success/Failure From Business Regional Characteristics (Supervised Learning)

Setting up a new business is an arduous task. Entrepreneurs often embark on new ventures with high hopes of success; however, research shows that a considerable portion of new firms exit the market soon after entering it (Fritsch et al., 2006). Understanding why this happens is crucial to the economy's stability and health, since business bankruptcy is costly not only to business owners and investors but also to the community as a whole (Pompe and Bilderbeek, 2005).

Previous success/failure modelling studies have focused on using firm-specific characteristics to predict whether a business fails or succeeds (Lussier and Halabi, 2010). These characteristics include starting capital and the founder's industry and management experience, education and age, among many others. Although these factors are important in determining the success or failure of a business, Fritsch et al. (2006) argue that a limitation of such studies is that they do not account for the regional dimension. In fact, Fritsch et al. (2006) show that regional factors play an important role in predicting the survival of new businesses. This is in line with the concept of agglomeration economies, according to which cities and clusters of activity boost the productivity of firms located within them (Duranton and Kerr, 2015). Examples of industry clusters include Silicon Valley in the USA for tech firms, or the Sheffield (U.K.) area for cutlery manufacturing (Duranton and Overman, 2004).

The tendency of industries to agglomerate has been studied extensively by economists and geographers, who believe a firm's localisation to be a proxy for innovation and high performance (Claver-Cortès et al., 2015). Claver-Cortès et al. (2015) explain that a firm's high performance and innovation rate stem from two factors. The first is the firm's dynamic capabilities: the organisational skills that allow the business to grow, adapt, create internal and external resources, and maintain a competitive advantage despite a changing business environment. The second is the firm's absorptive capacity: the ability of the firm to utilise external knowledge to create a competitive advantage over other firms. From this definition it follows that absorptive capacity benefits from a firm's localisation in dynamic clusters.

The agglomeration literature suggests that the concentration of economic activity generates different outcomes for firms. Appold (1995) assumes that a firm's profits are positively correlated with the number of firms located near it. In support of this, Fritsch et al. (2006) affirm that agglomeration can be beneficial due to a firm's proximity to research institutions such as universities, to large pools of customers, and to other companies in the same industry, which facilitates knowledge spillovers. More recent studies found, however, that agglomeration can also have negative effects on profits because of higher competition (Arikan and Schilling, 2010). Regarding competition, Fritsch et al. (2006) argue that the intensity of competition affects a firm's survival chances: competition is beneficial until it reaches a certain threshold, after which it becomes a negative factor because of excessive competition for locations, employees, resources, customers and more. The authors also find the unemployment rate relevant in predicting firms' survival, for three reasons. First, high unemployment can indicate a region's low growth rate, which might affect a firm's performance negatively or positively. Second, high unemployment can also mean greater availability of labour, which should contribute to a firm's success. Last, high unemployment can lead unemployed people to create new businesses, which in turn affects a firm's probability of success or failure.

Given the background illustrated above, this report aims to create a machine learning model able to predict the success or failure of new tech businesses in England. The factors taken into consideration are:

  • Regional characteristics, including:
      • Population density
      • Density of firms in the same sector
      • Unemployment rate
      • Availability of research institutions
  • Firm-specific characteristics (limited by data availability)

The methods used to answer this question are logistic regression and a random forest algorithm.

In [1]:
# Import packages

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import pysal as ps
from sklearn import cluster, metrics
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
import statsmodels.api as sm
from collections import Counter
import collections
import itertools
from string import ascii_letters
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud
from math import log, sqrt
from sklearn.model_selection import train_test_split
import json
from pandas.io.json import json_normalize
import ast
import altair as alt
from sklearn.linear_model import LogisticRegression

%matplotlib inline
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [2]:
# read business data
business_data = pd.read_csv('./business_census2017.csv', low_memory=False)
In [3]:
# copy data so the original frame stays untouched
bus_data = business_data.copy()
In [4]:
# explore dataset
bus_data.head(3)
Out[4]:
id companynumber countryoforigin dissolutiondate incorporationdate accountingrefday accountingrefmonth account_nextduedate account_lastmadeupdate accountscategory ... nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode sicdescription return_nextduedate return_lastmadeupdate confirmation_nextduedate confirmation_lastmadeupdate
0 1 07382019 UNITED KINGDOM 2018-03-01 2010-09-21 30.0 9.0 2017-06-30 2015-09-30 DORMANT ... 0.0 0.0 0.0 0.0 74100.0 specialised design activities 2016-10-19 2015-09-21 2017-10-05 2016-09-21
1 2 04753368 UNITED KINGDOM 2017-04-01 2003-05-03 30.0 4.0 2017-01-31 2014-10-31 TOTAL EXEMPTION FULL ... 0.0 0.0 0.0 0.0 82990.0 Other business support service activities n.e.c. 2017-05-31 2016-05-03 2018-05-17 NaN
2 3 SC421617 UNITED KINGDOM NaN 2012-04-11 30.0 3.0 2017-12-30 2016-03-30 TOTAL EXEMPTION SMALL ... 0.0 0.0 0.0 0.0 70229.0 Management consultancy activities other than f... 2017-05-09 2016-04-11 2018-04-25 NaN

3 rows × 33 columns

In [5]:
# explore variables
bus_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4077038 entries, 0 to 4077037
Data columns (total 33 columns):
id                             int64
companynumber                  object
countryoforigin                object
dissolutiondate                object
incorporationdate              object
accountingrefday               float64
accountingrefmonth             float64
account_nextduedate            object
account_lastmadeupdate         object
accountscategory               object
careof                         object
pobox                          object
addressline1                   object
addressline2                   object
posttown                       object
county                         object
country                        object
postcode                       object
companycategory                object
companyname                    object
numgenpartners                 float64
numlimpartners                 float64
companystatus                  object
nummortcharges                 float64
nummortoutstanding             float64
nummortpartsatisfied           float64
nummortsatisfied               float64
siccode                        float64
sicdescription                 object
return_nextduedate             object
return_lastmadeupdate          object
confirmation_nextduedate       object
confirmation_lastmadeupdate    object
dtypes: float64(9), int64(1), object(23)
memory usage: 1.0+ GB
In [6]:
# how many observations and how many variables? 
bus_data.shape
Out[6]:
(4077038, 33)

Data Pre-Processing

In [7]:
# subset data to keep only variables of interest
bus_data = bus_data.iloc[:, [1,2,3,4,14,17,18,19,22,23,24,25,26,27]]
In [8]:
# check variables left
bus_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4077038 entries, 0 to 4077037
Data columns (total 14 columns):
companynumber           object
countryoforigin         object
dissolutiondate         object
incorporationdate       object
posttown                object
postcode                object
companycategory         object
companyname             object
companystatus           object
nummortcharges          float64
nummortoutstanding      float64
nummortpartsatisfied    float64
nummortsatisfied        float64
siccode                 float64
dtypes: float64(5), object(9)
memory usage: 435.5+ MB

Filter the data set to include tech companies opened within the last 15 years

In [9]:
# keep only tech companies in the dataset by filtering on SIC code
sic_codes = [62020, 62012, 62090]
data_tech = bus_data.loc[bus_data['siccode'].isin(sic_codes)].copy()

SIC codes I am keeping:

62020 = Information technology consultancy activities

62012 = Business and domestic software development

62090 = Other information technology service activities

In [10]:
data_tech.shape
Out[10]:
(208455, 14)
In [11]:
# select all companies opened within the last 15 years. 

# convert date columns to date 
columns = ['dissolutiondate', 'incorporationdate']

for col in columns:
    data_tech[col] = pd.to_datetime(data_tech[col], format='%Y-%m-%d')

# select all companies opened after 1 January 2002
data_tech_15 = data_tech.loc[data_tech['incorporationdate'] >= '2002-01-01'].copy()
In [12]:
# make city lower case
data_tech_15['posttown'] = data_tech_15['posttown'].str.lower()
In [13]:
# reset index
data_tech_15 = data_tech_15.reset_index(drop=True)
In [14]:
data_tech_15.head()
Out[14]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private Limited Company 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private Limited Company (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0
2 07649435 UNITED KINGDOM NaT 2011-05-26 london EC4N 6NP Private Limited Company JUJU MEDIA LIMITED Active 0.0 0.0 0.0 0.0 62090.0
3 06534666 UNITED KINGDOM NaT 2008-03-14 bristol BS1 2NT Private Limited Company (MID) DOTCOM LIMITED Active 0.0 0.0 0.0 0.0 62020.0
4 05153997 UNITED KINGDOM NaT 2004-06-15 bracknell RG12 1QS Private Limited Company *+ LTD Active 0.0 0.0 0.0 0.0 62090.0

Create new variables

  • Company Life
In [15]:
# subtract incorporation date from dissolution date to see how many days the company was open
data_tech_15['companylife'] = data_tech_15['dissolutiondate'] - data_tech_15['incorporationdate']
In [16]:
# convert the timedelta to days, then divide by 365 to express company life in years
data_tech_15['companylife'] = data_tech_15['companylife'].dt.days / 365
In [17]:
data_tech_15.head(2)
Out[17]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode companylife
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private Limited Company 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private Limited Company (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0 10.843836
  • Closed: a binary variable

    closed = 0: the company is still open
    closed = 1: the company is closed

This will be used as the response variable for machine learning.

In [18]:
%%time
# create a function that assigns 0 to each row unless the company is closed
def closed(row):
    value = 0
    if row['companylife'] > 0:
        value = 1
    return value

data_tech_15['closed'] = data_tech_15.apply(closed, axis=1)
CPU times: user 19.5 s, sys: 480 ms, total: 20 s
Wall time: 24.7 s
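
The row-wise apply above takes about 20 seconds; a vectorized equivalent (a sketch, assuming a recorded dissolution date always means the company has closed) produces the same column in milliseconds:

# vectorized alternative: companies with a dissolution date are coded 1 (closed)
data_tech_15['closed'] = data_tech_15['dissolutiondate'].notna().astype(int)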

Data Exploration

  • How many companies are still open, and how many are closed?
In [19]:
closed = data_tech_15['closed'].value_counts()

# rename the index values so the categories show in the plot
closed = closed.rename({0: 'open', 1: 'closed'})
# plot graph 

closed.plot.barh();
  • What is the percentage of open companies and closed companies?
In [20]:
percentage = closed[1] / (closed[0] + closed[1])

print('The % of closed companies in the dataset is: ' + str(percentage * 100) + ' %')
The % of closed companies in the dataset is: 11.763522258839908 %
  • Are companies founded in specific years more likely to be closed?
In [21]:
# create column containing year only 
data_tech_15['year'] = data_tech_15['incorporationdate'].dt.year

# create list with unique years
years = sorted((data_tech_15['year'].unique()) , key= int) 

# create empty list 
percentage_closed = []

# for each year 
for i in years:
    # filter data by year 
    datasplit = data_tech_15[data_tech_15['year'] == i]
    # count frequency of closed and open companies by year 
    freq = datasplit['closed'].value_counts()
    closed = freq[1]
    opened = freq[0]
    # calculate the percentage of closed companies 
    percentage = (closed / (opened + closed))*100
    # append percentages on empty list 
    percentage_closed.append(percentage)
    
# create a dataframe containing years and percentage of closed companies for that year 
percentage_cl = pd.DataFrame({
    'incorporation year': years,
    '% of closed companies': percentage_closed
    })

# Plot the data
percentage_cl.plot.line(x = 'incorporation year', y = '% of closed companies', title = 'Percentage of closed companies by incorporation year, 2002-2017')
         
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c26fce630>
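
The same per-year percentages can be computed without an explicit loop (a sketch over the same frame; since 'closed' is 0/1 at this point, its mean is the closed share):

# groupby equivalent of the loop above
percentage_cl = (data_tech_15.groupby('year')['closed']
                 .mean().mul(100)
                 .rename('% of closed companies')
                 .reset_index()
                 .rename(columns={'year': 'incorporation year'}))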
  • What are the companies' countries of origin?
In [22]:
# get unique country of origin values
origin = data_tech_15['countryoforigin'].value_counts()
In [23]:
origin.plot.barh()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c215cdac8>

All companies are from the United Kingdom.

  • What types of companies are in the data?
In [24]:
typ = data_tech_15['companycategory'].value_counts()
typ
Out[24]:
Private Limited Company                                                                      183428
PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)                           188
Community Interest Company                                                                      102
PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption)        87
Public Limited Company                                                                           65
Private Unlimited Company                                                                        11
Private Unlimited                                                                                 1
Name: companycategory, dtype: int64
In [25]:
# simplify the categories: collapse all private company types into 'Private'
privat = ['Private Limited Company','PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)',
         "PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption)",
         'Private Unlimited Company','Private Unlimited']

data_tech_15.loc[data_tech_15['companycategory'].isin(privat), 'companycategory'] = 'Private'
In [26]:
# recompute the counts after simplification, then plot
typ = data_tech_15['companycategory'].value_counts()
typ.plot.barh()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2549a7b8>

Most companies are Private Limited Companies.

Let's look at whether open and closed companies are distributed similarly across company categories.

In [27]:
closed_type = data_tech_15[data_tech_15['closed']==1]
closed_count = closed_type['companycategory'].value_counts()

closed_count
Out[27]:
Private                       21608
Community Interest Company       18
Public Limited Company            5
Name: companycategory, dtype: int64
In [28]:
closed_count.plot.barh()
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c253d9f28>
In [29]:
open_type = data_tech_15[data_tech_15['closed']==0]
open_count = open_type['companycategory'].value_counts()

open_count
Out[29]:
Private                       162107
Community Interest Company        84
Public Limited Company            60
Name: companycategory, dtype: int64
In [30]:
open_count.plot.barh()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1c863710>

Data Linkage

  • At this point I attach regional characteristics to my dataset.

  • This information is taken from the 2011 census.

  • Postcode Data
In [31]:
# read data
postcodes = pd.read_csv('./postcodes.csv', low_memory=False)
In [32]:
# make a copy of the dataset 
pcodes = postcodes
# select variables of interest (postcode and ward code)
pcodes = pcodes.iloc[:,[2,6]]
# rename columns
pcodes = pcodes.rename(index=str, columns={"pcds": "postcode", "wd11cd": "ward_code"})
pcodes.head()
Out[32]:
postcode ward_code
0 AL1 1AG E05004802
1 AL1 1AJ E05004802
2 AL1 1AR E05004789
3 AL1 1AS E05004799
4 AL1 1BH E05004802
  • Qualifications Data

Qualifications metadata taken from the UK Data Service website.

  • No qualifications: No academic or professional qualifications.

  • Level 1 qualifications: 1-4 O Levels/CSE/GCSEs (any grades), Entry Level, Foundation Diploma, NVQ level 1, Foundation GNVQ, Basic/Essential Skills.

  • Level 2 qualifications: 5+ O Level (Passes)/CSEs (Grade 1)/GCSEs (Grades A*-C), School Certificate, 1 A Level/ 2-3 AS Levels/VCEs, Intermediate/Higher Diploma, Welsh Baccalaureate Intermediate Diploma, NVQ level 2, Intermediate GNVQ, City and Guilds Craft, BTEC First/General Diploma, RSA Diploma Apprenticeship.

  • Level 3 qualifications: 2+ A Levels/VCEs, 4+ AS Levels, Higher School Certificate, Progression/Advanced Diploma, Welsh Baccalaureate Advanced Diploma, NVQ Level 3; Advanced GNVQ, City and Guilds Advanced Craft, ONC, OND, BTEC National, RSA Advanced Diploma.

  • Level 4+ qualifications: Degree (for example BA, BSc), Higher Degree (for example MA, PhD, PGCE), NVQ Level 4-5, HNC, HND, RSA Higher Diploma, BTEC Higher level, Foundation degree (NI), Professional qualifications (for example teaching, nursing, accountancy).

  • Other qualifications: Vocational/Work-related Qualifications, Foreign Qualifications (Not stated/ level unknown).

In [33]:
# read data
qualifications = pd.read_csv('./qualifications.csv')
In [34]:
# copy dataset
qualif = qualifications

# subset 
qualif = qualif.iloc[:, [1,9]]

# rename
qualif = qualif.rename(index=str, columns={"GEO_CODE": "ward_code", "F192": "Economically Active - level_4 qualification"})

# delete first row
qualif = qualif.iloc[1:]

qualif.head()
Out[34]:
ward_code Economically Active - level_4 qualification
1 95AA01 977
2 95AA02 230
3 95AA03 167
4 95AA04 634
5 95AA05 546
  • Population Density Data
In [35]:
# read data
population = pd.read_csv('./residents.csv')
In [36]:
# copy data 
pop = population
In [37]:
pop = pop.iloc[:, [1,5]]
pop = pop.iloc[1:]
pop = pop.rename(index=str, columns={"GEO_CODE": "ward_code", "F2384": "population density"})
pop.head()
Out[37]:
ward_code population density
1 95AA01 4444
2 95AA02 1734
3 95AA03 1822
4 95AA04 3013
5 95AA05 2701
  • Unemployment Data

Economic activity relates to whether or not a person aged 16 and over was working or looking for work in the week before the census. Rather than a simple indicator of whether or not someone was currently in employment, it provides a measure of whether or not a person was an active participant in the labour market. A person's economic activity is derived from their 'activity last week', an indicator of their status or availability for employment - whether employed, unemployed, or their status if not employed and not seeking employment.

Additional information in the economic activity classification is derived from the number of hours a person works and their type of employment - whether employed or self-employed. The census concept of economic activity is compatible with the standard for economic status defined by the International Labour Organisation (ILO); it is one of a number of definitions used internationally to produce accurate and comparable statistics on employment, unemployment and economic status.

In [38]:
# read data
unemployment = pd.read_csv('./unemploy.csv')
In [39]:
# copy data 
unemploy = unemployment
In [40]:
unemploy = unemploy.iloc[:, [1,5,6]]
unemploy = unemploy.iloc[1:]
unemploy = unemploy.rename(index=str, columns={"GEO_CODE": "ward_code", 
                                     "F244": "Economically Active",
                                     'F248': 'Economically Active - Unemployed'})
unemploy.head()
Out[40]:
ward_code Economically Active Economically Active - Unemployed
1 95AA01 2473 107
2 95AA02 892 57
3 95AA03 840 57
4 95AA04 1660 72
5 95AA05 1362 77
  • Universities Data
In [41]:
# read data 
universities = pd.read_csv('./Harris.csv')
In [42]:
# copy 
uni = universities
In [43]:
uni = uni.iloc[:, [4,6,13,15]]
uni = uni.rename(index=str, columns={"Unnamed: 4": "University Name", 
                                     'Unnamed: 6': 'University type group',
                                     'Unnamed: 13': 'posttown',
                                     'Unnamed: 15': 'postcode'})
uni = uni.iloc[1:]


uni.head()
Out[43]:
University Name University type group posttown postcode
1 Sir John Cass's Foundation Primary School LA maintained schools London EC3A 5DE
2 City of London School for Girls Independent schools London EC2Y 8BB
3 St Paul's Cathedral School Independent schools London EC4M 9AD
4 City of London School Independent schools London EC4V 3AL
5 The Charterhouse Square School Independent schools London EC1M 6EA
In [44]:
# check the value of type group. 
uni['University type group'].unique()
Out[44]:
array(['LA maintained schools', 'Independent schools', 'Universities',
       'Free Schools', 'Academies', 'Colleges', 'Special schools',
       'Other types'], dtype=object)
In [47]:
# filter universities
uni = uni.loc[uni['University type group']== 'Universities']
In [48]:
# make posttown lowercase 
uni['posttown'] = uni['posttown'].str.lower()
uni = uni.reset_index(drop = True)
In [49]:
uni.head()
Out[49]:
University Name University type group posttown postcode
0 London Metropolitan University Universities london N7 8DB
1 Guildhall School of Music and Drama Universities london EC2Y 8DT
2 The Royal Veterinary College Universities london NW1 0TU
3 Central School of Speech and Drama Universities london NW3 3HY
4 University College London Universities london WC1E 6BT
In [50]:
# recall business dataset
data_tech_15.head()
Out[50]:
companynumber countryoforigin dissolutiondate incorporationdate posttown postcode companycategory companyname companystatus nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied siccode companylife closed year
0 SC394469 UNITED KINGDOM NaT 2011-03-01 aberdeen AB25 2QT Private 'RIMS ENTERPRISE SOLUTIONS LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2011
1 06134719 UNITED KINGDOM 2018-01-01 2007-03-02 london EC2A 4NE Private (BRACKET) DESIGN LTD. Active 0.0 0.0 0.0 0.0 62020.0 10.843836 1 2007
2 07649435 UNITED KINGDOM NaT 2011-05-26 london EC4N 6NP Private JUJU MEDIA LIMITED Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2011
3 06534666 UNITED KINGDOM NaT 2008-03-14 bristol BS1 2NT Private (MID) DOTCOM LIMITED Active 0.0 0.0 0.0 0.0 62020.0 NaN 0 2008
4 05153997 UNITED KINGDOM NaT 2004-06-15 bracknell RG12 1QS Private *+ LTD Active 0.0 0.0 0.0 0.0 62090.0 NaN 0 2004
In [51]:
# delete more columns
data_tech_15 = data_tech_15.iloc[:, [2,3,4,5,6,9,10,11,12,15]]
data_tech_15.head()
Out[51]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed
0 NaT 2011-03-01 aberdeen AB25 2QT Private 0.0 0.0 0.0 0.0 0
1 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1
2 NaT 2011-05-26 london EC4N 6NP Private 0.0 0.0 0.0 0.0 0
3 NaT 2008-03-14 bristol BS1 2NT Private 0.0 0.0 0.0 0.0 0
4 NaT 2004-06-15 bracknell RG12 1QS Private 0.0 0.0 0.0 0.0 0
In [52]:
# merge postcode dataset to get ward code
merge1 = pd.merge(data_tech_15, pcodes, on='postcode', how='inner')
merge2 = pd.merge(merge1, pop, on='ward_code', how='left')
merge3 = pd.merge(merge2, qualif, on='ward_code', how='left')
merge4 = pd.merge(merge3, unemploy, on='ward_code', how='left')
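
Note that the inner join on postcode silently drops companies whose postcode is absent from the lookup table. A quick sanity check (a sketch; it assumes each postcode appears only once in pcodes, so the row counts are directly comparable):

# how many companies were lost in the postcode join?
dropped = len(data_tech_15) - len(merge1)
print(str(dropped) + ' companies had no matching postcode and were dropped')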
In [53]:
merge4.head()
Out[53]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed ward_code population density Economically Active - level_4 qualification Economically Active Economically Active - Unemployed
0 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904 5351 8294 736
1 NaT 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
2 2017-12-01 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904 5351 8294 736
3 NaT 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
4 NaT 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904 5351 8294 736
In [54]:
# check type of variables
merge4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129112 entries, 0 to 129111
Data columns (total 15 columns):
dissolutiondate                                14794 non-null datetime64[ns]
incorporationdate                              129112 non-null datetime64[ns]
posttown                                       128057 non-null object
postcode                                       129112 non-null object
companycategory                                129112 non-null object
nummortcharges                                 129112 non-null float64
nummortoutstanding                             129112 non-null float64
nummortpartsatisfied                           129112 non-null float64
nummortsatisfied                               129112 non-null float64
closed                                         129112 non-null int64
ward_code                                      129112 non-null object
population density                             128677 non-null object
Economically Active - level_4 qualification    128677 non-null object
Economically Active                            128677 non-null object
Economically Active - Unemployed               128677 non-null object
dtypes: datetime64[ns](2), float64(4), int64(1), object(8)
memory usage: 15.8+ MB
In [55]:
# make last columns floats
cols = ['population density', 'Economically Active - level_4 qualification', 'Economically Active',\
        'Economically Active - Unemployed']
for col in cols:
    merge4[col] = merge4[col].astype(float)
In [56]:
# create new variables 
merge4['% economically active with level 4 qualification'] = merge4['Economically Active - level_4 qualification']/\
                                                            merge4['Economically Active']

merge4['% economically active'] = merge4['Economically Active']/\
                                  merge4['population density']
merge4['% economically active unemployed'] = merge4['Economically Active - Unemployed']/\
                                             merge4['Economically Active']
In [57]:
# attach number of companies in the same town, and number of companies in the same ward
companies_in_ward = pd.DataFrame(merge4['ward_code'].value_counts())
companies_in_ward = companies_in_ward.reset_index(drop=False)
companies_in_ward = companies_in_ward.rename(index=str, columns={"index": "ward_code",
                                                                 'ward_code':'n. of companies in ward'})



companies_in_city = pd.DataFrame(merge4['posttown'].value_counts())
companies_in_city = companies_in_city.reset_index(drop=False)
companies_in_city = companies_in_city.rename(index=str, columns={"index": "posttown",
                                                                 'posttown':'n. of companies in city'})
In [58]:
merge5 = pd.merge(merge4, companies_in_city, on='posttown', how='left')
merge6 = pd.merge(merge5, companies_in_ward, on='ward_code', how='left')
In [59]:
merge6.head()
Out[59]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed ward_code population density Economically Active - level_4 qualification Economically Active Economically Active - Unemployed % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward
0 2018-01-01 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
1 NaT 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
2 2017-12-01 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
3 NaT 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
4 NaT 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 E05000239 13904.0 5351.0 8294.0 736.0 0.645165 0.596519 0.088739 23260.0 639
In [60]:
# delete columns I won't use
merge6 = merge6.iloc[:, [0,1,2,3,4,5,6,7,8,9,11,15,16,17,18,19]]
In [61]:
# add number of universities in the same city
universities_in_city = pd.DataFrame(uni['posttown'].value_counts())
universities_in_city = universities_in_city.reset_index(drop=False)
universities_in_city = universities_in_city.rename(index=str, columns={"index": "posttown",
                                                                 'posttown':'n. of universities in city'})

universities_in_city.head()
Out[61]:
posttown n. of universities in city
0 london 37
1 leeds 6
2 manchester 5
3 birmingham 5
4 liverpool 4
In [62]:
# merge with data 
merge7 = pd.merge(merge6, universities_in_city, on='posttown', how='left')
In [63]:
total = merge7
total = total.fillna(0)
total.head()
Out[63]:
dissolutiondate incorporationdate posttown postcode companycategory nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied closed population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
0 2018-01-01 00:00:00 2007-03-02 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
1 0 2011-08-19 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
2 2017-12-01 00:00:00 2008-10-29 london EC2A 4NE Private 0.0 0.0 0.0 0.0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
3 0 2007-04-11 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
4 0 2010-07-09 london EC2A 4NE Private 0.0 0.0 0.0 0.0 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0

More Data Visualisation

  • I will first plot a pairplot to see the relationships among each pair of variables.
In [64]:
# keep the response and the numeric features for the pairplot
total_num = total.iloc[:,9:]
total_num = total_num.dropna()
total_num.head()
Out[64]:
closed population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
0 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
1 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
2 1 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
3 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
4 0 13904.0 0.645165 0.596519 0.088739 23260.0 639 37.0
In [65]:
sns.pairplot(total_num, kind='scatter', hue= 'closed');
  • Next I will plot density plots to compare the distribution of each variable for companies that are still open vs. companies that are closed. This will make immediate differences in the distributions visible.
In [66]:
# create functions to recode 'closed' between 0/1 and 'open'/'closed' labels
def code(row):
    value = 'closed'
    if row['closed'] == 0:
        value = 'open'
    return value

def code_reverse(row):
    value = 1
    if row['closed'] == 'open':
        value = 0
    return value
In [68]:
%%time
total['closed'] = total.apply(code, axis = 1)
CPU times: user 12.7 s, sys: 270 ms, total: 12.9 s
Wall time: 14.7 s
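
The row-wise apply above takes over ten seconds; a vectorized map (a sketch producing the same labels from the original 0/1 column) is near-instant:

# map 0/1 to labels in one vectorized step
total['closed'] = total['closed'].map({0: 'open', 1: 'closed'})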
In [70]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active unemployed'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 12}, title = '')
plt.title('Lower Unemployment Areas tend to host companies that are still open')
plt.xlabel('unemployment')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[70]:
Text(0,0.5,'Density')

The plot shows that areas where companies are still open have a lower percentage of unemployment.

In [71]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('Areas where there is a greater % of economically active population tend to host open companies')
plt.xlabel('Economically active')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[71]:
Text(0,0.5,'Density')
  • The distributions are very similar.
  • The spike at around 0.6 suggests that areas where companies are still open also have a higher percentage of economically active people contributing to the workforce.
In [72]:
# values for which to subset
closed= ['closed','open']

# Iterate through the dataset and subset for open and closed companies
for i in closed:
    subset = total[total['closed'] == i]
    
    # Draw the density plot
    sns.distplot(subset['% economically active with level 4 qualification'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3},
                 label = 'areas where companies are '+i)
    
# Plot formatting
plt.legend(prop={'size': 11}, title = '')
plt.title('areas where there is a lower percentage of population with university degrees tend to host closed companies')
plt.xlabel('Higher level education')
plt.ylabel('Density')
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[72]:
Text(0,0.5,'Density')
  • The distribution shows that areas with a lower percentage of people holding a higher education degree (around 0.25) tend to host companies that have closed.
  • Conversely, in areas where about 75% of the economically active population holds a higher education degree, the density of open companies is higher than that of closed ones.

Machine Learning

Logistic Regression

Step 1: scale data

I am going to scale the data because the variables in the dataset are on different scales. For example, some are percentages, while others are counts (e.g. number of mortgages).

In [75]:
# scale data frame 
# columns to scale
toscale = ['nummortcharges', 'nummortoutstanding', \
'nummortpartsatisfied', 'nummortsatisfied', \
'population density','% economically active with level 4 qualification',\
'% economically active', '% economically active unemployed',\
'n. of companies in city', 'n. of companies in ward',\
'n. of universities in city']

#scale
scaled = pd.DataFrame(scale(total[toscale]),
                   index=total.index,
                   columns=toscale)
In [76]:
# check that each column now has (approximately) zero mean and unit standard deviation
scaled.describe().head(3)
Out[76]:
nummortcharges nummortoutstanding nummortpartsatisfied nummortsatisfied population density % economically active with level 4 qualification % economically active % economically active unemployed n. of companies in city n. of companies in ward n. of universities in city
count 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05 1.291120e+05
mean 3.874327e-17 9.685817e-18 -6.879131e-19 1.144687e-17 -9.179513e-17 2.888135e-16 5.608969e-16 -3.486894e-16 1.375826e-17 -6.339808e-17 5.354716e-17
std 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00 1.000004e+00
In [77]:
# add variables to scaled dataset 
scaled['closed'] = total['closed']
scaled['companycategory'] = total['companycategory']
In [78]:
# create dummy variables 
cat_vars=['companycategory']

for var in cat_vars:
    cat_list='var'+'_'+var
    cat_list = pd.get_dummies(scaled[var], prefix=var)
    data1=scaled.join(cat_list)
    data1 = data1.drop(var,1)
    scaled=data1

data_vars=scaled.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
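
For reference, the dummy-variable loop above collapses into a single pandas call (a sketch; get_dummies drops the original column and appends prefixed indicator columns, giving the same companycategory_* names):

# one-line equivalent of the dummy-variable loop
scaled = pd.get_dummies(scaled, columns=['companycategory'], prefix='companycategory')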
In [79]:
# code the response variable back to 0 and 1 for the regression
scaled['closed'] = scaled.apply(code_reverse, axis = 1)

Step 2: Implement model and see which variables are significant and which ones aren't

In [80]:
cols = ['nummortcharges', 'nummortoutstanding', 'nummortpartsatisfied',
       'nummortsatisfied', 'population density',
       '% economically active with level 4 qualification',
       '% economically active', '% economically active unemployed',
       'n. of companies in city', 'n. of companies in ward',
       'n. of universities in city',
       'companycategory_Community Interest Company', 'companycategory_Private',
       'companycategory_Public Limited Company']

X = scaled[cols]
y = scaled['closed']

logit_model=sm.Logit(y,X.astype(float))
result=logit_model.fit()
result.summary2()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.354193
         Iterations: 35
/Users/gioia/anaconda3/envs/gds/lib/python3.6/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Out[80]:
Model: Logit No. Iterations: 35.0000
Dependent Variable: closed Pseudo R-squared: 0.005
Date: 2018-09-25 18:25 AIC: 91489.1146
No. Observations: 129112 BIC: 91625.8727
Df Model: 13 Log-Likelihood: -45731.
Df Residuals: 129098 LL-Null: -45963.
Converged: 0.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
nummortcharges 0.6645 0.3749 1.7728 0.0763 -0.0702 1.3992
nummortoutstanding -0.6325 0.2761 -2.2908 0.0220 -1.1736 -0.0913
nummortpartsatisfied -0.0744 1485.9425 -0.0001 1.0000 -2912.4681 2912.3193
nummortsatisfied -0.3745 0.1952 -1.9186 0.0550 -0.7570 0.0081
population density -0.0274 0.0110 -2.4858 0.0129 -0.0489 -0.0058
% economically active with level 4 qualification -0.0528 0.0116 -4.5402 0.0000 -0.0756 -0.0300
% economically active 0.0549 0.0102 5.3803 0.0000 0.0349 0.0749
% economically active unemployed 0.1267 0.0117 10.8227 0.0000 0.1038 0.1497
n. of companies in city -0.4640 0.1370 -3.3862 0.0007 -0.7325 -0.1954
n. of companies in ward -0.0152 0.0113 -1.3377 0.1810 -0.0374 0.0071
n. of universities in city 0.4980 0.1363 3.6538 0.0003 0.2309 0.7652
companycategory_Community Interest Company -1.4441 4.1463 -0.3483 0.7276 -9.5707 6.6824
companycategory_Private -2.0613 4.1354 -0.4984 0.6182 -10.1666 6.0440
companycategory_Public Limited Company -2.4950 4.2638 -0.5852 0.5584 -10.8519 5.8619

The only significant variables are:

  • Number of mortgages outstanding
  • population density
  • % of economically active people with higher education degrees
  • % of economically active people over the whole population
  • % of economically active people that are unemployed
  • number of other tech companies in the same city
  • number of universities in the city
In [81]:
# keep only significant variables 
sig_cols = ['nummortoutstanding', 'population density',
       '% economically active with level 4 qualification',
       '% economically active', '% economically active unemployed',
       'n. of companies in city',
       'n. of universities in city',]

X = scaled[sig_cols]
y = scaled['closed']

logit_model=sm.Logit(y,X.astype(float))
result1=logit_model.fit()
result1.summary2()
Optimization terminated successfully.
         Current function value: 0.692488
         Iterations 4
Out[81]:
Model: Logit No. Iterations: 4.0000
Dependent Variable: closed Pseudo R-squared: -0.945
Date: 2018-09-25 18:25 AIC: 178831.0761
No. Observations: 129112 BIC: 178899.4552
Df Model: 6 Log-Likelihood: -89409.
Df Residuals: 129105 LL-Null: -45963.
Converged: 1.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
nummortoutstanding -0.0330 0.0058 -5.6821 0.0000 -0.0444 -0.0216
population density -0.0126 0.0069 -1.8199 0.0688 -0.0261 0.0010
% economically active with level 4 qualification -0.0210 0.0071 -2.9643 0.0030 -0.0350 -0.0071
% economically active 0.0196 0.0061 3.1922 0.0014 0.0076 0.0316
% economically active unemployed 0.0548 0.0077 7.1405 0.0000 0.0398 0.0699
n. of companies in city -0.2110 0.0749 -2.8186 0.0048 -0.3577 -0.0643
n. of universities in city 0.2240 0.0745 3.0047 0.0027 0.0779 0.3701
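
One caveat: sm.Logit(y, X) fits without an intercept unless a constant column is added explicitly, which is a likely cause of the negative pseudo R-squared above (the model fits worse than the intercept-only null). A refit sketch with a constant added:

# refit the reduced model with an intercept term
X_const = sm.add_constant(scaled[sig_cols].astype(float))
result1c = sm.Logit(y, X_const).fit()
print(result1c.summary2())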

Step 3: Train the regression

In [82]:
# split data into training set and testing set 
target = ['closed']
x_train, x_test, y_train, y_test = train_test_split(scaled[sig_cols],
                                                    scaled[target], 
                                                    test_size=0.20, random_state=0)
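
Note that the features were scaled on the full dataset before splitting, which lets test-set statistics leak into the scaler. A leakage-free pattern (a sketch, assuming the unscaled feature columns were kept around) fits the scaler on the training rows only:

from sklearn.preprocessing import StandardScaler

# learn mean/std from the training rows only, then reuse them on the test rows
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)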
In [83]:
# baseline model
# I am adding class_weight='balanced' because my dataset is not balanced.
logisticRegr = LogisticRegression(random_state=0, class_weight='balanced')
m1 = logisticRegr.fit(x_train, y_train.values.ravel())
In [84]:
# make prediction 
predictions_log = logisticRegr.predict(x_test)

Step 4: Test Accuracy of baseline model

Accuracy is defined as:

correct predictions / total number of data points

In [91]:
# accuracy 
score = logisticRegr.score(x_test, y_test)
print('The accuracy of the model is ' + str(score*100)+ ' %')
The accuracy of the model is 54.536653371025835 %

Step 5: Show confusion matrix

A confusion matrix shows the number of observations assigned to the correct class and the number assigned to the wrong class.

In [86]:
# confusion matrix
cm = metrics.confusion_matrix(y_test, predictions_log)

plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
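
Given the class imbalance (roughly 12% closed), overall accuracy hides per-class behaviour. A per-class breakdown (a sketch using sklearn's classification_report) is more informative:

from sklearn.metrics import classification_report

# precision, recall and F1 for each class separately
print(classification_report(y_test.values.ravel(), predictions_log,
                            target_names=['open', 'closed']))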

Step 6: Improve the model with random forest

In [87]:
RF = RandomForestRegressor().fit(x_train, y_train.values.ravel())
In [88]:
predictions_RF = RF.predict(x_test)
In [89]:
score_RF = RF.score(x_test, y_test)
print(score_RF)

Note that the score of a RandomForestRegressor is R-squared rather than classification accuracy, so it is not directly comparable with the logistic regression accuracy above. A like-for-like comparison needs a classifier, sketched below.
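
A minimal sketch of the classification counterpart (hyperparameters left at their defaults; class_weight='balanced' mirrors the logistic regression setup above):

from sklearn.ensemble import RandomForestClassifier

# random forest classifier with balanced class weights, matching the logistic setup
rf_clf = RandomForestClassifier(random_state=0, class_weight='balanced')
rf_clf.fit(x_train, y_train.values.ravel())
print('Random forest accuracy: {:.2%}'.format(rf_clf.score(x_test, y_test.values.ravel())))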