This project is part of the coursework for Dr. Nikos Altras's Natural Language Processing class.
The goal of this project is to implement the binary perceptron algorithm and apply it to sentiment analysis of movie reviews. The data set used for this analysis can be found here; it contains 1000 positive and 1000 negative movie reviews.
This report is structured as follows:
The perceptron is a classic learning algorithm based on a neural model of learning, and it is used to learn feature weights. The algorithm has two main characteristics:
The algorithm has a training and a testing phase. The training phase starts by assigning a weight of zero to every feature. This zero weight vector, along with the labelled training set, is fed into the perceptron training algorithm. For each example in the training set, the algorithm makes a prediction using a prediction function and checks whether the prediction is correct. If it is, it does nothing. If it is wrong, the algorithm changes the parameters (in this case the feature weights) so that it is more likely to be right the next time it sees the same example. It then moves to the next example, repeats the procedure, and continues until the end of the data set. At this point the algorithm loops back over the data for a pre-specified number of iterations. After these multiple passes through the data set, the algorithm returns a vector of updated weights.
The performance of the algorithm is assessed in the testing phase. This time, the algorithm uses the updated feature weights returned by the training phase to make predictions over an unseen data set. Accuracy is then calculated as the number of correctly classified instances divided by the total number of examples in the data set.
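As a rough, self-contained sketch of this train/test procedure (the function names, feature representation and toy data below are illustrative only, not the implementation used later in this report), the whole loop fits in a few lines of Python:
from collections import Counter

# minimal sketch of the perceptron train/test loop described above
def predict(weights, features):
    # score = sum of (frequency * weight) over the words of the example
    score = sum(count * weights.get(word, 0.0) for word, count in features.items())
    return 1 if score >= 0 else -1

def train(examples, iterations=5):
    weights = {}
    for _ in range(iterations):
        for features, label in examples:
            if predict(weights, features) != label:  # update only on mistakes
                for word, count in features.items():
                    weights[word] = weights.get(word, 0.0) + label * count
    return weights

def test(examples, weights):
    correct = sum(predict(weights, features) == label for features, label in examples)
    return correct / len(examples)

# toy data: one positive and one negative "review"
toy = [(Counter("a truly great film".split()), 1),
       (Counter("a truly boring film".split()), -1)]
print(test(toy, train(toy)))  # 1.0 on this toy data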
Each review in the data set is labelled +1 when positive and -1 when negative. In this specific case, in its training phase, the perceptron aims at updating the weight of each word that makes up a movie review. The idea is that each review is made up of a selection of words, and each word has a weight and a frequency. The score of a word is its weight multiplied by its frequency, and the overall score of a review is the sum of the scores of all the words it contains.
To make a prediction, the algorithm calculates the score for each review and uses the sign function (sign(score)) to return +1 if the score is positive or -1 if the score is negative. The training algorithm uses these predictions to update the weight of each word, and the testing algorithm assesses whether the updated weights make correct predictions on unseen data.
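As a quick worked example of this scoring rule (the words, frequencies and weights below are invented purely for illustration), a review containing "great" twice and "boring" once would be scored and classified as follows:
# toy scoring example; the values are made up for illustration
toy_weights = {'great': 1.5, 'boring': -2.0}
toy_counts = {'great': 2, 'boring': 1}
toy_score = sum(toy_counts[w] * toy_weights[w] for w in toy_counts)  # 2*1.5 + 1*(-2.0) = 1.0
toy_label = 1 if toy_score >= 0 else -1  # sign(1.0) = +1, so the review is classified as positive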
# required packages
import os
import re
from collections import Counter
import numpy as np
from random import shuffle
from random import seed
import matplotlib.pyplot as plt
from operator import itemgetter
from datetime import datetime
counter_all = []
list_of_dictionaries = []
# create variables with path for positive and negative files
pospath = "./review_polarity/txt_sentoken/pos"
negpath = "./review_polarity/txt_sentoken/neg"
# function that reads the data from the directory and returns a list of texts.
def read_text(filepath):
    text_list = []
    for file in os.listdir(filepath):
        # generate the full file path
        fp = os.path.join(filepath, file)
        with open(fp, "r") as infile:
            # read the file, lowercase it and append it to the list
            text = infile.read()
            text = text.lower()
            text_list.append(text)
    return text_list
# function that generates ngrams.
# function taken from here: http://www.albertauyeung.com/post/generating-ngrams-python/
def generate_ngrams(s, n=1):
    # replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # break the sentence into tokens, removing empty tokens
    tokens = [token for token in s.split(" ") if token != ""]
    # use the zip function to generate n-grams,
    # concatenate the tokens into ngrams and return
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
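# Quick illustration (not part of the original script):
# generate_ngrams("the movie was great", n=2) returns
# ['the movie', 'movie was', 'was great'], while n=1 simply returns the tokens.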
# function that generates word frequencies
def get_BOW(label, n=1):
    # select the file path depending on the label
    if label == 1:
        filepath = pospath
    else:
        filepath = negpath
    repository = []
    # read data and return text list
    text_list = read_text(filepath)
    # for each text in the list,
    for text in text_list:
        # create empty dictionary
        dictionary = {}
        # create bag of words with counter
        counter = Counter(generate_ngrams(text, n))
        # append word counter to the dictionary
        dictionary['words'] = counter
        # label text and attach to dictionary
        dictionary['label'] = label
        # append counter to list of counters
        counter_all.append(counter)
        repository.append(counter)
        # append dictionary to list of all dictionaries
        list_of_dictionaries.append(dictionary.copy())
    return repository
# function that generates zero weights
def zero_weights(train_list):
    w_dict = {}
    # for each bag of words in the list,
    for BOW in train_list:
        # for each word in the bag of words,
        for word in BOW['words']:
            # take the word and assign it a weight of 0
            w = {word: 0.0}
            # update each word into the empty w_dict
            w_dict.update(w)
    return w_dict
# function that calculates the accuracy of the binary perceptron / test function.
# the input of this function is the test set and the updated weights.
# the test set is given in the form of a list of dictionaries containing
# labelled test text
def binary_perceptron(dict_list, W):
    correct = 0.0
    # for each element of the list
    for dictionary in dict_list:
        y = dictionary['label']
        score = 0.0
        # for each word in the element
        for word in dictionary['words']:
            # calculate word score
            score += dictionary['words'][word] * W[word]
        # print(score)
        # print("positive") if score >= 0.0 else print("negative")
        # make score either positive or negative
        score = np.sign(score)
        if score == 0:
            score = 1
        if score == y:
            correct += 1
        else:
            continue
    # calculate correct instances over all instances
    accuracy = correct / (int(len(dict_list)))
    # return accuracy
    return accuracy
# function that updates the weights / trains the model.
# the input of the function is the training data set and the zero weights dictionary
def update_weights(train_list, W):
    # for each text in the list
    for dictionary in train_list:
        y = dictionary['label']
        y_hat = 0.0
        # for each word in the text
        for word in dictionary['words']:
            # calculate prediction score for the word
            y_hat += dictionary['words'][word] * W[word]
        # make the prediction either 1 or -1
        if y_hat >= 0:
            y_hat = 1
        else:
            y_hat = -1
        # if the prediction is different from the actual label,
        if y_hat != y:
            # if the actual label is 1
            if y == 1:
                # update the weight by adding the word frequency to it
                for word in dictionary['words']:
                    W[word] += dictionary['words'][word]
            # else subtract
            else:
                for word in dictionary['words']:
                    W[word] -= dictionary['words'][word]
    # return the updated weights
    return W
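# Illustrative example (not part of the original script) of a single pass of
# update_weights over two toy reviews: with all weights at zero, every score is
# 0 and every prediction is +1, so only the misclassified negative review
# triggers an update.
# toy_set = [{'words': Counter({'boring': 1}), 'label': -1},
#            {'words': Counter({'great': 2}), 'label': 1}]
# update_weights(toy_set, zero_weights(toy_set))  # -> {'boring': -1.0, 'great': 0.0}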
# function that updates the weights while shuffling the data
def update_weights_shuffle(dict_list, W):
    seed(3333)  # set seed so that results are reproducible
    shuffle(dict_list)  # shuffle
    shuffled = update_weights(dict_list, W)  # update
    return shuffled
# function that updates the weights from shuffled data and with multiple passes
def update_weights_iterations(dict_list, W, iterations=10):
    for i in range(iterations + 1):
        shuffled = update_weights_shuffle(dict_list, W)
    return shuffled
# this function returns a list of accuracy scores, one after each iteration
def print_accuracy(train_set, test_set, W, iterations):
    accuracies = []
    for i in range(iterations + 1):
        update = update_weights_shuffle(train_set, W)
        accuracy = binary_perceptron(test_set, update)
        accuracies.append(accuracy)
    return accuracies
# this function returns the average weights after n iterations
def get_average_weights(dict_list, W, iterations=10):
    # create list of dictionaries
    ws = []
    for i in range(iterations + 1):
        shuffled = update_weights_shuffle(dict_list, W)
        # once the weights are updated, append them to the list of dictionaries
        ws.append(shuffled.copy())
    # take the average of the values of the dictionaries in the list
    # code from stackoverflow:
    # https://stackoverflow.com/questions/34139512/average-value-in-multiple-dictionaries-based-on-key-in-python
    sums = Counter()
    counters = Counter()
    for itemset in ws:
        sums.update(itemset)
        counters.update(itemset.keys())
    avg = {x: float(sums[x]) / counters[x] for x in sums.keys()}
    # return average
    return avg
def initialize(n=1):
    pos = get_BOW(label=1, n=n)
    neg = get_BOW(label=-1, n=n)
    return pos, neg
pos_BOW, neg_BOW = initialize(n=int(input('type 1 for unigrams, 2 for bigrams, 3 for trigrams')))
# create dictionary of zero weights
zero_weight = zero_weights(list_of_dictionaries)
# get accuracy with 0 weights
pred = binary_perceptron(list_of_dictionaries, zero_weight)
# check accuracy with 0 weights
print('Accuracy Score with zero weights is: ', str(pred * 100), "%\n")
# merge the positive and negative dictionaries to form the training and test sets.
train_set = list_of_dictionaries[0:800] + list_of_dictionaries[1000:1800]
test_set = list_of_dictionaries[800:1000] + list_of_dictionaries[1800:2000]
# update the weights
new_W = update_weights(train_set, zero_weight)
# check accuracy
pred_new_W = binary_perceptron(test_set, new_W)
print('Accuracy Score when weights are updated is: ', str(pred_new_W * 100), '%\n')
new_W_SH = update_weights_shuffle(train_set, zero_weights(list_of_dictionaries))  # update the weights with the shuffle function
pred_new_W_sh = binary_perceptron(test_set, new_W_SH)  # check accuracy
print('Accuracy score with shuffled data is: ', str(pred_new_W_sh * 100), '%\n')
iterations = 15 # update w with 15 iterations
new_W_iter = update_weights_iterations(train_set, zero_weights(list_of_dictionaries), iterations)
pred_new_W_iter = binary_perceptron(test_set, new_W_iter) # check accuracy
print('Accuracy score with shuffled data and calculated over ', str(iterations), ' iterations: ',
      str(pred_new_W_iter * 100), '%\n')
# make graph to show the learning process at each iteration
x = list(range(iterations + 1))  # put iterations on the x axis
y = print_accuracy(train_set, test_set, zero_weights(list_of_dictionaries), iterations)  # put accuracy scores on the y axis
## plot the accuracy per iteration.
fig, ax = plt.subplots()
plt.xlabel('iterations')
plt.ylabel('accuracy score')
plt.title('Accuracy increases as the number of iterations increases')
plt.grid(True)
ax.plot(x, y, '-', lw=2)
# save image in directory
directory = './review_polarity/txt_sentoken/'
plt.savefig(directory + 'image.png', dpi=300)
# show graph
plt.show()
new_W_av = get_average_weights(train_set, zero_weights(list_of_dictionaries), iterations)  # update w
pred_new_W_avg = binary_perceptron(test_set, new_W_av)  # make new prediction and check for accuracy
print('Accuracy Score with weights averaged over ', str(iterations), ' iterations: ', str(pred_new_W_avg * 100), '% \n')
# what are the most positive and the most negative words?
positive = sorted(new_W_av.items(), key=itemgetter(1))[::-1]
negative = sorted(new_W_av.items(), key=itemgetter(1))
print('Top 10 positive Words: ', positive[:10], '\n')
print('Top 10 negative words: ', negative[:10], '\n')