Classify Twitter's Tweets Based On Naive Bayes Algorithm

Introduction

The Naive Bayes classification algorithm is one of the most interesting algorithms in Machine Learning. It is a probabilistic method and has been very successful for classifying text documents, although any kind of object can be classified once a probabilistic model is specified. The algorithm is based on Bayes' theorem; strictly speaking, it is not a single algorithm but a family of algorithms. It comes under the category of supervised learning: it predicts labels for new data based on a training dataset.
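As a quick refresher, Bayes' theorem says P(A|B) = P(B|A) * P(A) / P(B). A minimal numeric sketch (the numbers below are made up purely for illustration, not taken from the tweet data):

```python
# Bayes' theorem: P(Sports | "match") = P("match" | Sports) * P(Sports) / P("match")
# Hypothetical numbers: 40% of tweets are Sports; the word "match" appears
# in 30% of Sports tweets but only 5% of non-Sports tweets.
p_sports = 0.4
p_match_given_sports = 0.3
p_match_given_other = 0.05

# Total probability of seeing "match" in any tweet.
p_match = p_match_given_sports * p_sports + p_match_given_other * (1 - p_sports)

# Posterior probability that a tweet containing "match" is about Sports.
p_sports_given_match = p_match_given_sports * p_sports / p_match
print(round(p_sports_given_match, 2))  # 0.8
```

Even though only 40% of tweets are Sports, seeing the word "match" pushes the probability up to 80%; Naive Bayes applies this update for every word in a tweet.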

Earlier, I explained the K-Nearest Neighbour algorithm. Conceptually, K-NN is based on the Euclidean distance formula, whereas Naive Bayes is based on the concept of probability.

Explanation

Let us take Twitter's tweets and build a classifier from them. This classifier will tell whether a tweet falls under the category "Politics" or "Sports".
 
Below is a basic example of tweet data that will be classified based on the text it contains (Table 1).
 
 Tweet Id | Text | Category
 294051752079159296 | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event’s history to be dismissed for 99? | Sports
 291019672701255681 | On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl | Politics
 305581742104932352 | Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM | Politics
 291520568396759041 | 10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. Final spot on the line #BBL02 | Sports

I have the below training data from Twitter's feed (Table 2).
 Tweet Id | Category | Text
 306624404287275009 | Sports | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event’s history to be dismissed for 99?
 306481199130505216 | Sports | Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch
 304353716117590016 | Politics | @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad
 304844614517547008 | Politics | Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no cctvs/blame game and innocents dead.
 
Below is the unclassified tweet test data (Table 3).
 
 Tweet Id | Text
 301733794770190336 | RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyon ...
 301576909517619200 | RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo
 256056214880919553 | What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs
 300248062209691648 | Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr

I will now classify the test data of Table 3 with the Naive Bayes algorithm, implemented in Python 3, using the data of Table 2 as the training set. First, let us extract the important words from each tweet, as below.
import string

def extract_tweet_words(tweet_words):
    # Keep only alphanumeric runs of length >= 2, lower-cased;
    # punctuation, URLs, and single characters are dropped.
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
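For example, running the tokenizer on a tweet from Table 1 shows how hashtags, punctuation, and URLs get broken apart (the function definition is repeated here so the snippet runs on its own):

```python
import string

# Definition repeated from the article so this snippet is self-contained.
def extract_tweet_words(tweet_words):
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words

tokens = extract_tweet_words("Video of last week's hot topics: #2pack http://t.co/9GVBa315vM".split())
print(tokens)
# ['video', 'of', 'last', 'week', 'hot', 'topics', '2pack', 'http', 'co', '9gvba315vm']
```

Note that URL fragments like 'http' and 'co' survive as words; a more aggressive tokenizer could filter these out, but the simple version above is what the rest of the article uses.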
Get the training data from the tweets file.
def get_tweet_training_data():
    # Each line of training.txt holds: <tweet_id> <label> <text ...>
    f = open('training.txt', 'r')
    training_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])

    f.close()

    return training_data
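To make the expected file layout concrete, here is how one line of the (assumed) `training.txt` format from Table 2 splits apart:

```python
# Hypothetical line from training.txt: "<tweet_id> <label> <text ...>"
line = "306481199130505216 Sports Tonight's Scottish First Division match"
tweet_details = line.split()

tweet_id = tweet_details[0]     # first token is the tweet id
tweet_label = tweet_details[1]  # second token is the category label
text_tokens = tweet_details[2:] # everything after that is the tweet text

print(tweet_id, tweet_label)  # 306481199130505216 Sports
print(text_tokens[:2])        # ["Tonight's", 'Scottish']
```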
Get the test data from the tweets file; this is the data that will be classified.
def get_tweet_test_data():
    # Each line of test.txt holds: <tweet_id> <text ...> (no label yet)
    f = open('test.txt', 'r')
    validation_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split(' ')
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        validation_data.append([tweet_id, '', tweet_words])

    f.close()

    return validation_data
Get the list of unique words in the training data.
def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
Get the probability of each word in the training data.
def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}

    # Start every count at 1 (add-one smoothing) so that a word never
    # seen with a given label still gets a small non-zero probability.
    for word in words:
        freq[word] = 1

    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count

    return prob
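The add-one initialization matters: without it, a word that never appears in, say, a Sports tweet would get probability zero and wipe out the whole product for that label. A tiny demonstration on made-up data (definitions repeated so the snippet runs standalone):

```python
# Definitions repeated from the article so this snippet is self-contained.
def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))

def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}
    for word in words:
        freq[word] = 1  # add-one smoothing
    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1
    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count
    return prob

# Hypothetical two-tweet training set: [tweet_id, label, words]
training_data = [
    ['1', 'Sports', ['match', 'over']],
    ['2', 'Politics', ['vote']],
]
sports = get_tweet_word_prob(training_data, 'Sports')
print(sports['vote'])   # 0.5 -- never seen in Sports, but not zero
print(sports['match'])  # 1.0
```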
Get probability of a given label.
def get_tweet_label_count(training_data, label):
    # Despite the name, this returns the fraction of tweets carrying the
    # given label, i.e. the prior probability of that label.
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count
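On a small made-up training set, the label prior comes out as a simple fraction (definition repeated so the snippet runs standalone):

```python
# Definition repeated from the article so this snippet is self-contained.
def get_tweet_label_count(training_data, label):
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count

# Hypothetical training set: 2 Sports tweets, 1 Politics tweet.
training_data = [
    ['1', 'Sports', ['match']],
    ['2', 'Sports', ['over']],
    ['3', 'Politics', ['vote']],
]
prior_sports = get_tweet_label_count(training_data, 'Sports')
print(prior_sports)  # 2/3 of the tweets are Sports
```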
Apply the Naive Bayes model as below.
def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        # Start from the label priors and multiply in the probability of
        # each known word under each label; unknown words are skipped.
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]

        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
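One caveat worth knowing, though it is not part of the code above: multiplying many probabilities below 1 can underflow to zero for long texts. A common refinement is to compare sums of log-probabilities instead, which gives the same ranking. A minimal sketch, with hypothetical word probabilities:

```python
import math

# Hypothetical per-label word probabilities for illustration only.
sports_word_prob = {'match': 0.3, 'over': 0.2}
politics_word_prob = {'match': 0.05, 'over': 0.1}

def log_score(words, label_prior, word_prob):
    # log(prior * p1 * p2 * ...) = log(prior) + log(p1) + log(p2) + ...
    score = math.log(label_prior)
    for w in words:
        if w in word_prob:
            score += math.log(word_prob[w])
    return score

words = ['match', 'over']
s = log_score(words, 0.5, sports_word_prob)
p = log_score(words, 0.5, politics_word_prob)
print('Sports' if s >= p else 'Politics')  # Sports
```

Since log is monotonic, the label with the higher log score is always the label with the higher probability, but the sums stay in a numerically safe range.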
Print the labelled (categorized) test data as below.
def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))

    f_out.close()
Read the training and test data as below.
training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
Get the probability of each word.
word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
Get the probability of each label.
sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
Normalize for stop words.
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
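The effect of this normalization is that a stop word that is equally common in both classes ends up with a ratio near 1 under each label, so it barely changes the product, while class-specific words keep ratios far from 1. A small demonstration with hypothetical probabilities:

```python
# Hypothetical probabilities: 'the' is equally common everywhere,
# 'match' is twice as common in Sports as overall.
word_prob = {'the': 0.10, 'match': 0.02}
sports_word_prob = {'the': 0.10, 'match': 0.04}
politics_word_prob = {'the': 0.10, 'match': 0.005}

# Same normalization loop as in the article.
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob

print(sports_word_prob['the'])      # 1.0  -- stop word, neutral
print(sports_word_prob['match'])    # 2.0  -- boosts Sports
print(politics_word_prob['match'])  # 0.25 -- penalizes Politics
```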
Label the test data and print it.
test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
Example output of this algorithm will look like the below.
 
 Tweet Id | Category
 301733794770190336 | Politics
 301576909517619200 | Politics
 305057161682227200 | Sports
 286543227178328066 | Politics
 
I have attached the complete Python code along with the test data, the training data, and the categorized/labelled output data. You can also generate the output data yourself by running this Python code.
 
Prerequisites for running this code:
  1. Python 3.5
  2. Jupyter Notebook (recommended)
Conclusion

The Naive Bayes algorithm is based on probability and is very good at labelling data.
