Classify Twitter's Tweets Based On Naive Bayes Algorithm

Introduction

The Naive Bayes classification algorithm is one of the most interesting algorithms in Machine Learning. It is a probabilistic method and has been very successful for classifying text documents, although any kind of object can be classified once a probabilistic model is specified. The algorithm is based on Bayes' theorem; strictly speaking, it is not a single algorithm but a family of algorithms. It comes under the category of supervised learning: it predicts labels for new data based on a training dataset.
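As a quick refresher, Bayes' theorem says P(A|B) = P(B|A) * P(A) / P(B). A minimal numeric sketch (the numbers below are made up purely for illustration, not taken from the tweet data):

```python
# Bayes' theorem: P(Sports | "match") = P("match" | Sports) * P(Sports) / P("match")
# Hypothetical numbers: 40% of tweets are Sports; the word "match" appears
# in 30% of Sports tweets but only 5% of non-Sports tweets.
p_sports = 0.4
p_match_given_sports = 0.3
p_match_given_other = 0.05

# Total probability of seeing "match" in any tweet.
p_match = p_match_given_sports * p_sports + p_match_given_other * (1 - p_sports)

# Posterior probability that a tweet containing "match" is about Sports.
p_sports_given_match = p_match_given_sports * p_sports / p_match
print(round(p_sports_given_match, 2))  # 0.8
```

Even though only 40% of tweets are Sports, seeing the word "match" pushes the probability up to 80%; Naive Bayes applies this update for every word in a tweet.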

Earlier, I explained the K-Nearest Neighbour algorithm. Conceptually, K-NN is based on the Euclidean distance formula, whereas Naive Bayes is based on the concept of probability.

Explanation

Let us take Twitter's tweets and build a classifier from them. This classifier will tell whether a tweet falls under the category "Politics" or "Sports".
 
Below is a basic example of tweet data that will be classified based on the text it contains (Table 1).
 
 Tweet Id | Text | Category
 294051752079159296 | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event’s history to be dismissed for 99? | Sports
 291019672701255681 | On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl | Politics
 305581742104932352 | Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM | Politics
 291520568396759041 | 10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. Final spot on the line #BBL02 | Sports

I have the below training data from Twitter's feed (Table 2).
 Tweet Id | Category | Text
 306624404287275009 | Sports | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event’s history to be dismissed for 99?
 306481199130505216 | Sports | Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch
 304353716117590016 | Politics | @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad
 304844614517547008 | Politics | Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no cctvs/blame game and innocents dead.
 
Below is the unclassified tweet test data (Table 3).
 
 Tweet Id | Text
 301733794770190336 | RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyon ...
 301576909517619200 | RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo
 256056214880919553 | What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs
 300248062209691648 | Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr

I will now classify the test data of Table 3 with the Naive Bayes algorithm, implemented in Python 3, using the data of Table 2 as the training set. First, let us extract the important words from each tweet, as below.
import string

def extract_tweet_words(tweet_words):
    # Keep only alphanumeric runs of length >= 2, lower-cased;
    # punctuation, URLs, and single characters are dropped.
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
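For example, running the tokenizer on a tweet from Table 1 shows how hashtags, punctuation, and URLs get broken apart (the function definition is repeated here so the snippet runs on its own):

```python
import string

# Definition repeated from the article so this snippet is self-contained.
def extract_tweet_words(tweet_words):
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words

tokens = extract_tweet_words("Video of last week's hot topics: #2pack http://t.co/9GVBa315vM".split())
print(tokens)
# ['video', 'of', 'last', 'week', 'hot', 'topics', '2pack', 'http', 'co', '9gvba315vm']
```

Note that URL fragments like 'http' and 'co' survive as words; a more aggressive tokenizer could filter these out, but the simple version above is what the rest of the article uses.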
Get the training data from the tweets file.
def get_tweet_training_data():
    # Each line of training.txt holds: <tweet_id> <label> <text ...>
    f = open('training.txt', 'r')
    training_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])

    f.close()

    return training_data
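To make the expected file layout concrete, here is how one line of the (assumed) `training.txt` format from Table 2 splits apart:

```python
# Hypothetical line from training.txt: "<tweet_id> <label> <text ...>"
line = "306481199130505216 Sports Tonight's Scottish First Division match"
tweet_details = line.split()

tweet_id = tweet_details[0]     # first token is the tweet id
tweet_label = tweet_details[1]  # second token is the category label
text_tokens = tweet_details[2:] # everything after that is the tweet text

print(tweet_id, tweet_label)  # 306481199130505216 Sports
print(text_tokens[:2])        # ["Tonight's", 'Scottish']
```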
Get the test data from the tweets file; this is the data that will be classified.
def get_tweet_test_data():
    # Each line of test.txt holds: <tweet_id> <text ...> (no label yet)
    f = open('test.txt', 'r')
    validation_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split(' ')
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        validation_data.append([tweet_id, '', tweet_words])

    f.close()

    return validation_data
Get the list of unique words in the training data.
def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
Get the probability of each word in the training data.
def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}

    # Start every count at 1 (add-one smoothing) so that a word never
    # seen with a given label still gets a small non-zero probability.
    for word in words:
        freq[word] = 1

    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count

    return prob
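The add-one initialization matters: without it, a word that never appears in, say, a Sports tweet would get probability zero and wipe out the whole product for that label. A tiny demonstration on made-up data (definitions repeated so the snippet runs standalone):

```python
# Definitions repeated from the article so this snippet is self-contained.
def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))

def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}
    for word in words:
        freq[word] = 1  # add-one smoothing
    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1
    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count
    return prob

# Hypothetical two-tweet training set: [tweet_id, label, words]
training_data = [
    ['1', 'Sports', ['match', 'over']],
    ['2', 'Politics', ['vote']],
]
sports = get_tweet_word_prob(training_data, 'Sports')
print(sports['vote'])   # 0.5 -- never seen in Sports, but not zero
print(sports['match'])  # 1.0
```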
Get probability of a given label.
def get_tweet_label_count(training_data, label):
    # Despite the name, this returns the fraction of tweets carrying the
    # given label, i.e. the prior probability of that label.
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count
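On a small made-up training set, the label prior comes out as a simple fraction (definition repeated so the snippet runs standalone):

```python
# Definition repeated from the article so this snippet is self-contained.
def get_tweet_label_count(training_data, label):
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count

# Hypothetical training set: 2 Sports tweets, 1 Politics tweet.
training_data = [
    ['1', 'Sports', ['match']],
    ['2', 'Sports', ['over']],
    ['3', 'Politics', ['vote']],
]
prior_sports = get_tweet_label_count(training_data, 'Sports')
print(prior_sports)  # 2/3 of the tweets are Sports
```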
Apply the Naive Bayes model as below.
def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        # Start from the label priors and multiply in the probability of
        # each known word under each label; unknown words are skipped.
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]

        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
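One caveat worth knowing, though it is not part of the code above: multiplying many probabilities below 1 can underflow to zero for long texts. A common refinement is to compare sums of log-probabilities instead, which gives the same ranking. A minimal sketch, with hypothetical word probabilities:

```python
import math

# Hypothetical per-label word probabilities for illustration only.
sports_word_prob = {'match': 0.3, 'over': 0.2}
politics_word_prob = {'match': 0.05, 'over': 0.1}

def log_score(words, label_prior, word_prob):
    # log(prior * p1 * p2 * ...) = log(prior) + log(p1) + log(p2) + ...
    score = math.log(label_prior)
    for w in words:
        if w in word_prob:
            score += math.log(word_prob[w])
    return score

words = ['match', 'over']
s = log_score(words, 0.5, sports_word_prob)
p = log_score(words, 0.5, politics_word_prob)
print('Sports' if s >= p else 'Politics')  # Sports
```

Since log is monotonic, the label with the higher log score is always the label with the higher probability, but the sums stay in a numerically safe range.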
Print the labelled (categorized) test data as below.
def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))

    f_out.close()
Read the training and test data as below.
training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
Get the probability of each word.
word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
Get the probability of each label.
sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
Normalize for stop words.
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
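The effect of this normalization is that a stop word that is equally common in both classes ends up with a ratio near 1 under each label, so it barely changes the product, while class-specific words keep ratios far from 1. A small demonstration with hypothetical probabilities:

```python
# Hypothetical probabilities: 'the' is equally common everywhere,
# 'match' is twice as common in Sports as overall.
word_prob = {'the': 0.10, 'match': 0.02}
sports_word_prob = {'the': 0.10, 'match': 0.04}
politics_word_prob = {'the': 0.10, 'match': 0.005}

# Same normalization loop as in the article.
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob

print(sports_word_prob['the'])      # 1.0  -- stop word, neutral
print(sports_word_prob['match'])    # 2.0  -- boosts Sports
print(politics_word_prob['match'])  # 0.25 -- penalizes Politics
```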
Label the test data and print it.
test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
Example output of this algorithm will look like the below.
 
 Tweet Id | Category
 301733794770190336 | Politics
 301576909517619200 | Politics
 305057161682227200 | Sports
 286543227178328066 | Politics
 
I have attached the complete Python code along with the test data, the training data, and the categorized/labelled output data. You can also generate the output data yourself by running this Python code.
 
Prerequisites for running this code:
  1. Python 3.5
  2. Jupyter Notebook (recommended)
Conclusion

The Naive Bayes algorithm is based on probability and is very good at labelling data.
