Introduction
Naive Bayes is a very interesting classification algorithm in machine learning. It is a probabilistic method that is very successful at classifying text documents, although any kind of object can be classified given a suitable probabilistic model. The algorithm is based on Bayes' theorem, and it is not a single algorithm but a family of algorithms. It falls under the category of supervised learning: it predicts labels for new data based on a training dataset.
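At its core, Bayes' theorem relates the probability of a category given some text to the probability of the text given the category:

P(Category | Text) = P(Text | Category) * P(Category) / P(Text)

The "naive" part is the assumption that the words in the text are conditionally independent given the category, so P(Text | Category) can be approximated by the product of the individual word probabilities P(word | Category).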
Earlier, I explained the K-Nearest Neighbour algorithm. Conceptually, K-NN is based on the Euclidean distance formula, whereas Naive Bayes is based on the concept of probability.
Explanation
Let us take Twitter's tweets and build a classifier based on the given tweets. This classifier will tell whether a tweet falls under the category "Politics" or "Sports".
A basic example of the tweet data that will be classified based on the text it contains is shown below (Table 1).
| Tweet Id | Text | Category |
|---|---|---|
| 294051752079159296 | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99? | Sports |
| 291019672701255681 | On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl | Politics |
| 305581742104932352 | Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM | Politics |
| 291520568396759041 | 10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. Final spot on the line #BBL02 | Sports |
I have the below training data from Twitter's feed (Table 2).
| Tweet Id | Category | Text |
|---|---|---|
| 306624404287275009 | Sports | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99? |
| 306481199130505216 | Sports | Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch |
| 304353716117590016 | Politics | @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad |
| 304844614517547008 | Politics | Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no cctvs/blame game and innocents dead. |
Below is the tweet test data that is still unclassified (Table 3).
| Tweet Id | Text |
|---|---|
| 301733794770190336 | RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyon ... |
| 301576909517619200 | RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo |
| 256056214880919553 | What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs |
| 300248062209691648 | Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr |
I will classify the test data of Table 3 using the Naive Bayes algorithm and the training data of Table 2, with the help of Python 3 code. Let us first extract the important words from the tweeted sentences, like below.
```python
import string

def extract_tweet_words(tweet_tokens):
    # Split raw tweet tokens into lowercase alphanumeric words of
    # length >= 2, dropping punctuation, hashtag symbols and emoji.
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for token in tweet_tokens:
        cur_word = ''
        for c in token:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                # A non-alphanumeric character ends the current word.
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
```
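As a quick sanity check (a hypothetical call, not part of the attached script), the function strips the hashtag symbol and punctuation and keeps the alphanumeric parts:

```python
print(extract_tweet_words(['99', 'days', 'to', 'go', '#ct13.']))
# ['99', 'days', 'to', 'go', 'ct13']
```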
Get the training data from the tweet file.
```python
def get_tweet_training_data():
    # Each line of training.txt holds: <tweet_id> <label> <text ...>
    f = open('training.txt', 'r')
    training_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])
    f.close()
    return training_data
```
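For reference, the readers in this article assume whitespace-separated lines; based on how each line is split above, a row of training.txt would look like the following (the actual files ship with the attached code):

```
306624404287275009 Sports 99 days to go until the start of #ct13. Did you know ...
```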
Get the test data (the tweets to be classified) from the tweet file.
```python
def get_tweet_test_data():
    # Each line of test.txt holds: <tweet_id> <text ...> (no label yet)
    f = open('test.txt', 'r')
    test_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        test_data.append([tweet_id, '', tweet_words])
    f.close()
    return test_data
```
Get the list of unique words in the training data.
```python
def get_words(training_data):
    # Collect the unique vocabulary across all training tweets.
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
```
Get the probability of each word in the training data.
```python
def get_tweet_word_prob(training_data, label=None):
    # Estimate P(word | label); with label=None, estimate P(word) overall.
    words = get_words(training_data)

    # Start every count at 1 (add-one smoothing) so that a word
    # unseen under a label never gets a probability of zero.
    freq = {}
    for word in words:
        freq[word] = 1

    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count

    return prob
```
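To see the add-one smoothing at work, here is a toy run with made-up data (not the article's files):

```python
toy_data = [
    ['1', 'Sports',   ['goal', 'match']],
    ['2', 'Politics', ['vote', 'match']],
]
sports = get_tweet_word_prob(toy_data, 'Sports')
# total_count = 2 (words in the Sports tweet); every word starts at count 1.
# sports['goal'] == 2/2 = 1.0   (1 from smoothing + 1 occurrence)
# sports['vote'] == 1/2 = 0.5   (smoothing only; 'vote' never appears in Sports)
```

Because of the smoothing counts, the values are relative weights rather than a distribution summing to 1; that is fine here, because we only ever compare scores between the two labels.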
Get the prior probability of a given label.
```python
def get_tweet_label_count(training_data, label):
    # Despite its name, this returns the fraction of training tweets
    # with the given label, i.e. the prior P(label).
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count
```
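With just the four sample training tweets shown in Table 2 (two Sports, two Politics), both priors would come out to 0.5; a toy run mirroring those label counts:

```python
toy_training = [['1', 'Sports', []], ['2', 'Sports', []],
                ['3', 'Politics', []], ['4', 'Politics', []]]
print(get_tweet_label_count(toy_training, 'Sports'))  # 0.5
```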
Apply the Naive Bayes model, like below.
```python
def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        # Start from the priors, then multiply in the word weights
        # for every known word in the tweet; unknown words are skipped.
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]

        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
```
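One practical caveat: multiplying many probabilities below 1 can underflow to zero for long texts. A common variant, shown here only as a sketch and not used in the attached code, sums log-probabilities instead; the comparison is unchanged because the logarithm is monotonic:

```python
import math

def log_tweet_score(word_prob, label_prob, tweet_words):
    # log(a * b) = log(a) + log(b), so products become sums in log space.
    score = math.log(label_prob)
    for word in tweet_words:
        if word in word_prob:
            score += math.log(word_prob[word])
    return score
```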
Write the labelled (categorized) test data to an output file, like below.
```python
def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))
    f_out.close()
```
Read the training and test data like below.
```python
training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
```
Get the probability of each word.
```python
word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
```
Get the probability of each label.
```python
sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
```
Normalize for stop words: dividing each category's word probability by the word's overall probability downweights words such as "the" or "to" that are common in both categories, so they contribute little to the final comparison.
```python
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
```
Label the test data and print it.
```python
test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
```
Example output of this algorithm will look something like below.
| Tweet Id | Category |
|---|---|
| 301733794770190336 | Politics |
| 301576909517619200 | Politics |
| 305057161682227200 | Sports |
| 286543227178328066 | Politics |
I have attached the complete Python code along with the test data, training data, and the output categorized/labelled data. You can also generate the output data yourself by running this Python machine learning code.
Prerequisites for running this code:
- Python 3.5
- Jupyter Notebook (recommended)
Conclusion
The Naive Bayes algorithm is based on probability, and it works very well for labelling text data such as tweets.