Sentiment Analysis on Reddit Headlines: Intro to Python’s NLTK

Grab the code for this tutorial on GitHub.

In my last post, K-Means Clustering with Python, we just grabbed some precompiled data, but for this post I wanted to go deeper and actually gather some live data. Using the Reddit API, we can get thousands of headlines from various news subreddits and start to have some fun with sentiment analysis.

Sentiment analysis is the process of computationally identifying and categorizing the opinions expressed in a piece of text, typically as positive, negative, or neutral.

Gathering the Dataset

In this post, instead of providing you with the dataset itself, I'll show you how to gather your own data. Technically, you could download the finished text files right now, but I suggest building them yourself by following along. This tutorial is based on the latest political news headlines, gathered using Reddit's API.

Before we get started, you will need to install the Natural Language Toolkit (NLTK) Python package. Installation instructions are here: http://www.nltk.org/install.html.
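
For reference, a typical setup looks something like this: install the package with pip, then download the NLTK resources this tutorial relies on (the Vader lexicon, the stop word list, and the Punkt tokenizer models).

# One-time setup: install the package first (e.g. pip install nltk)
import nltk

nltk.download('vader_lexicon')   # lexicon used by the Vader sentiment analyzer
nltk.download('stopwords')       # English stop word list used later
nltk.download('punkt')           # models used by word_tokenize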

Let’s start with some basic rules about Reddit’s API. First of all, in order to gather data without Reddit flagging you as a bot, you need to sign up (https://www.reddit.com/register/). Then run the following code to fetch the JSON data:

import requests
import json
import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# Set your User-Agent header according to the form:
# <platform>:<app ID>:<version string> (by /u/<reddit username>)

# Add your username below
hdr = {'User-Agent': 'windows:r/politics.single.result:v1.0' +
       ' (by /u/)'}
url = 'https://www.reddit.com/r/politics/.json'
req = requests.get(url, headers=hdr)
json_data = json.loads(req.text)

Keep in mind that you need to follow Reddit’s API rules, which you can find here: https://github.com/reddit/reddit/wiki/API.

Two of the most important ones are:

  • Do not exceed the limit of 60 requests per minute
  • Do not lie about your User-Agent

Finally, if you want to commercialize your application you should definitely get Reddit’s approval before publishing.
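
If you want to be extra careful about the first rule, you can space the requests out yourself. Here is a minimal sketch (polite_get is my own hypothetical helper, not part of Reddit's API or of the code below) that sleeps between calls:

import time
import requests

def polite_get(url, headers, min_interval=1.0):
    """Fetch a URL, then pause so we stay well under 60 requests per minute."""
    resp = requests.get(url, headers=headers)
    time.sleep(min_interval)
    return resp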

Reddit's API returns JSON, so let's pretty-print it to get a better idea of our data structure:

posts = json.dumps(json_data['data']['children'], indent=4, sort_keys=True)
print(posts)

Out: Reddit API JSON Output
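
Each element of children wraps a post's fields in a 'data' dictionary. For example, the title we'll analyze and the 'name' field we'll use for pagination can be pulled out like this (a quick illustration, assuming the request above returned at least one post):

# Each child wraps the post's fields in a 'data' dictionary
first_post = json_data['data']['children'][0]['data']
print(first_post['title'])   # the headline text we'll analyze
print(first_post['name'])    # the fullname used as the 'after' parameter for paging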

We can get 25 posts per request, so we can use this simple while loop to get the number of posts we want. As written, the loop runs until it has gathered at least 100 headlines; bump the limit if you want more. Between requests we wait 2 seconds so we stay well under the 60 requests per minute rule. Finally, we concatenate the ‘name’ field shown above onto the URL to request the next page:

data_all = json_data['data']['children']
num_of_posts = 0
while len(data_all) <= 100:
    # Wait 2 seconds between requests to respect the rate limit
    time.sleep(2)
    # The 'name' of the last post becomes the 'after' parameter for the next page
    last = data_all[-1]['data']['name']
    url = 'https://www.reddit.com/r/politics/.json?after=' + str(last)
    req = requests.get(url, headers=hdr)
    data = json.loads(req.text)
    data_all += data['data']['children']
    # Stop if the request returned no new posts
    if num_of_posts == len(data_all):
        break
    else:
        num_of_posts = len(data_all)

Labeling our Data

Now that we have gathered our data, we need to categorize each headline as either positive or negative. Usually the dataset in a sentiment analysis project comes pre-labeled, but in our case I'm going to label the data using a text analyzer. This is where NLTK comes into play.

We're going to use NLTK’s built-in Vader sentiment analyzer, which rates a piece of text as positive, negative, or neutral using a lexicon of positive and negative words:

sia = SIA()
pos_list = []
neg_list = []
for post in data_all:
    res = sia.polarity_scores(post['data']['title'])
    print(res)

    if res['compound'] > 0.2:
        pos_list.append(post['data']['title'])
    elif res['compound'] < -0.2:
        neg_list.append(post['data']['title'])

with open("pos_news_titles.txt", "w", encoding='utf-8',
          errors='ignore') as f_pos:
    for post in pos_list:
        f_pos.write(post + "\n")

with open("neg_news_titles.txt", "w", encoding='utf-8',
          errors='ignore') as f_neg:
    for post in neg_list:
        f_neg.write(post + "\n")

Out: NLTK’s Vader Sentiment Analyzer Output

As you can see from above, each result consists of four scores: neg, neu, pos, and compound. The neg, neu, and pos values represent the proportion of the headline that falls into each category.

There's also the compound score, which lets us label our results in a more flexible way. It ranges from -1 (extremely negative) to 1 (extremely positive). For our purposes, I will consider posts with a compound value greater than 0.2 as positive and less than -0.2 as negative. This is just a simple assumption, so feel free to change it. But be careful, because there is a trade-off: if you choose a higher threshold, you might get cleaner results (fewer false positives and false negatives), but the number of labeled headlines will shrink significantly.

Walking through the code above: first we create a Sentiment Intensity Analyzer (SIA) to categorize our headlines. Next, we run a for loop over each post we gathered (a post’s structure is shown in the earlier image) and pass each post’s title to the analyzer, storing each individual result in the res variable. Finally, we gather all the negative and positive titles and write them to separate files.
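
To see the threshold trade-off mentioned above concretely, you can count how many headlines each cutoff keeps. A quick sketch reusing sia and data_all from above (the 0.5 cutoff is just an illustrative alternative):

# Compare how different compound thresholds split the headlines
for threshold in (0.2, 0.5):
    pos = neg = neu = 0
    for post in data_all:
        compound = sia.polarity_scores(post['data']['title'])['compound']
        if compound > threshold:
            pos += 1
        elif compound < -threshold:
            neg += 1
        else:
            neu += 1
    print(threshold, pos, neg, neu)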

Dataset Statistics

Sentiment Intensity Analyzer Categories Distribution

Before we go any further, let’s take a look at the distribution of categories. As you can see from the image above, the neutral category is dominant at over 45%. Next is negative with approximately 32%, and last is positive with 23%.

The large number of neutral headlines is due to two main reasons. First, and most importantly, the assumption I made earlier that headlines with a compound value between -0.2 and 0.2 are considered neutral; the wider that margin, the larger the number of neutral headlines. Second, I used a general-purpose lexicon to categorize political news. A more accurate approach would be a politics-specific lexicon, but that would be expensive to build, since a human with political knowledge would need to go through the headlines and label each one manually, so we work with what we have.

Another interesting observation is the number of negative headlines. As mentioned above, there are roughly 1.5 times as many negative headlines as positive ones, and this can be interpreted in many ways. Maybe it's the media’s behavior (exaggerating titles for clickbait), or maybe our analyzer mislabeled some neutral or positive headlines as negative.
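
For reference, the distribution shown above can be computed and plotted straight from the lists we built earlier. A sketch (your exact numbers will depend on when you pull the data, and the original chart's styling may differ):

import matplotlib.pyplot as plt

total = len(data_all)
counts = {'Neutral': total - len(pos_list) - len(neg_list),
          'Negative': len(neg_list),
          'Positive': len(pos_list)}

# Print the percentage of headlines in each category
for label, count in counts.items():
    print(label, round(100 * count / total, 1), '%')

# Bar chart of the category distribution
plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel('Number of headlines')
plt.show()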

Tokenizers & Stopwords

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import math

example = "This is an example sentence! However, it " \
          "is a very informative one,"

print(word_tokenize(example, language='english'))

NLTK Tokenize Output 1

Now that we've gathered and labeled the data, let's cover some preprocessing basics that will give us a clearer picture of what's in it.

First of all, let’s talk about tokenizers. Tokenization is the process of breaking a stream of text up into meaningful elements called tokens. You can tokenize a paragraph into sentences, a sentence into words, and so on. In our case we have headlines (sentences), so we'll use a word tokenizer:

print(word_tokenize(example, language='english'))
tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize(example))

NLTK Tokenize Output 2

As you can see, the first tokenizer treats punctuation as words. In some cases you might want to get rid of the punctuation, and if that’s the case, the RegexpTokenizer above does the trick.

There are many tokenizers available, so if the ones above don't do the job, you can check the full list here: http://www.nltk.org/api/nltk.tokenize.html.

Also notice that we tend to use words like ‘the’, ‘is’, ‘and’, ‘what’, etc. These are called stop words, and they are largely irrelevant to our goal of characterizing a text as positive or negative since they don’t carry any valuable information. NLTK ships stop word lists for several languages; in our case we're using ‘english’, and if you want to take a peek (approx. 150 words), just use the following code:

stop_words = set(stopwords.words('english'))
# print(stop_words)
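
As a quick usage example, here is the example sentence from above tokenized and then filtered against the stop word list; this is the same pattern we'll apply to the headlines next:

tokens = tokenizer.tokenize(example)
filtered = [w.lower() for w in tokens if w.lower() not in stop_words]
print(filtered)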

Word Distribution (Positive)

Next up, we need to gather and store all the positive words (meaning the words appearing in positive headlines) and try to extract some insight from them. To read and store the words we can do the following:

all_words_pos = []
with open("pos_news_titles.txt", "r", encoding='utf-8',
          errors='ignore') as f_pos:
    for line in f_pos.readlines():
        words = tokenizer.tokenize(line)
        for w in words:
            if w.lower() not in stop_words:
                all_words_pos.append(w.lower())

Now, let’s see the frequency of each word and try to extract any knowledge:

pos_res = nltk.FreqDist(all_words_pos)
print(pos_res.most_common(8))

Word Distribution Output 1

Whad'ya know, the most common words in positive headlines are 'donald' and 'trump'. That may come as a surprise to some of you (given the fuss around the President), but not all news regarding Donald Trump is ‘bad’. Also, don't forget that the results may contain some false positives. The second most interesting word is ‘health’, which could be due to the current debate around the new healthcare program, but let’s take a look to make sure:

Reddit Headlines Output 1

As expected, the importance of the word ‘health’ comes from the latest developments around the healthcare system. Another thing to notice is that, to someone with political knowledge, some of these headlines read as more negative than positive. Unfortunately, as mentioned before, that knowledge is not incorporated into our labeling.

But let’s look at the more macroscopic side. We can plot the frequency distribution and examine the pattern of the words as a whole rather than each word individually.

Positive Word Frequency Distribution

The chart above shows the frequency pattern. More specifically, the y-axis is the frequency of each word, and the x-axis is the words ranked by frequency. So the most frequent word, which in our case is ‘trump’, is plotted at (1, 111).

For some of you, that plot may look a bit familiar. That’s because it appears to follow a power-law distribution. To confirm it visually, we can use a log-log plot:

Positive Word Frequency Log-Log Plot

That is exactly what I expected: an almost straight line with a heavy (noisy) tail, which shows that our data fit Zipf’s law. In other words, the plot above shows that in our word distribution a small minority of the words appear very often, while the majority of words appear rarely.
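
For reference, the two plots above can be reproduced with something like the following sketch, using the matplotlib and math imports from the tokenizer section (the original figures' exact styling may differ):

import math
import matplotlib.pyplot as plt

# Word frequencies sorted from most to least common
pos_freqs = [count for _, count in pos_res.most_common()]
ranks = range(1, len(pos_freqs) + 1)

# Rank vs. frequency on a linear scale
plt.plot(ranks, pos_freqs)
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.show()

# The same data on a log-log scale to check against Zipf's law
plt.plot([math.log(r) for r in ranks],
         [math.log(f) for f in pos_freqs])
plt.xlabel('log(Rank)')
plt.ylabel('log(Frequency)')
plt.show()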

Word Distribution (Negative)

Now that we've examined the positive words, it’s time to shift to the negative ones. I followed the exact same methodology as with the positive words, so let’s get right to the point:

all_words_neg = []
with open("neg_news_titles.txt", "r", encoding='utf-8',
          errors='ignore') as f_neg:
    for line in f_neg.readlines():
        words = tokenizer.tokenize(line)
        for w in words:
            if w.lower() not in stop_words:
                all_words_neg.append(w.lower())
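
Mirroring the positive case, the frequency distribution for the negative words can be computed the same way:

neg_res = nltk.FreqDist(all_words_neg)
print(neg_res.most_common(8))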

Word Distribution Output 2

Well, the President does it again; he's also at the top of the negative words, which is to be expected. Interesting additions to the list are the words ‘ban’ and ‘obamacare’. Obamacare is the other side of the coin regarding the healthcare program. But let’s take a look at some of those headlines:

Reddit Headlines Output 2

As you can see, the importance of the word ‘ban’ was due to the travel ban issue and Hawaii’s response. The ‘obamacare’ frequency was due to the complaints and dramatic predictions about what the new healthcare system would bring.

Let’s examine the macroscopic side of the negative words distribution:

Negative Word Frequency Distribution Plot

Negative Word Frequency Distribution Log-Log Plot

The negative distribution fits Zipf’s law as well, with a slightly smoother slope, but the heavy tail is definitely there. The conclusion to be drawn here is exactly the same as for the positive distribution.

Conclusion

That's it for this tutorial. As you can see, Reddit's API makes it extremely easy to compile a lot of news headline data fairly quickly, and I hope this helps you get started gathering basic insights with sentiment analysis. There's still a lot to do with this data, so stay tuned: in the next tutorial, I will use this dataset to train a sentiment classifier.

Nikos Koufos

LearnDataSci Author, postgraduate in Computer Science & Engineering at the University Ioannina, Greece, and Computer Science undergraduate teaching assistant.
