Predicting Reddit News Sentiment with Naive Bayes and Other Text Classifiers
Learn how to predict the sentiment of news headlines mined from Reddit
You should already know:
- Python fundamentals
- Pandas and Matplotlib
- Basics of sentiment analysis
- Basic machine learning concepts
Learn each interactively with DataCamp
In our previous post, we covered some of the basics of sentiment analysis, where we gathered and categorized political headlines. Now we can use that data to train a binary classifier to predict whether a headline is positive or negative.
- Notebook: GitHub
- Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn, nltk, imblearn
A Brief Intro to Classification and Some Problems We Face
Classification is the process of identifying the category of a new, unseen observation based on a training set of data whose categories are known.
In our case, our headlines are the observations and the positive/negative sentiment are the categories. This is a binary classification problem -- we're trying to predict if a headline is either positive or negative.
First Problem: Imbalanced Dataset
One of the most common problems in machine learning is working with an imbalanced dataset. As we'll see below, we have a slightly imbalanced dataset, where there are more negatives than positives.
Compared to some problems, like fraud detection, our dataset isn't super imbalanced. Sometimes you'll have datasets where the positive class is only 1% of the training data, the rest being negatives.
We want to be careful when interpreting results from imbalanced data. When scoring our classifier, we may see accuracy as high as 90% and still have a poor model. This is commonly known as the Accuracy Paradox: the model can reach 90% accuracy simply by learning to always predict the majority (negative) class, without actually discriminating between classes.
There are a number of ways to counter this problem, such as:
- Collect more data: could help balance the dataset by adding more minority class examples.
- Change your metric: use the Confusion Matrix, Precision, Recall, or the F1 score (a combination of precision and recall).
- Oversample the data: randomly sample the attributes from examples in the minority class to create more 'fake' data.
- Penalized model: Implements an additional cost on the model for making classification mistakes on the minority class during training. These penalties bias the model towards the minority class.
In our dataset, we have fewer positive examples than negative examples, and we will explore both different metrics and an oversampling technique called SMOTE.
Let's establish a few basic imports:
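A typical setup looks like this (the aliases are the conventional ones; the exact set of imports is an assumption about the original notebook):

```python
# Core imports used throughout the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```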
These are basic imports used across the entire notebook, and are usually imported in every data science project. The more specific imports from sklearn and other libraries will be brought up when we use them.
Loading the Dataset
First let's load the dataset that we created in the last article:
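A sketch of the loading step. The filename and column names are assumptions, and we write a tiny stand-in file here so the snippet runs on its own; in the real notebook the CSV comes from the previous article:

```python
import pandas as pd

# Stand-in file so this snippet is self-contained.
pd.DataFrame(
    {"headline": ["example positive headline", "example negative headline"],
     "label": [1, -1]}
).to_csv("reddit_headlines_labels.csv", index=False)

# Load the labeled headlines saved in the previous article.
df = pd.read_csv("reddit_headlines_labels.csv")
print(df.head())
```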
| | headline | label |
|---|---|---|
| 0 | Gillespie Victory In Virginia Would Vindicate ... | 0 |
| 1 | Screw Ron Paul and all of his "if he can't aff... | -1 |
| 2 | Corker: Trump, 'perfectly fine,' with scrappin... | 1 |
| 3 | Concerning Recent Changes in Allowed Domains | 0 |
| 4 | Trump confidantes Bossie, Lewandowski urge aga... | -1 |
Now that we have the dataset in a dataframe, let's remove the neutral (0) headlines so we can focus on classifying only positive or negative:
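A minimal sketch of the filtering step, using a toy dataframe with the assumed `label` column:

```python
import pandas as pd

# Toy stand-in for the loaded dataframe.
df = pd.DataFrame({
    "headline": ["a", "b", "c", "d", "e"],
    "label": [0, -1, 1, 0, -1],
})

# Drop the neutral (label == 0) rows so only positive/negative remain.
df = df[df.label != 0]
print(df.label.value_counts())
```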
Our dataframe now only contains positive and negative examples, and we've confirmed again that we have more negatives than positives.
Let's move into featurization of the headlines.
Transform Headlines into Features
In order to train our classifier, we need to transform our headlines from words into numbers, since algorithms only know how to work with numbers.
To do this transformation, we're going to use `CountVectorizer` from sklearn. This is a very straightforward class for converting words into features. Unlike in the last tutorial, where we manually tokenized and lowercased the text, `CountVectorizer` handles this step for us. All we need to do is pass it the headlines.
Let's work with a tiny example to show how vectorizing words into numbers works:
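Something like the following, where the two sentences are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy headlines on a similar topic (invented for this example).
s1 = "Senate panel moving ahead with Mueller bill despite McConnell opposition"
s2 = "Bill protecting Robert Mueller to get vote despite McConnell opposition"

# binary=True records word presence (0/1) instead of word counts.
vect = CountVectorizer(binary=True)
X = vect.fit_transform([s1, s2])

print(X.toarray())
```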
What we've done here is take two headlines about a similar topic and vectorized them. `vect` is set up with default params to tokenize and lowercase words. On top of that, we have set `binary=True`, so we get an output of 0 (the word doesn't exist in that sentence) or 1 (the word exists in that sentence). `vect` builds a vocabulary from all the words it sees in all the text you give it, then assigns a 0 or 1 depending on whether each word exists in the current sentence. To see this more clearly, let's check out the feature names mapped to the first sentence:
This is the vectorization mapping of the first sentence. You can see that there's a 1 mapped to 'ahead' because 'ahead' shows up in `s1`. But if we look at the mapping for `s2`, there's a 0 at 'ahead' since that word doesn't show up in `s2`. Notice that each row contains every word seen so far.
When we expand this to all of the headlines in the dataset, this vocabulary will grow by a lot. Each mapping like the one printed above will end up being the length of all words the vectorizer encounters.
Let's now apply the vectorizer to all of our headlines:
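A sketch of this step with a stand-in corpus (in the notebook, the input is the full column of headlines):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny stand-in corpus; in the notebook this is the headline column.
headlines = [
    "hypothetical headline about politics",
    "another hypothetical headline",
    "a third headline about something else",
]

# Binary indicators, capped at the 1000 most frequent words.
vect = CountVectorizer(max_features=1000, binary=True)
X = vect.fit_transform(headlines)   # stored as a sparse matrix

print(X.shape)       # (number of headlines, vocabulary size)
print(X.toarray())   # dense view of the same matrix
```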
Notice that the vectorizer by default stores everything in a sparse array, and `X.toarray()` shows us the dense version. A sparse array is much more efficient here: most headlines are only a dozen or so words while each row has a slot for every word ever seen, so most values in each row are 0, and sparse arrays store only the indices of the non-zero values. You'll also notice a new keyword argument, `max_features`. This is essentially the number of words to consider, ranked by frequency, so a value of 1000 means we only want the 1000 most common words as features.
Now that we know how vectorization works, let's put it into action.
Preparing for Training
Before training, and even vectorizing, let's split our data into training and testing sets. It's important to do this before doing anything with the data so we have a fresh test set.
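The split itself, sketched with toy data in place of the real headlines and labels:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the real headlines and sentiment labels.
X = ["good news"] * 50 + ["bad news"] * 50
y = [1] * 50 + [-1] * 50

# Split before any vectorization so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(len(X_train), len(X_test))
```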
Our test size is 0.2, or 20%. This means that `X_test` and `y_test` contain 20% of our data, which we reserve for testing.
Let's now fit the vectorizer on the training set only and perform the vectorization.
Just to reiterate, it's important to not fit the vectorizer on all of the data since we want a clean test set for evaluating performance. Fitting the vectorizer on everything would result in data leakage, causing unreliable results since the vectorizer shouldn't know about future data.
We can fit the vectorizer and transform `X_train` in one step:
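A minimal sketch of the combined call, with toy training headlines:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the training headlines.
X_train = ["hypothetical headline one", "another training headline"]

vect = CountVectorizer(max_features=1000, binary=True)
# Fit the vocabulary on the training data and transform it in one call.
X_train_vect = vect.fit_transform(X_train)
print(X_train_vect.shape)
```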
`X_train_vect` is now in the right format to give to the Naive Bayes model, but let's first look into balancing the data.
Balancing the Data
It seems there are more negative headlines than positive headlines (hmm), and so we have more negative labels than positive labels. We can see from the counts above that the dataset is slightly imbalanced. If our model simply predicted -1, the larger class, every time, it would reach ~60% accuracy. In a binary classification problem where random chance is 50%, a 60% accuracy doesn't tell us much, so we definitely want to look at precision and recall rather than accuracy alone.
We can balance our data by using a form of oversampling called SMOTE. SMOTE looks at the minority class, positives in our case, and creates new, synthetic training examples. Read more about the algorithm here.
Note: We have to make sure we only oversample the train data so we don't leak any information to the test set.
Let's perform SMOTE with the imblearn library:
The classes are now balanced for the train set. We can move onto training a Naive Bayes model.
For our first algorithm, we're going to use the extremely fast and versatile Naive Bayes model.
Let's instantiate one from sklearn and fit it to our training data:
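A minimal sketch of the fit, using toy headlines (the ~92% fit score mentioned below comes from the article's real dataset, not this toy example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the real (balanced) training set.
X_train = ["good happy win", "bad sad loss", "great victory", "terrible defeat"]
y_train = [1, -1, 1, -1]

vect = CountVectorizer(binary=True)
X_train_vect = vect.fit_transform(X_train)

nb = MultinomialNB()
nb.fit(X_train_vect, y_train)

# Note: this is the fit (training) score, not test accuracy.
print(nb.score(X_train_vect, y_train))
```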
Naive Bayes has successfully fit all of our training data and is ready to make predictions. You'll notice that we have a score of ~92%. This is the fit score, and not the actual accuracy score. You'll see next that we need to use our test set in order to get a good estimate of accuracy.
Let's vectorize the test set, then use that test set to predict if each test headline is either positive or negative. Since we're avoiding any data leakage, we are only transforming, not refitting. And we won't be using SMOTE to oversample either.
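A sketch of the prediction step on toy data, keeping to transform-only for the test set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy train/test data standing in for the real headlines.
X_train = ["good happy win", "bad sad loss"]
y_train = [1, -1]
X_test = ["happy win"]

vect = CountVectorizer(binary=True)
X_train_vect = vect.fit_transform(X_train)

# Only transform the test set; never refit the vectorizer on it.
X_test_vect = vect.transform(X_test)

nb = MultinomialNB().fit(X_train_vect, y_train)
y_pred = nb.predict(X_test_vect)
print(y_pred)
```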
y_pred now contains a prediction for every row of the test set. With this prediction result, we can pass it into an sklearn metric with the true labels to get an accuracy score, F1 score, and generate a confusion matrix:
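The metric calls look like the following; the labels and predictions here are made up, so the numbers differ from the article's real results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions.
y_test = [-1, -1, 1, 1, -1, 1]
y_pred = [-1, 1, 1, 1, -1, -1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```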
We can see that our model has predicted the sentiment of headlines with a 75% accuracy, but looking at the confusion matrix we can see it's not doing that great of a job classifying.
For a breakdown of the confusion matrix, we have:
- 116 predicted negative (-1), and was negative (-1). True Negative.
- 71 predicted positive (+1), and was positive (+1). True Positive.
- 23 predicted negative (-1), but was positive (+1). False Negative.
- 41 predicted positive (+1), but was negative (-1). False Positive.
So our classifier is getting a lot of the negatives right, but there's a high number of false predictions. We'll see if we can improve these metrics with other classifiers below.
Let's now utilize cross-validation, where we split the same data into training and testing sets 10 different times at different positions.
Right now, we are set up with the usual 80% of the data as training and 20% as test. The accuracy of predictions on a single test set doesn't say much about generalization. To get better insight into our classifier's generalization capabilities, there are two different techniques we can use:
1) K-fold cross-validation: The examples are randomly partitioned into k equal-sized subsets (usually 10). Of the k subsets, a single subset is used for testing the model and the remaining k−1 subsets are used as training data. The process is then repeated k times, so that each subset is used exactly once as the test set. Finally, the average of the k runs is computed. The advantage of this method is that every example is used in both a train and a test set.
2) Monte Carlo cross-validation: The dataset is randomly split into train and test data, the model is run, and the results are averaged. The advantage of this method is that the proportion of the train/test split is not dependent on the number of iterations, which is useful for very large datasets. The disadvantage is that, if you don't run enough iterations, some examples may never be selected for the test subset, whereas others may be selected more than once.
For an even better explanation of the differences between these two methods, check out this answer: https://stats.stackexchange.com/a/60967
The relevant class from the sklearn library is `ShuffleSplit`. It performs a shuffle first and then a split of the data into train/test. Since it's an iterator, it performs a new random shuffle and split on each iteration. This is an example of the Monte Carlo method mentioned above. Normally we could just use `sklearn.model_selection.cross_val_score`, which automatically calculates a score for each fold, but we're going to show the manual splitting with `ShuffleSplit`.
Also, if you're familiar with `cross_val_score`, you'll notice that `ShuffleSplit` works differently: the `n_splits` parameter in `ShuffleSplit` is the number of times to randomize the data and then split it 80/20, whereas the `cv` parameter in `cross_val_score` is the number of folds. By using a large `n_splits` we can get a good approximation of the true performance on larger datasets, but the results are harder to plot.
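Putting `ShuffleSplit` to work, a manual cross-validation loop might look like this (a separable toy corpus stands in for the real headlines, so the scores here are trivially perfect):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the real headlines.
headlines = ["good happy win"] * 30 + ["bad sad loss"] * 30
labels = np.array([1] * 30 + [-1] * 30)

accs, f1s = [], []
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
for train_idx, test_idx in ss.split(headlines):
    X_train = [headlines[i] for i in train_idx]
    X_test = [headlines[i] for i in test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]

    # Refit the vectorizer on each fold's training portion only.
    vect = CountVectorizer(binary=True)
    X_train_vect = vect.fit_transform(X_train)
    X_test_vect = vect.transform(X_test)

    nb = MultinomialNB().fit(X_train_vect, y_train)
    y_pred = nb.predict(X_test_vect)

    accs.append((y_pred == y_test).mean())
    f1s.append(f1_score(y_test, y_pred))

print(np.mean(accs), np.mean(f1s))
```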
Looks like the average accuracy and F1 score are both similar to what we saw on a single fold above.
Let's Plot our Results
The F1 score fluctuates by more than 15 points between some runs, which could be remedied with a larger dataset. Let's see how other algorithms do.
Other Classification Algorithms in scikit-learn
As you can see Naive Bayes performed pretty well, so let’s experiment with other classifiers.
We'll use the same shuffle splitting as before, but now we'll run several types of models in each loop:
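A condensed sketch of one such loop body on a single split. The exact model list is an illustrative assumption; the article explicitly compares Multinomial Naive Bayes, Bernoulli Naive Bayes, and Random Forest, while the others are plausible additions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Toy data standing in for one train/test fold.
X_train = ["good happy win", "bad sad loss", "great victory", "terrible defeat"]
y_train = [1, -1, 1, -1]
X_test = ["happy victory", "sad defeat"]
y_test = [1, -1]

vect = CountVectorizer(binary=True)
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

# Fit each candidate model and record its accuracy on the fold.
models = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_vect, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_vect))

print(scores)
```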
We now have a bunch of accuracy scores, f1 scores, and confusion matrices stored for each model. Let's average these together to get average scores across models and folds:
We've gotten some pretty decent results, but overall it looks like we need more data to be sure which one performs the best.
Since we're only running metrics on a test set size of about 300 examples, a 0.5% difference in accuracy would mean only ~2 more examples are classified correctly versus the other model(s). If we had a test set of 10,000, a 0.5% difference in accuracy would equal 50 more correctly classified headlines, which is much more reassuring.
The difference between Random Forest and Multinomial Naive Bayes is quite clear, but the difference between Multinomial and Bernoulli Naive Bayes isn't. To compare these two further, we need more data.
Now that we've evaluated each classifier individually, let's see if ensembling them helps improve our metrics.
We're going to use sklearn's `VotingClassifier`, which defaults to majority-rule voting.
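A minimal sketch of hard (majority) voting over a few of the models, again on toy data; the particular estimator trio is an illustrative assumption:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the real vectorized headlines.
X_train = ["good happy win", "bad sad loss", "great victory", "terrible defeat"]
y_train = [1, -1, 1, -1]

vect = CountVectorizer(binary=True)
X_train_vect = vect.fit_transform(X_train)

# Majority ("hard") voting across a few of the classifiers from above.
voting = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=1)),
    ],
    voting="hard",
)
voting.fit(X_train_vect, y_train)

X_test_vect = vect.transform(["happy victory"])
pred = voting.predict(X_test_vect)
print(pred)
```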
Although our majority-vote classifier performed well, it didn't differ much from the results we got from Multinomial Naive Bayes, which might be surprising. Surely mashing a bunch of models together would give better results? This lack of difference in performance shows that there are still a lot of areas that need to be explored. For example:
- How more data affects performance (best place to start due to our small dataset)
- Grid searching different parameters for each model
- Debugging the ensemble by looking at model correlations
- Trying different styles of bagging, boosting, and stacking
Final Words and Where To Go From Here
So far we've
- Mined data from Reddit's /r/politics
- Obtained sentiment scores for headlines
- Vectorized the data
- Run the data through several types of models
- Ensembled models together
Unfortunately, there isn't an obvious winning model. There are a couple we've seen that definitely perform poorly, and a few that hover around the same accuracy. Additionally, the confusion matrices show that roughly half of the positive headlines are being misclassified, so there's a lot more work to be done.
Now that you've seen how this pipeline works, there's a lot of room for improvement in the architecture of the code and the modeling. I encourage you to try all of this out in the provided notebook. See what other subreddits you can tap into for sentiment, like stocks, companies, products, etc. There's a lot of valuable data to be had!
Help us make this article and series better
If you're interested in the expansion of this article and series into some of these areas of exploration, drop a comment below and we'll add it to the content pipeline.
Thanks for reading!