Data Science
Sentiment Analysis on Reddit Tech News with Python
A quick guide to sentiment analysis with NLTK on the subreddit r/technews.
Sentiment Analysis is the process of determining whether a piece of text is considered to be positive, negative, or neutral.
It’s an application of Natural Language Processing that has tons of use cases.
As stated in Wikipedia:
Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Imagine you’re a business owner, and you have over 10,000 product reviews for your product. You want to know what your customers think about your product, but you don’t have the time to sift through them one by one.
With sentiment analysis, you can automate that process or even have real-time monitoring to deal with feedback swiftly.
Below is an example of sentiment analysis in action on product reviews.
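As a quick illustration, here's roughly what that looks like with NLTK's VADER analyzer (which we'll set up properly later in the article) on a couple of made-up reviews:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon that powers VADER
sid = SentimentIntensityAnalyzer()

# two hypothetical product reviews
reviews = [
    "Absolutely love this phone, the battery lasts forever!",
    "Terrible build quality, it broke after two days.",
]
for review in reviews:
    print(sid.polarity_scores(review))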
To showcase how you can perform sentiment analysis in Python, I will use the PRAW library to interact with the Reddit API and grab posts from the subreddit r/technews.
Then, I'll use the NLTK library, specifically the VADER sentiment analyzer, to perform sentiment analysis on the post titles.
As always, here’s where you can find the code for this article:
This post was inspired by the article “Sentiment Analysis on Reddit News Headlines with Python’s Natural Language Toolkit (NLTK)” on learndatasci.com.
Create a Reddit application
The first step is to create a Reddit app. To do so, you would first need a Reddit account. If you don’t have one, you can register one here.
After you’re logged in, head over to reddit.com/prefs/apps, and you will see this interface.
There are 3 essential things you need to do:
1. select the script option
2. name: your_reddit_username
3. redirect url: http://localhost
After that, you can hit create app, and in the upper left corner, you will see something like this.
From the above image, what you want to note down are the client_id and client_secret, which you'll use to build a Reddit client.
Now that you have the credentials, we can move on to the code!
Load Libraries
First things first, we import all the necessary libraries for this project.
import pandas as pd
import numpy as np
# misc
import datetime as dt
from pprint import pprint
from itertools import chain
# reddit crawler
import praw
# sentiment analysis
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, RegexpTokenizer # tokenize words
from nltk.corpus import stopwords
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 8) # default plot size
import seaborn as sns
sns.set(style='whitegrid', palette='Dark2')
from wordcloud import WordCloud
- pprint — a data pretty printer that outputs data structures in a cleaner format
- itertools — iterators for efficient looping; one of them is chain, which I used to chain together multiple lists into a single list
- NLTK — Natural Language Toolkit, an open-source Python library for NLP, containing a set of text processing libraries for classification, tokenization, stemming, and tagging
- PRAW — the Python Reddit API Wrapper, which allows you to interact with the Reddit API using Python
Downloading NLTK’s databases
nltk.download('vader_lexicon') # get lexicons data
nltk.download('punkt') # for tokenizer
nltk.download('stopwords')
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
nltk.download() is used to download a particular dataset/model. For this article, there are three things to download:
- vader_lexicon — a dataset of lexicons containing the sentiments of specific texts, which powers the VADER sentiment analysis
- punkt — pre-trained models that help us tokenize sentences
- stopwords — a dataset of common stop words in English
With that, we can set up the client.
Setting up Reddit client
r = praw.Reddit(user_agent='your_user_name',
client_id='your_client_id',
client_secret='your_client_secret',
check_for_async=False)
With the credentials you generated earlier, you can pass in your Reddit username as the user_agent and fill in the rest as shown. Note that check_for_async is set to False just so that it won't generate warnings later on.
Selecting subreddit and sorting type
subreddit = r.subreddit('technews')
news = [*subreddit.top(limit=None)] # top posts all time
print(len(news))
967
As mentioned in the subtitle of this article, we'll be scraping the subreddit r/technews, but you can choose any subreddit you want to analyze; just replace 'technews' with the subreddit name of your choosing.
Here I'm getting the top posts of all time, and I set the limit to None to get the maximum number of posts possible (the limit is 1,000 posts).
You can find more options, such as sorting by new, hot, rising, etc., in PRAW’s quick start guide.
Notice the * symbol; this is known as the star expression, and it has the functionality to unpack iterables. In this case, it unpacks the generator produced by the call into a list.
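Here's a tiny, self-contained sketch of that behaviour, with a plain generator standing in for the PRAW listing:
def top_posts():
    # a stand-in generator, like the one subreddit.top() returns
    yield from ["post one", "post two", "post three"]

posts = [*top_posts()]  # the star expression unpacks the generator into a list
print(posts)            # ['post one', 'post two', 'post three']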
Printing the length tells us we obtained a total of 967 posts.
news0 = news[0]
# pprint(vars(news0))
print(news0.title) # headline
print(news0.score) # upvotes
print(news0.created) # UNIX timestamps
print(dt.datetime.fromtimestamp(news0.created)) # date and time
print(news0.num_comments) # no. of comments
print(news0.upvote_ratio) # upvote / total votes
print(news0.total_awards_received) # no. of awards given
Amazon VP Resigns, Calls Company ‘Chickenshit’ for Firing Protesting Workers
56845
1588604851.0
2020-05-04 15:07:31
1747
0.94
10
Grabbing the first post we scraped by indexing 0, you can see the various kinds of information you can get from a post — the headline, number of upvotes, date and time, number of comments, upvote ratio, and number of awards received.
You can run vars on the first post object to see all the information contained within a single post (warning: the output is huge).
For this article, we only need the title, so what we’ll do is extract the title for each post and dump it into a list.
With this list of headlines, we can now form a Pandas data frame.
# create a list of the title of each post
title = [post.title for post in news]
news = pd.DataFrame({
"title": title,
})
news.head()
|   | title |
|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… |
| 1 | Robinhood plummets back down to a one-star rat… |
| 2 | Twitter hides Trump tweet attacking Supreme Co… |
| 3 | Parler CEO says even his lawyers are abandonin… |
| 4 | Trump blocked by Twitter and Facebook |
Going to the subreddit on Reddit, you can see we grabbed the post titles!
With over 900 post titles in a data frame, it’s time for some sentiment analysis!
Sentiment Analysis with VADER
What is VADER?
According to their GitHub:
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
In other words, it's a pre-trained sentiment analysis model for text. This model relies on the vader_lexicon dataset we downloaded earlier, which maps lexical features to sentiment scores.
When given a string of words, VADER returns a dictionary containing the four scores:
- neg — negative
- neu — neutral
- pos — positive
- compound — a normalization of the three scores above
Below you see examples of VADER in action.
sid = SentimentIntensityAnalyzer()
pos_text = "Vader is awesome"
cap_pos_text = "Vader is AWESOME!" # capitalization and ! increase the effect
neg_text = "Vader is bad"
print(sid.polarity_scores(pos_text))
print(sid.polarity_scores(cap_pos_text))
print(sid.polarity_scores(neg_text))
{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}
{'neg': 0.0, 'neu': 0.281, 'pos': 0.719, 'compound': 0.729}
{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'compound': -0.5423}
Notice that the words ‘awesome’ and ‘bad’ skew the scores towards positive and negative polarity, respectively.
The intensity of emotion is also considered: capitalizing the word ‘awesome’ and adding an exclamation mark increases the positive score.
You can view more examples on their GitHub.
Now that you know a little about what VADER is and what it can do, let's apply it to our data frame.
res = [*news['title'].apply(sid.polarity_scores)]
pprint(res[:3])
[{'compound': -0.7003, 'neg': 0.493, 'neu': 0.395, 'pos': 0.112},
{'compound': 0.34, 'neg': 0.0, 'neu': 0.789, 'pos': 0.211},
{'compound': -0.0258, 'neg': 0.288, 'neu': 0.491, 'pos': 0.221}]
With the scores calculated in dictionaries, we create a data frame using from_records
and then concatenate it to our data frame on an inner join.
sentiment_df = pd.DataFrame.from_records(res)
news = pd.concat([news, sentiment_df], axis=1, join='inner')
news.head()
|   | title | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… | 0.493 | 0.395 | 0.112 | -0.7003 |
| 1 | Robinhood plummets back down to a one-star rat… | 0.000 | 0.789 | 0.211 | 0.3400 |
| 2 | Twitter hides Trump tweet attacking Supreme Co… | 0.288 | 0.491 | 0.221 | -0.0258 |
| 3 | Parler CEO says even his lawyers are abandonin… | 0.245 | 0.755 | 0.000 | -0.3818 |
| 4 | Trump blocked by Twitter and Facebook | 0.296 | 0.704 | 0.000 | -0.2732 |
Now that we have the scores, the next step is to choose a threshold to label the text as positive, negative, or neutral.
Choosing the threshold
The VADER GitHub README tells us that the typical threshold is 0.05. But following this article, which also did sentiment analysis on news headlines, I'll use the value 0.2.
THRESHOLD = 0.2
conditions = [
(news['compound'] <= -THRESHOLD),
(news['compound'] > -THRESHOLD) & (news['compound'] < THRESHOLD),
(news['compound'] >= THRESHOLD),
]
values = ["neg", "neu", "pos"]
news['label'] = np.select(conditions, values)
news.head()
|   | title | neg | neu | pos | compound | label |
|---|---|---|---|---|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… | 0.493 | 0.395 | 0.112 | -0.7003 | neg |
| 1 | Robinhood plummets back down to a one-star rat… | 0.000 | 0.789 | 0.211 | 0.3400 | pos |
| 2 | Twitter hides Trump tweet attacking Supreme Co… | 0.288 | 0.491 | 0.221 | -0.0258 | neu |
| 3 | Parler CEO says even his lawyers are abandonin… | 0.245 | 0.755 | 0.000 | -0.3818 | neg |
| 4 | Trump blocked by Twitter and Facebook | 0.296 | 0.704 | 0.000 | -0.2732 | neg |
VADER on individual words
If you’re curious about how VADER ended up labeling the sentiment of the titles, here’s a broken-down version that shows which word it categorizes as positive, neutral, and negative.
sentence0 = news.title.iloc[0]
print(sentence0)
words0 = news.title.iloc[0].split()
print(words0)
pos_list, neg_list, neu_list = [], [], []
for word in words0:
if (sid.polarity_scores(word)['compound']) >= THRESHOLD:
pos_list.append(word)
elif (sid.polarity_scores(word)['compound']) <= -THRESHOLD:
neg_list.append(word)
else:
neu_list.append(word)
print('\nPositive:',pos_list)
print('Neutral:',neu_list)
print('Negative:',neg_list)
score = sid.polarity_scores(sentence0)
print(f"\nThis sentence is {round(score['neg'] * 100, 2)}% negative")
print(f"This sentence is {round(score['neu'] * 100, 2)}% neutral")
print(f"This sentence is {round(score['pos'] * 100, 2)}% positive")
print(f"The compound value : {score['compound']} <= {-THRESHOLD}")
print(f"\nThis sentence is NEGATIVE")
# source https://stackoverflow.com/a/51515048/11386747
Amazon VP Resigns, Calls Company ‘Chickenshit’ for Firing Protesting Workers
['Amazon', 'VP', 'Resigns,', 'Calls', 'Company', '‘Chickenshit’', 'for', 'Firing', 'Protesting', 'Workers']
Positive: []
Neutral: ['Amazon', 'VP', 'Calls', 'Company', '‘Chickenshit’', 'for', 'Workers']
Negative: ['Resigns,', 'Firing', 'Protesting']
This sentence is 49.3% negative
This sentence is 39.5% neutral
This sentence is 11.2% positive
The compound value : -0.7003 <= -0.2
This sentence is NEGATIVE
Notice there were no positive words in this sentence, and there were three negative words. Since there are more negatives than positives, it makes sense that this was labeled as negative.
If you want to go a step further and learn how the compound score is calculated, check out this StackOverflow post.
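In short, VADER sums the valence of each word (with adjustments for punctuation, capitalization, and so on) and squashes that sum into the range [-1, 1]. Here's a minimal sketch of that normalization, with alpha = 15 as in the library's source; the summed valence below is just a made-up number:
import math

def normalize(score_sum, alpha=15):
    # squash the summed word valences into the range [-1, 1]
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

print(round(normalize(-3.4), 4))  # -0.6597, i.e. a fairly negative compound score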
Now that we have our labels, we can do a quick value count on each label.
news.label.value_counts()
neu 475
neg 279
pos 212
Name: label, dtype: int64
sns.histplot(news.label);
With our selected threshold, we have mostly neutral titles and more negative titles than positive titles.
Are the labels accurate?
def news_title_output(df, label):
res = df[df['label'] == label].title.values
print(f'{"=" * 20}')
print("\n".join(title for title in res))
# randomly sample
news_sub = news.groupby('label').sample(n = 5, random_state = 7)
print("Positive news")
news_title_output(news_sub, "pos")
print("\nNeutral news")
news_title_output(news_sub, "neu")
print("\nNegative news")
news_title_output(news_sub, "neg")
Positive news
====================
Man who can't remember his password to unlock $240M in Bitcoins says he has 'made peace' with the loss and moving on
Artificial Intelligence Finds A Strong New Antibiotic For the Very First Time
Starship SN10 has ‘good chance of flying this week’, Elon Musk says
Rural UK users testing Elon Musk’s satellite broadband reveal ‘amazing’ improvement
DoorDash launches gifting feature in time for the holidays
Neutral news
====================
New privacy bill would end law enforcement practice of buying data from brokers
Pentagon officially releases UFO videos
Toshiba Wraps Up Its Laptop Business
Coronavirus is slowing LCD production, and TV and monitor prices are expected to climb as a result
Locally Run ISPs Offer the Fastest Broadband in America
Negative news
====================
Firefox 'Total Cookie Protection' Tries to Block Even More Online Tracking
Apple blocks Facebook update that called out 30-percent App Store ‘tax’
Mother finds fake Facebook ad claiming her family died from coronavirus
Apple products worth £5m stolen from lorry on M1
Segway will stop making its iconic self-balancing scooter
Taking random samples of each label and using a custom function that outputs the news titles, we can get a sense of how well our threshold performs in categorizing news as positive, neutral, and negative.
From the output, the labels seem to be pretty accurate.
A side tangent: sentiment analysis usually makes more sense when applied to a "target subject", such as reviews of a book or comments on a YouTube video. News headlines, on the other hand, are meant to be descriptive and neutral, so sentiment analysis on them can be misleading.
Let’s now move on to tokenization.
Tokenization
What is it?
Tokenization is the process of breaking down a piece of text into smaller components known as tokens. A token can be a word, a part of a word, or any character such as punctuation, a symbol, or even an emoji 🤯.
Why do we do it?
Tokenization builds the foundation for any NLP tasks, as these tokens provide context and help computers interpret the meaning of the text. Different kinds of tokens can serve different purposes, but the main idea is to turn them into a usable form for computers.
You can use many different tools to tokenize strings, but NLTK already has a set of tokenizers we can utilize.
NLTK tokenizers
NLTK has many built-in tokenizers that you can use for specific purposes.
A few notable tokenizers are:
- word_tokenize — splits a string by punctuation other than periods
- sent_tokenize — splits a string into sentences
- RegexpTokenizer — splits a string based on a regular expression
- more in their documentation
text = "Let's see how the NLTK tokenizer works!"
# using word tokenizer
print(nltk.word_tokenize(text))
# using regexp tokenizer
tk = nltk.tokenize.RegexpTokenizer(r'\s+', gaps=True) # split on whitespace
print(tk.tokenize(text))
tk = nltk.tokenize.RegexpTokenizer(r'\w+') # remove punct
print(tk.tokenize(text))
['Let', "'s", 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works', '!']
["Let's", 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works!']
['Let', 's', 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works']
Above, you can see an example of a text being split by the tokenizers.
Notice how each of the tokenizers works differently based on how it splits.
The first one splits by punctuation, breaking the word "Let's" into "Let" and "'s", whereas the second one, which splits by whitespace, keeps the word "Let's" intact. As for the last one, splitting on word characters results in the punctuation being removed.
One thing that comes up when you learn about tokenization is stop words. They’re basically the most common words in the English language, and we remove them so we can focus on more important features (words) instead.
stop_words = stopwords.words('english')
print(len(stop_words))
print(stop_words[:10])
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
By downloading the ‘stopwords’ dataset with NLTK earlier on, we gained access to a total of 179 of them, which we'll use to filter them out of our text.
Custom tokenize
In some cases, you would also do further preprocessing to get the result that you want.
def custom_tokenize(text):
# remove single quote and dashes
text = text.replace("'", "").replace("-", "").lower()
# split on words only
tk = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tk.tokenize(text)
# remove stop words
words = [w for w in tokens if not w in stop_words]
return words
print(custom_tokenize(text))
['lets', 'see', 'nltk', 'tokenizer', 'works']
In this function, I remove the single quote so that a word like "Let's" becomes "Lets", and I also remove hyphens so that "covid-19" becomes "covid19" instead of being split into "covid" and "19".
Note: I removed the single quote only because I'm using the tokens for visualization. If you decide to use them to build a model, doing this would destroy the meaning behind the original words, e.g. turning "it's" into "its", which are two different things.
The text was also lowercased, and stop words are filtered out with a list comprehension.
def tokens_2_words(df, label):
# subset titles based on label
titles = df[df['label'] == label].title
# apply our custom tokenize function to each title
tokens = titles.apply(custom_tokenize)
# join nested lists into a single list
words = list(chain.from_iterable(tokens))
return words
pos_words = tokens_2_words(news, 'pos')
neg_words = tokens_2_words(news, 'neg')
Using Pandas' nifty apply function, we can apply our custom function to each title in our data frame.
The tokens object is a nested list (multiple lists within a list). Since we want all the words in a single list, chain.from_iterable from the itertools library does exactly that.
The end result is two lists containing the words from titles that were labelled as positive and negative.
Visualize tokens
Top 20 words
With our list of words, we can utilize NLTK’s built-in function FreqDist
as a counter for the words within our list, and most_common
to return the top words based on the count.
pos_freq = nltk.FreqDist(pos_words)
pos_freq.most_common(20)
[('apple', 13),
('google', 12),
('new', 12),
('musk', 11),
('tech', 11),
('first', 11),
('free', 10),
('elon', 10),
('could', 10),
('tesla', 10),
('000', 9),
('twitter', 9),
('million', 8),
('facebook', 8),
('spacex', 8),
('says', 8),
('help', 8),
('support', 7),
('years', 7),
('5', 7)]
From our list of positive words, we see that "apple" and "google" are the top words. Notice that the tokens 5 and 000 are also present; they can be filtered out with more preprocessing if you want.
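For instance, a small, hypothetical extra step that drops purely numeric tokens before counting:
# drop tokens that are purely digits (a tweak not in the original custom_tokenize)
pos_words_clean = [w for w in pos_words if not w.isdigit()]
pos_freq = nltk.FreqDist(pos_words_clean)
print(pos_freq.most_common(20))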
Usually, when visualizing tokens, a better option is to use word clouds, as the size of the words correlates with their count, so you have a better idea of which words are important.
Word clouds
Here is the word cloud generated for the positive and negative words list.
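The plotting code isn't shown above; a minimal sketch using the WordCloud class imported earlier might look like this (the size and color settings are just illustrative choices):
# build a word cloud from the positive-word list
pos_cloud = WordCloud(width=800, height=400, background_color='white')
pos_cloud.generate(" ".join(pos_words))

plt.imshow(pos_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# the same steps can be repeated with neg_words for the negative word cloud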
We can imagine what positive news was related to these words from the positive word cloud.
The words "Apple" and "Google" could relate to good deeds that the big tech companies are doing.
We also see the words "Elon Musk", "Tesla", and "SpaceX" amongst the top positive words, most likely reflecting technological advancements or perhaps Elon Musk's philanthropic work.
To find out the exact news, I wrote up a function to extract the titles.
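That function isn't listed in the article; here is a minimal sketch of what such a helper might look like (the name and arguments match the calls below, but the body is an assumption):
def extract_sentence_from_word(df, word, label):
    # keep titles with the given label that contain every word of the search phrase
    titles = df[df['label'] == label].title
    for title in titles:
        if all(w.lower() in title.lower() for w in word.split()):
            print(title)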
When given the words Elon Musk, these titles were extracted.
extract_sentence_from_word(news, "Elon Musk", "pos")
Elon Musk Delivers 1,000 Ventilators to California Hospitals to Treat COVID-19 Patients
Elon Musk’s Australian Battery Farm Has Saved $116 Million AUD In Two Years
Rural UK users testing Elon Musk’s satellite broadband reveal ‘amazing’ improvement
Now let’s have a look at the negative words.
At first glance, we can tell the big tech companies are more prominent among the negative words, along with the words "ban", "internet", "data", and "Trump". This suggests the news about Donald Trump being banned from social media platforms.
In this word cloud, negative words are also more evident, as words like "fake", "misinformation", "lawsuit", "hacked", "attack", "blocking", etc. pop up.
Extracting the titles containing the word "Facebook", and sure enough, they were about him being banned.
extract_sentence_from_word(news, "facebook", "neg")
Trump blocked by Twitter and Facebook
Facebook is finally banning anti-vaxxer misinformation
Facebook and Instagram make Trump's ban indefinite
Notice that the second title — arguably positive news — is labeled as negative because of the words "banning" and "misinformation", which shows you a limitation of VADER.
There you go! You scraped Reddit tech news headlines, performed sentiment analysis on them, tokenized the titles, and generated word clouds!
This was just a glimpse into what NLTK can achieve in terms of NLP, and there are definitely improvements you can make to the sentiment analysis to label the posts more accurately.
If you want to know more, there are a few articles below for you to dive deeper into this topic!
That's all for this article, and I hope you learned something new from it!
Thanks for reading 😉 !
Links
Further readings
- Sentiment Analysis: A Definitive Guide
- Simplifying Sentiment Analysis using VADER in Python (on Social Media Text) by Parul Pandey
- Tokenization in NLP — Types, Challenges, Examples, Tools
- How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!