Data Science
Sentiment Analysis on Reddit Tech News with Python
A quick guide to sentiment analysis with NLTK on the subreddit r/technews.
Sentiment Analysis is the process of determining whether a piece of text is considered to be positive, negative, or neutral.
It’s an application of Natural Language Processing that has tons of use cases.
As stated in Wikipedia:
Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Imagine you’re a business owner, and you have over 10,000 product reviews for your product. You want to know what your customers think about your product, but you don’t have the time to sift through them one by one.
With sentiment analysis, you can automate that process or even have real-time monitoring to deal with feedback swiftly.
Below is an example of sentiment analysis in action on product reviews.
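As a quick illustration, here's roughly what that looks like with NLTK's VADER analyzer (which we'll set up properly later in the article) on a couple of made-up reviews:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon that powers VADER
sid = SentimentIntensityAnalyzer()

# two hypothetical product reviews
reviews = [
    "Absolutely love this phone, the battery lasts forever!",
    "Terrible build quality, it broke after two days.",
]
for review in reviews:
    print(sid.polarity_scores(review))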
To showcase how you can perform sentiment analysis in Python, I will use the PRAW library to interact with the Reddit API and grab posts from the subreddit r/technews.
Then, I'll use the NLTK library, specifically the VADER sentiment analyzer, to perform sentiment analysis on the post titles.
As always, here’s where you can find the code for this article:
This post was inspired by the article “Sentiment Analysis on Reddit News Headlines with Python’s Natural Language Toolkit (NLTK)” on learndatasci.com.
Create a Reddit application
The first step is to create a Reddit app. To do so, you would first need a Reddit account. If you don’t have one, you can register one here.
After you’re logged in, head over to reddit.com/prefs/apps, and you will see this interface.
There are 3 essential things you need to do:
1. select the script option
2. name: your_reddit_username
3. redirect url: http://localhost
After that, you can hit create app, and in the upper left corner, you will see something like this.
From the above image, what you want to note down are the client_id and client_secret, which you'll use to build a Reddit client.
Now that you have the credentials, we can move on to the code!
Load Libraries
First things first, we import all the necessary libraries for this project.
import pandas as pd
import numpy as np
# misc
import datetime as dt
from pprint import pprint
from itertools import chain
# reddit crawler
import praw
# sentiment analysis
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, RegexpTokenizer # tokenize words
from nltk.corpus import stopwords
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 8) # default plot size
import seaborn as sns
sns.set(style='whitegrid', palette='Dark2')
from wordcloud import WordCloud
- pprint — a data pretty printer that outputs data structures in a cleaner format
- itertools — iterators for efficient looping; one of them is chain, which I used to chain together multiple lists into a single list
- NLTK — Natural Language Toolkit, an open-source Python library for NLP, containing a set of text processing libraries for classification, tokenization, stemming, and tagging
- PRAW — the Python Reddit API Wrapper, which allows you to interact with the Reddit API using Python
Downloading NLTK’s databases
nltk.download('vader_lexicon') # get lexicons data
nltk.download('punkt') # for tokenizer
nltk.download('stopwords')
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
nltk.download() is used to download a particular dataset/model. For this article, there are three things to download:
- vader_lexicon — a dataset of lexicons containing the sentiments of specific texts, which powers the VADER sentiment analysis
- punkt — pre-trained models that help us tokenize sentences
- stopwords — a dataset of common stop words in English
With that, we can set up the client.
Setting up Reddit client
r = praw.Reddit(user_agent='your_user_name',
client_id='your_client_id',
client_secret='your_client_secret',
check_for_async=False)
With the credentials you generated earlier, you can pass in your Reddit username as the user_agent and fill in the rest as shown. Note that check_for_async is set to False just so that it won't generate warnings later on.
Selecting subreddit and sorting type
subreddit = r.subreddit('technews')
news = [*subreddit.top(limit=None)] # top posts all time
print(len(news))
967
As mentioned in the subtitle of this article, we'll be scraping the subreddit r/technews, but you can choose any subreddit you want to analyze; just replace 'technews' with the subreddit name of your choosing.
Here I'm getting the top posts of all time, and I set the limit to None to get the maximum number of posts possible (the limit is 1,000 posts).
You can find more options, such as sorting by new, hot, rising, etc., in PRAW’s quick start guide.
Notice the * symbol; this is known as the star expression, and it has the functionality to unpack iterables. In this case, it unpacks the generator produced by the call into a list.
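Here's a tiny, self-contained sketch of that behaviour, with a plain generator standing in for the PRAW listing:
def top_posts():
    # a stand-in generator, like the one subreddit.top() returns
    yield from ["post one", "post two", "post three"]

posts = [*top_posts()]  # the star expression unpacks the generator into a list
print(posts)            # ['post one', 'post two', 'post three']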
Printing the length tells us we obtained a total of 967 posts.
news0 = news[0]
# pprint(vars(news0))
print(news0.title) # headline
print(news0.score) # upvotes
print(news0.created) # UNIX timestamps
print(dt.datetime.fromtimestamp(news0.created)) # date and time
print(news0.num_comments) # no. of comments
print(news0.upvote_ratio) # upvote / total votes
print(news0.total_awards_received) # no. of awards given
Amazon VP Resigns, Calls Company ‘Chickenshit’ for Firing Protesting Workers
56845
1588604851.0
2020-05-04 15:07:31
1747
0.94
10
Grabbing the first post we scraped by indexing 0, you can see the various kinds of information you can get from a post — the headline, number of upvotes, date and time, number of comments, upvote ratio, and number of awards received.
You can run vars on the first post object to see all the information contained within a single post (warning: the output is huge).
For this article, we only need the title, so what we’ll do is extract the title for each post and dump it into a list.
With this list of headlines, we can now form a Pandas data frame.
# create a list of the title of each post
title = [post.title for post in news]
news = pd.DataFrame({
"title": title,
})
news.head()
|   | title |
|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… |
| 1 | Robinhood plummets back down to a one-star rat… |
| 2 | Twitter hides Trump tweet attacking Supreme Co… |
| 3 | Parler CEO says even his lawyers are abandonin… |
| 4 | Trump blocked by Twitter and Facebook |
Going to the subreddit on Reddit, you can see we grabbed the post titles!
With over 900 post titles in a data frame, it’s time for some sentiment analysis!
Sentiment Analysis with VADER
What is VADER?
According to their GitHub:
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
In other words, it's a pre-trained sentiment analysis model for text. This model relies on the vader_lexicon dataset we downloaded earlier, which maps lexical features to sentiment scores.
When given a string of words, VADER returns a dictionary containing the four scores:
- neg — negative
- neu — neutral
- pos — positive
- compound — a normalization of the three scores above
Below you see examples of VADER in action.
sid = SentimentIntensityAnalyzer()
pos_text = "Vader is awesome"
cap_pos_text = "Vader is AWESOME!" # capitalization and ! increase the effect
neg_text = "Vader is bad"
print(sid.polarity_scores(pos_text))
print(sid.polarity_scores(cap_pos_text))
print(sid.polarity_scores(neg_text))
{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}
{'neg': 0.0, 'neu': 0.281, 'pos': 0.719, 'compound': 0.729}
{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'compound': -0.5423}
Notice that the words ‘awesome’ and ‘bad’ skew the scores towards positive and negative polarity, respectively.
The intensity of emotion is also considered: capitalizing the word ‘awesome’ and adding an exclamation mark increases the positive score.
You can view more examples on their GitHub.
Now that you know a little about what VADER is and what it can do, let's apply it to our data frame.
res = [*news['title'].apply(sid.polarity_scores)]
pprint(res[:3])
[{'compound': -0.7003, 'neg': 0.493, 'neu': 0.395, 'pos': 0.112},
{'compound': 0.34, 'neg': 0.0, 'neu': 0.789, 'pos': 0.211},
{'compound': -0.0258, 'neg': 0.288, 'neu': 0.491, 'pos': 0.221}]
With the scores calculated in dictionaries, we create a data frame using from_records
and then concatenate it to our data frame on an inner join.
sentiment_df = pd.DataFrame.from_records(res)
news = pd.concat([news, sentiment_df], axis=1, join='inner')
news.head()
|   | title | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… | 0.493 | 0.395 | 0.112 | -0.7003 |
| 1 | Robinhood plummets back down to a one-star rat… | 0.000 | 0.789 | 0.211 | 0.3400 |
| 2 | Twitter hides Trump tweet attacking Supreme Co… | 0.288 | 0.491 | 0.221 | -0.0258 |
| 3 | Parler CEO says even his lawyers are abandonin… | 0.245 | 0.755 | 0.000 | -0.3818 |
| 4 | Trump blocked by Twitter and Facebook | 0.296 | 0.704 | 0.000 | -0.2732 |
Now that we have the scores, the next step is to choose a threshold to label the text as positive, negative, or neutral.
Choosing the threshold
The VADER GitHub README tells us that the typical threshold is 0.05. But following this article, which also did sentiment analysis on news headlines, I'll use the value 0.2.
THRESHOLD = 0.2
conditions = [
(news['compound'] <= -THRESHOLD),
(news['compound'] > -THRESHOLD) & (news['compound'] < THRESHOLD),
(news['compound'] >= THRESHOLD),
]
values = ["neg", "neu", "pos"]
news['label'] = np.select(conditions, values)
news.head()
|   | title | neg | neu | pos | compound | label |
|---|---|---|---|---|---|---|
| 0 | Amazon VP Resigns, Calls Company ‘Chickenshit’… | 0.493 | 0.395 | 0.112 | -0.7003 | neg |
| 1 | Robinhood plummets back down to a one-star rat… | 0.000 | 0.789 | 0.211 | 0.3400 | pos |
| 2 | Twitter hides Trump tweet attacking Supreme Co… | 0.288 | 0.491 | 0.221 | -0.0258 | neu |
| 3 | Parler CEO says even his lawyers are abandonin… | 0.245 | 0.755 | 0.000 | -0.3818 | neg |
| 4 | Trump blocked by Twitter and Facebook | 0.296 | 0.704 | 0.000 | -0.2732 | neg |
VADER on individual words
If you’re curious about how VADER ended up labeling the sentiment of the titles, here’s a broken-down version that shows which word it categorizes as positive, neutral, and negative.
sentence0 = news.title.iloc[0]
print(sentence0)
words0 = news.title.iloc[0].split()
print(words0)
pos_list, neg_list, neu_list = [], [], []
for word in words0:
if (sid.polarity_scores(word)['compound']) >= THRESHOLD:
pos_list.append(word)
elif (sid.polarity_scores(word)['compound']) <= -THRESHOLD:
neg_list.append(word)
else:
neu_list.append(word)
print('\nPositive:',pos_list)
print('Neutral:',neu_list)
print('Negative:',neg_list)
score = sid.polarity_scores(sentence0)
print(f"\nThis sentence is {round(score['neg'] * 100, 2)}% negative")
print(f"This sentence is {round(score['neu'] * 100, 2)}% neutral")
print(f"This sentence is {round(score['pos'] * 100, 2)}% positive")
print(f"The compound value : {score['compound']} <= {-THRESHOLD}")
print(f"\nThis sentence is NEGATIVE")
# source https://stackoverflow.com/a/51515048/11386747
Amazon VP Resigns, Calls Company ‘Chickenshit’ for Firing Protesting Workers
['Amazon', 'VP', 'Resigns,', 'Calls', 'Company', '‘Chickenshit’', 'for', 'Firing', 'Protesting', 'Workers']
Positive: []
Neutral: ['Amazon', 'VP', 'Calls', 'Company', '‘Chickenshit’', 'for', 'Workers']
Negative: ['Resigns,', 'Firing', 'Protesting']
This sentence is 49.3% negative
This sentence is 39.5% neutral
This sentence is 11.2% positive
The compound value : -0.7003 <= -0.2
This sentence is NEGATIVE
Notice there were no positive words in this sentence, and there were three negative words. Since there are more negatives than positives, it makes sense that this was labeled as negative.
If you want to go a step further and learn how the compound score is calculated, check out this StackOverflow post.
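In short, VADER sums the valence of each word (with adjustments for punctuation, capitalization, and so on) and squashes that sum into the range [-1, 1]. Here's a minimal sketch of that normalization, with alpha = 15 as in the library's source; the summed valence below is just a made-up number:
import math

def normalize(score_sum, alpha=15):
    # squash the summed word valences into the range [-1, 1]
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

print(round(normalize(-3.4), 4))  # -0.6597, i.e. a fairly negative compound score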
Now that we have our labels, we can do a quick value count on each label.
news.label.value_counts()
neu 475
neg 279
pos 212
Name: label, dtype: int64
sns.histplot(news.label);
With our selected threshold, we have mostly neutral titles and more negative titles than positive titles.
Are the labels accurate?
def news_title_output(df, label):
res = df[df['label'] == label].title.values
print(f'{"=" * 20}')
print("\n".join(title for title in res))
# randomly sample
news_sub = news.groupby('label').sample(n = 5, random_state = 7)
print("Positive news")
news_title_output(news_sub, "pos")
print("\nNeutral news")
news_title_output(news_sub, "neu")
print("\nNegative news")
news_title_output(news_sub, "neg")
Positive news
====================
Man who can't remember his password to unlock $240M in Bitcoins says he has 'made peace' with the loss and moving on
Artificial Intelligence Finds A Strong New Antibiotic For the Very First Time
Starship SN10 has ‘good chance of flying this week’, Elon Musk says
Rural UK users testing Elon Musk’s satellite broadband reveal ‘amazing’ improvement
DoorDash launches gifting feature in time for the holidays
Neutral news
====================
New privacy bill would end law enforcement practice of buying data from brokers
Pentagon officially releases UFO videos
Toshiba Wraps Up Its Laptop Business
Coronavirus is slowing LCD production, and TV and monitor prices are expected to climb as a result
Locally Run ISPs Offer the Fastest Broadband in America
Negative news
====================
Firefox 'Total Cookie Protection' Tries to Block Even More Online Tracking
Apple blocks Facebook update that called out 30-percent App Store ‘tax’
Mother finds fake Facebook ad claiming her family died from coronavirus
Apple products worth £5m stolen from lorry on M1
Segway will stop making its iconic self-balancing scooter
Taking random samples of each label and using a custom function that outputs the news titles, we can get a sense of how well our threshold performs in categorizing news as positive, neutral, and negative.
From the output, the labels seem to be pretty accurate.
A side tangent: sentiment analysis usually makes more sense when applied to a "target subject", such as reviews of a book or comments on a YouTube video. News headlines, on the other hand, are meant to be descriptive and neutral, so sentiment analysis on them can be misleading.
Let’s now move on to tokenization.
Tokenization
What is it?
Tokenization is the process of breaking down a piece of text into smaller components known as tokens. A token can be a word, a part of a word, or any character such as punctuation, a symbol, or even an emoji 🤯.
Why do we do it?
Tokenization builds the foundation for any NLP tasks, as these tokens provide context and help computers interpret the meaning of the text. Different kinds of tokens can serve different purposes, but the main idea is to turn them into a usable form for computers.
You can use many different tools to tokenize strings, but NLTK already has a set of tokenizers we can utilize.
NLTK tokenizers
NLTK has many built-in tokenizers that you can use for specific purposes.
A few notable tokenizers are:
- word_tokenize — splits a string by punctuation other than periods
- sent_tokenize — splits a string into sentences
- RegexpTokenizer — splits a string based on a regular expression
- more in their documentation
text = "Let's see how the NLTK tokenizer works!"
# using word tokenizer
print(nltk.word_tokenize(text))
# using regexp tokenizer
tk = nltk.tokenize.RegexpTokenizer(r'\s+', gaps=True) # split on whitespace
print(tk.tokenize(text))
tk = nltk.tokenize.RegexpTokenizer(r'\w+') # remove punct
print(tk.tokenize(text))
['Let', "'s", 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works', '!']
["Let's", 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works!']
['Let', 's', 'see', 'how', 'the', 'NLTK', 'tokenizer', 'works']
Above, you can see an example of a text being split by the tokenizers.
Notice how each of the tokenizers works differently based on how it splits.
The first one splits by punctuation, breaking the word "Let's" into "Let" and "'s", whereas the second one, which splits by whitespace, keeps the word "Let's" intact. As for the last one, splitting on word characters results in the punctuation being removed.
One thing that comes up when you learn about tokenization is stop words. They’re basically the most common words in the English language, and we remove them so we can focus on more important features (words) instead.
stop_words = stopwords.words('english')
print(len(stop_words))
print(stop_words[:10])
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
By downloading the ‘stopwords’ dataset with NLTK earlier on, we gained access to a total of 179 of them, which we'll use to filter them out of our text.
Custom tokenize
In some cases, you would also do further preprocessing to get the result that you want.
def custom_tokenize(text):
# remove single quote and dashes
text = text.replace("'", "").replace("-", "").lower()
# split on words only
tk = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tk.tokenize(text)
# remove stop words
words = [w for w in tokens if not w in stop_words]
return words
print(custom_tokenize(text))
['lets', 'see', 'nltk', 'tokenizer', 'works']
In this function, I remove the single quote so that a word like "Let's" becomes "Lets", and I also remove hyphens so that "covid-19" becomes "covid19" instead of being split into "covid" and "19".
Note: I removed the single quote only because I'm using the tokens for visualization. If you decide to use them to build a model, doing this would destroy the meaning behind the original words, e.g. turning "it's" into "its", which are two different things.
The text was also lowercased, and stop words are filtered out with a list comprehension.
def tokens_2_words(df, label):
# subset titles based on label
titles = df[df['label'] == label].title
# apply our custom tokenize function to each title
tokens = titles.apply(custom_tokenize)
# join nested lists into a single list
words = list(chain.from_iterable(tokens))
return words
pos_words = tokens_2_words(news, 'pos')
neg_words = tokens_2_words(news, 'neg')
Using Pandas' nifty apply function, we can apply our custom function to each title in our data frame.
The tokens object is a nested list (multiple lists within a list). Since we want all the words in a single list, chain.from_iterable from the itertools library does exactly that.
The end result is two lists containing the words from titles that were labelled as positive and negative.
Visualize tokens
Top 20 words
With our list of words, we can utilize NLTK’s built-in function FreqDist
as a counter for the words within our list, and most_common
to return the top words based on the count.
pos_freq = nltk.FreqDist(pos_words)
pos_freq.most_common(20)
[('apple', 13),
('google', 12),
('new', 12),
('musk', 11),
('tech', 11),
('first', 11),
('free', 10),
('elon', 10),
('could', 10),
('tesla', 10),
('000', 9),
('twitter', 9),
('million', 8),
('facebook', 8),
('spacex', 8),
('says', 8),
('help', 8),
('support', 7),
('years', 7),
('5', 7)]
From our list of positive words, we see that "apple" and "google" are the top words. Notice that the tokens 5 and 000 are also present; they can be filtered out with more preprocessing if you want.
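For instance, a small, hypothetical extra step that drops purely numeric tokens before counting:
# drop tokens that are purely digits (a tweak not in the original custom_tokenize)
pos_words_clean = [w for w in pos_words if not w.isdigit()]
pos_freq = nltk.FreqDist(pos_words_clean)
print(pos_freq.most_common(20))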
Usually, when visualizing tokens, a better option is to use word clouds, as the size of the words correlates with their count, so you have a better idea of which words are important.
Word clouds
Here is the word cloud generated for the positive and negative words list.
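The plotting code isn't shown above; a minimal sketch using the WordCloud class imported earlier might look like this (the size and color settings are just illustrative choices):
# build a word cloud from the positive-word list
pos_cloud = WordCloud(width=800, height=400, background_color='white')
pos_cloud.generate(" ".join(pos_words))

plt.imshow(pos_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# the same steps can be repeated with neg_words for the negative word cloud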
We can imagine what positive news was related to these words from the positive word cloud.
The words "Apple" and "Google" could relate to good deeds that the big tech companies are doing.
We also see the words "Elon Musk", "Tesla", and "SpaceX" amongst the top positive words, most likely reflecting technological advancements or perhaps Elon Musk's philanthropic work.
To find out the exact news, I wrote up a function to extract the titles.
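That function isn't listed in the article; here is a minimal sketch of what such a helper might look like (the name and arguments match the calls below, but the body is an assumption):
def extract_sentence_from_word(df, word, label):
    # keep titles with the given label that contain every word of the search phrase
    titles = df[df['label'] == label].title
    for title in titles:
        if all(w.lower() in title.lower() for w in word.split()):
            print(title)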
When given the words Elon Musk, these titles were extracted.
extract_sentence_from_word(news, "Elon Musk", "pos")
Elon Musk Delivers 1,000 Ventilators to California Hospitals to Treat COVID-19 Patients
Elon Musk’s Australian Battery Farm Has Saved $116 Million AUD In Two Years
Rural UK users testing Elon Musk’s satellite broadband reveal ‘amazing’ improvement
Now let’s have a look at the negative words.
At first glance, we can tell the big tech companies are more prominent among the negative words, along with the words "ban", "internet", "data", and "Trump". This suggests the news about Donald Trump being banned from social media platforms.
In this word cloud, negative words are also more evident, as words like "fake", "misinformation", "lawsuit", "hacked", "attack", "blocking", etc. pop up.
Extracting the titles containing the word "Facebook", and sure enough, they were about him being banned.
extract_sentence_from_word(news, "facebook", "neg")
Trump blocked by Twitter and Facebook
Facebook is finally banning anti-vaxxer misinformation
Facebook and Instagram make Trump's ban indefinite
Notice that the second title — arguably positive news — is labeled as negative because of the words "banning" and "misinformation", which shows you a limitation of VADER.
There you go! You scraped Reddit tech news headlines, performed sentiment analysis on them, tokenized the titles, and generated word clouds!
This was just a glimpse into what NLTK can achieve in terms of NLP, and there are definitely improvements you can make to the sentiment analysis to label the posts more accurately.
If you want to know more, there are a few articles below for you to dive deeper into this topic!
That's all for this article, and I hope you learned something new from it!
Thanks for reading 😉 !
Links
Further readings
- Sentiment Analysis: A Definitive Guide
- Simplifying Sentiment Analysis using VADER in Python (on Social Media Text) by Parul Pandey
- Tokenization in NLP — Types, Challenges, Examples, Tools
- How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!