Build a News Article Summarizer App with Hugging Face 🤗 and Gradio
Building a text summarizer with Newspaper3k, Hugging Face, and Gradio
Have a ton of articles bookmarked to read later but never get to them? Don’t have the time to read long articles?
Don’t you wish you could summarize them all and get the gist of the articles?
With Hugging Face 🤗 , you can!
In this article, I will be using the Newspaper3k library to extract text from article links and summarize them with Hugging Face pre-trained summarization models. To wrap it all up, I created a simple user interface with Gradio so anyone can get summaries of articles just with their URL!
As always, here’s where you can find the code for this article:
Install Dependencies
Before loading the libraries, we install a few dependencies that aren’t installed by default on Google Colab.
!pip install newspaper3k transformers gradio --quiet
Load libraries
- newspaper3k — News, full-text, and article metadata extraction in Python 3.
- nltk — provides newspaper3k with tokenization functionality
- transformers — Provides you with thousands of state-of-the-art pre-trained models for a variety of natural language processing (NLP) tasks
- Gradio — A customizable graphical interface for Machine Learning models or even arbitrary Python functions
Creating an article object with Newspaper3k
Before we initialize an article object to get its content, we have to set up a user agent, which allows us to grab information from certain websites without hitting HTTP errors such as the HTTP 403 Forbidden client error.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
url = 'https://www.technologyreview.com/2021/07/09/1028140/ai-voice-actors-sound-human/'
article = Article(url, config=config)
With the configuration set up, we can initialize an article by passing the URL of the article and the configuration.
For this article, we will be using “AI voice actors sound more human than ever — and they’re ready to hire” by MIT Technology Review.
Download the article
Downloading the article details is as simple as calling article.download()
article.download()
After downloading, you can also view the article’s HTML content by accessing article.html
With the article “downloaded”, you can start extracting useful information by parsing the article.
Parse the article
article.parse()
authors = ", ".join(author for author in article.authors)
title = article.title
date = article.publish_date
text = article.text
image = article.top_image
videos = article.movies
url = article.url
By accessing attributes of the article object with dot notation, we can get information such as the article title, publish date, URL, top image, video links, and, of course, the article text that we need.
Performing NLP on the article
With the newspaper3k library, you can even get some natural language properties of the text by calling the nlp() method.
The two pieces of information we can get are:
- Article keywords
- Article summary
Calling .keywords on our article and sorting them alphabetically, we see that the important keywords include actors, ai, audio, voice, and so on.
To get a summary, we call .summary
Looking at the source code, the summary appears to be obtained by scoring the sentences with a ranking algorithm and keeping the top-scoring ones.
Hugging Face models
Instead of using newspaper3k’s summary feature, we will be summarizing our text with Hugging Face models, more specifically, models under the summarization pipeline.
If you haven’t heard of Hugging Face, here’s a brief intro taken from their docs.
What is Hugging Face?
Hugging Face provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, and more in over 100 languages.
Its main goal is to make cutting-edge NLP easier to use for everyone.
With the transformers library, you can load a model in just a few lines of code, fine-tune it on your own datasets, and share it on their model hub.
Their model hub showcases the tasks you can do with their models. For our use case, we will be focusing on summarization!
Clicking on summarization, we get a list of models we can use.
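As a quick sketch of what using one of these models looks like in code, here is the transformers summarization pipeline with distilbart-cnn-6-6, the lightest of the four models compared below (the first run downloads the model weights):

```python
from transformers import pipeline

# Load a pre-trained summarization model from the Hugging Face hub
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")

long_text = (
    "Hugging Face provides thousands of pre-trained models to perform tasks on "
    "texts such as classification, information extraction, question answering, "
    "summarization, translation, and text generation in over 100 languages. "
    "Its main goal is to make cutting-edge NLP easier to use for everyone."
)

# do_sample=False gives deterministic output
result = summarizer(long_text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```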
Using Gradio with Hugging Face
To create our app, we will be using Gradio, which allows us to create a UI for our Hugging Face model easily.
With the release of Gradio 2.0, you can even use and mix Hugging Face models with Gradio interfaces, loading them with just one line of code!
First, let’s choose a good model for our summarizer app.
Comparing Hugging Face summarization models
Using Gradio, we can run multiple models in parallel and compare the outputs.
The four models (chosen based on top downloads) we will be using are:
- distilbart-cnn-12-6
- bart-large-cnn (from Facebook)
- pegasus-xsum (from Google)
- distilbart-cnn-6-6 (a more lightweight version of distilbart-cnn-12-6)
View the whole list of Hugging Face summarization models.
First, we load up each model to an interface, and then we can initialize the parallel interface by passing each of our models.
from gradio.mix import Parallel  # mixing helpers live in gradio.mix (Gradio 2.x)

io1 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-12-6")
io2 = gr.Interface.load("huggingface/facebook/bart-large-cnn")
io3 = gr.Interface.load("huggingface/google/pegasus-xsum")
io4 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-6-6")

iface = Parallel(io1, io2, io3, io4,
                 theme='huggingface',
                 inputs=gr.inputs.Textbox(lines=10, label="Text"))

iface.launch()
Running launch, we will see this interface within Google Colab.
If you want to open it in a new tab in your browser, click on the external URL: https://<some_numbers>.gradio.app
Now all the interface needs is the text from our article.
To compare the models on our article text, we first print the text variable (which we created earlier) and copy the output.
Paste the text into the text box and hit the Submit button.
Since there are 4 models running, it might take a while to run.
From the outputs, the facebook/bart-large-cnn model seems to have the best summary as it captures what the new startup is doing and talks about the practical application of AI voices, so we’ll be using that for our app.
Creating a news summarizer app
Instead of having users paste text, a more convenient approach is to let them paste a link and have the summary created for them.
To do that, we create our own function that extracts the article text using the newspaper3k library.
def extract_article_text(url):
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    article = Article(url, config=config)
    article.download()
    article.parse()
    text = article.text
    return text
Another cool functionality of Gradio is that it allows you to run functions or models in series. For example, if we have a translator and a summarizer model, Gradio will first translate the text passed to it, then summarize the translation.
What we can do is pass our extract_article_text function first and then our summarizer into the Series class, creating the interface for our app.
So the entire flow would be:
link -> extract_article_text -> text -> summarizer_model -> summary
To provide users with some examples, I’ve provided some articles from MIT Technology Review. I also added a simple title and description for the app.
from gradio.mix import Series

extractor = gr.Interface(extract_article_text, 'text', 'text')
summarizer = gr.Interface.load("huggingface/facebook/bart-large-cnn")

sample_url = [['https://www.technologyreview.com/2021/07/22/1029973/deepmind-alphafold-protein-folding-biology-disease-drugs-proteome/'],
              ['https://www.technologyreview.com/2021/07/21/1029860/disability-rights-employment-discrimination-ai-hiring/'],
              ['https://www.technologyreview.com/2021/07/09/1028140/ai-voice-actors-sound-human/']]

desc = '''
Let Hugging Face models summarize articles for you.
Note: Shorter articles generate faster summaries.
This summarizer uses the bart-large-cnn model by Facebook.
'''

iface = Series(extractor, summarizer,
               inputs=gr.inputs.Textbox(
                   lines=2,
                   label='URL'
               ),
               outputs='text',
               title='News Summarizer',
               theme='huggingface',
               description=desc,
               examples=sample_url)

iface.launch()
After launching it, you’ll see something like this.
Let’s head over to the external URL for a nicer experience.
Presenting our News Summarizer app!
Note that the example URLs run much quicker because their results are cached; passing in new articles will take more time to summarize, especially longer ones.
Now all that’s left is to have fun with the app! You can share the link with your friends to try it out too, but note that it expires in 24 hours.
You saw firsthand how easy it is to use Hugging Face models and Gradio to create a news summarizer app, even if you didn’t know how the models work.
If you want to dive deeper into the Transformer architecture, which underlies the Hugging Face models, I recommend reading the famous “Attention Is All You Need” paper, or watching these videos below:
- Transformer Neural Networks — EXPLAINED! (Attention is all you need) by CodeEmporium
- Attention Is All You Need by Yannic Kilcher
To learn more about the Hugging Face ecosystem, you can take the free Hugging Face course by the Hugging Face team.
Thanks for reading!
Links
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!