Build a News Article Summarizer App with Hugging Face 🤗 and Gradio
Building a text summarizer with Newspaper3k, Hugging Face, and Gradio
Have a ton of articles bookmarked to read later but never get to them? Don’t have the time to read long articles?
Don’t you wish you could summarize them all and get the gist of the articles?
With Hugging Face 🤗 , you can!
In this article, I will be using the Newspaper3k library to extract text from article links and summarize them with Hugging Face pre-trained summarization models. To wrap it all up, I created a simple user interface with Gradio so anyone can get summaries of articles just with their URL!
As always, here’s where you can find the code for this article:
Install Dependencies
Before loading the libraries, we install a few dependencies that aren’t installed by default on Google Colab.
!pip install newspaper3k transformers gradio --quiet
Load libraries
- newspaper3k — News, full-text, and article metadata extraction in Python 3.
- nltk — provides newspaper3k with tokenization functionality
- transformers — Provides you with thousands of state-of-the-art pre-trained models for a variety of natural language processing (NLP) tasks
- Gradio — A customizable graphical interface for Machine Learning models or even arbitrary Python functions
Creating an article object with Newspaper3k
Before we initialize an article object to get its content, we have to set up a user agent, which allows us to grab information from certain websites without hitting HTTP errors such as the HTTP 403 Forbidden client error.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
url = 'https://www.technologyreview.com/2021/07/09/1028140/ai-voice-actors-sound-human/'
article = Article(url, config=config)
With the configuration set up, we can initialize an article by passing the URL of the article and the configuration.
For this article, we will be using “AI voice actors sound more human than ever — and they’re ready to hire” by MIT Technology Review.
Download the article
Downloading the article details is as simple as calling article.download()
article.download()
After downloading, you can also view the article’s HTML content by accessing article.html
With the article “downloaded”, you can start extracting useful information by parsing the article.
Parse the article
article.parse()
authors = ", ".join(author for author in article.authors)
title = article.title
date = article.publish_date
text = article.text
image = article.top_image
videos = article.movies
url = article.url
By accessing attributes of the article object with dot notation, we can get information such as the article title, publish date, URL, top image, video links, and, of course, the article text that we need.
Performing NLP on the article
With the newspaper3k library, you can even get some natural language properties of the text by calling the nlp() method.
The two pieces of information we can get are:
- Article keywords
- Article summary
Calling .keywords on our article and sorting them alphabetically, we see that the important keywords include actors, ai, audio, voice, and so on.
To get a summary, we call .summary
Looking at the source code, the summary appears to be obtained by scoring the sentences with a ranking algorithm and keeping the top-scoring ones.
Hugging Face models
Instead of using newspaper3k’s summary feature, we will be summarizing our text with Hugging Face models, more specifically, models under the summarization pipeline.
If you haven’t heard of Hugging Face, here’s a brief intro taken from their docs.
What is Hugging Face?
Hugging Face provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, and more in over 100 languages.
Its main goal is to make cutting-edge NLP easier to use for everyone.
With the transformers library, you can load a model in just a few lines of code, fine-tune it on your own datasets, and share it on their model hub.
Their model hub showcases the tasks you can do with their models. For our use case, we will be focusing on summarization!
Clicking on summarization, we get a list of models we can use.
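As a quick sketch of what using one of these models looks like in code, here is the transformers summarization pipeline with distilbart-cnn-6-6, the lightest of the four models compared below (the first run downloads the model weights):

```python
from transformers import pipeline

# Load a pre-trained summarization model from the Hugging Face hub
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")

long_text = (
    "Hugging Face provides thousands of pre-trained models to perform tasks on "
    "texts such as classification, information extraction, question answering, "
    "summarization, translation, and text generation in over 100 languages. "
    "Its main goal is to make cutting-edge NLP easier to use for everyone."
)

# do_sample=False gives deterministic output
result = summarizer(long_text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```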
Using Gradio with Hugging Face
To create our app, we will be using Gradio, which allows us to create a UI for our Hugging Face model easily.
With the release of Gradio 2.0, you can even use and mix Hugging Face models with Gradio interfaces, loading them with just one line of code!
First, let’s choose a good model for our summarizer app.
Comparing Hugging Face summarization models
Using Gradio, we can run multiple models in parallel and compare the outputs.
The four models (chosen based on top downloads) we will be using are:
- distilbart-cnn-12-6
- bart-large-cnn (from Facebook)
- pegasus-xsum (from Google)
- distilbart-cnn-6-6 (a more lightweight version of distilbart-cnn-12-6)
View the whole list of Hugging Face summarization models.
First, we load up each model to an interface, and then we can initialize the parallel interface by passing each of our models.
from gradio.mix import Parallel  # mixing helpers live in gradio.mix (Gradio 2.x)

io1 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-12-6")
io2 = gr.Interface.load("huggingface/facebook/bart-large-cnn")
io3 = gr.Interface.load("huggingface/google/pegasus-xsum")
io4 = gr.Interface.load("huggingface/sshleifer/distilbart-cnn-6-6")

iface = Parallel(io1, io2, io3, io4,
                 theme='huggingface',
                 inputs=gr.inputs.Textbox(lines=10, label="Text"))

iface.launch()
Running launch, we will see this interface within Google Colab.
If you want to open it in a new tab in your browser, click on the external URL: https://<some_numbers>.gradio.app
Now all the interface needs is the text from our article.
To compare the models on our article text, we first print the text variable (which we created earlier) and copy the output.
Paste the text into the text box and hit the Submit button.
Since there are 4 models running, it might take a while to run.
From the outputs, the facebook/bart-large-cnn model seems to have the best summary as it captures what the new startup is doing and talks about the practical application of AI voices, so we’ll be using that for our app.
Creating a news summarizer app
Instead of having users paste text, a more convenient approach is to let them paste a link and have the summary created for them.
To do that, we create our own function that extracts the article text using the newspaper3k library.
def extract_article_text(url):
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    article = Article(url, config=config)
    article.download()
    article.parse()
    text = article.text
    return text
Another cool functionality of Gradio is that it allows you to run functions or models in series. For example, if we have a translator and a summarizer model, Gradio will first translate the text passed to it, then summarize the translation.
What we can do is pass our extract_article_text function first and then our summarizer into the Series class, creating the interface for our app.
So the entire flow would be:
link -> extract_article_text -> text -> summarizer_model -> summary
To provide users with some examples, I’ve provided some articles from MIT Technology Review. I also added a simple title and description for the app.
from gradio.mix import Series

extractor = gr.Interface(extract_article_text, 'text', 'text')
summarizer = gr.Interface.load("huggingface/facebook/bart-large-cnn")

sample_url = [['https://www.technologyreview.com/2021/07/22/1029973/deepmind-alphafold-protein-folding-biology-disease-drugs-proteome/'],
              ['https://www.technologyreview.com/2021/07/21/1029860/disability-rights-employment-discrimination-ai-hiring/'],
              ['https://www.technologyreview.com/2021/07/09/1028140/ai-voice-actors-sound-human/']]

desc = '''
Let Hugging Face models summarize articles for you.
Note: Shorter articles generate faster summaries.
This summarizer uses the bart-large-cnn model by Facebook.
'''

iface = Series(extractor, summarizer,
               inputs=gr.inputs.Textbox(
                   lines=2,
                   label='URL'
               ),
               outputs='text',
               title='News Summarizer',
               theme='huggingface',
               description=desc,
               examples=sample_url)

iface.launch()
After launching it, you’ll see something like this.
Let’s head over to the external URL for a nicer experience.
Presenting our News Summarizer app!
Note that the example URLs run much quicker because their results are cached; passing in new articles will take more time to summarize, especially longer ones.
Now all that’s left is to have fun with the app! You can share the link with your friends to try it out too, but note that it expires in 24 hours.
You saw firsthand how easy it is to use Hugging Face models and Gradio to create a news summarizer app, even if you didn’t know how the models work.
If you want to dive deeper into the Transformer architecture, which underlies the Hugging Face models, I recommend reading the famous “Attention Is All You Need” paper, or watching these videos below:
- Transformer Neural Networks — EXPLAINED! (Attention is all you need) by CodeEmporium
- Attention Is All You Need by Yannic Kilcher
To learn more about the Hugging Face ecosystem, you can take the free Hugging Face course by the Hugging Face team.
Thanks for reading!
Links
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!