Scraping 100+ Free Data Science Books with Python
And using data science to decide which books to read.
Data is information, and having enough information is crucial for making the right decisions. How does one procure data easily from the web? The answer is web scraping.
Web scraping is an essential skill for data scientists, letting them procure the data they need with ease. Machine learning algorithms and experiments require enough data to learn and generalize well on a specific problem, so data scientists often need to gather more data to improve their models and experiments.
Popular use cases of web scraping include business intelligence, price monitoring, gauging customer satisfaction through sentiment analysis, and more.
It’s clear web scraping is a powerful tool. However, you should also adhere to best practices; here’s a great article on avoiding being blocked while web scraping.
Scraping 100+ Data Science Books
In this article, I will be scraping the article “100+ Free Data Science Books” from the website learndatasci.com, which contains many useful resources for learning data science.
I’ll be using the Beautiful Soup library, a popular library for web scraping. And the data science libraries to transform and visualize the data and gain insights from it.
Why scrape this article? The goal is to decide which book to read from the huge list of 100+ books, based on overall rating and total number of ratings.
As always, here’s where you can find the code for this article:
Before we dive into web scraping, a quick note: bitgrit’s latest competition, the Viral Tweets Prediction Challenge, is ending soon, on July 6, 2021!
If you want to apply your data science knowledge to a real-world problem and win cash prizes of up to $3,000 💵, sign up for free now! It’s a good learning experience, and you have nothing to lose by participating.
If you’re a beginner and don’t know how to get started, read our recent article — Using Data Science to Predict Viral Tweets — to guide you step-by-step to build a simple model for this competition.
Now let’s start scraping.
Observing the HTML of the books
When you want to scrape something from the internet, you always start by observing what you want to scrape.
In the article, here is how the books are presented.
And this is what it looks like in HTML.
From the inspect tool, we see all the books are within the BooksWrapper id, and each book sits inside a section tag, which holds the information we need in specific child tags:
- <div class="star-ratings"> — Goodreads rating and number of ratings
- <div class="book-cats"> — book category
- <h2> — book title
- <div class="meta-auth"> — author name and year
- <p> — book description
- <a class="btn" ...> — book link and Amazon review link
Now that we have an idea of which classes and tags to target, we can start coding!
Importing libraries
As always, we start by importing the libraries we need.
# web scraping libraries
from urllib.request import urlopen # open urls
from bs4 import BeautifulSoup # extract data from html files
# ds libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8) # default plot size
# word cloud
from wordcloud import WordCloud, STOPWORDS
# interactive tables
from google.colab import data_table
# regex
import re
- urllib.request — used to open the URL and return HTML data
- bs4 — the Beautiful Soup library, the star of the show, which helps us extract the right data from HTML
- wordcloud — creates word cloud plots for our text analysis
- re — Python’s regular expression library
Once we have our libraries, we can create our Beautiful Soup object.
Getting data with bs4
url = "https://www.learndatasci.com/free-data-science-books/"
# get html of page
html = urlopen(url)
# create bs object
soup = BeautifulSoup(html, 'lxml') # using lxml parser
Using the urlopen function and passing in the URL, we get our HTML data. After that, we create a BeautifulSoup object using the lxml parser. You can also use other parsers, as long as they work for your particular case; read here for the differences between parsers.
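If you don’t have lxml installed, Python’s built-in parser also works. A quick sketch of the alternative (note we re-open the URL, since creating the first soup object consumed the response):
html = urlopen(url)  # re-open the page; the earlier read consumed the stream
soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install needed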
Title of our website
# get title
title = soup.title
print(title.get_text())
...
100+ Free Data Science Books – LearnDataSci
Getting HTML of a single book
books = soup.find_all('section', attrs={"class": ""}) # to prevent getting ad section
book1 = books[0]
print(book1.prettify())
<section>
<div class="book row" isbn-data="0136042597">
<div class="col-lg-4">
<div style="width:100%;">
<img alt="Artificial Intelligence A Modern Approach, 1st Edition" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Artificial-Intelligence-A-Modern-Approach_2015.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-8">
<div class="star-ratings">
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/star-outline.svg"/>
<b>
4.2
</b>
<span>
(342 Ratings)
</span>
<button data-tooltip="Good Reads: 4.2">
?
</button>
</div>
<div class="star-ratings">
</div>
<div class="book-cats">
Artificial Intelligence
</div>
<h2>
Artificial Intelligence A Modern Approach, 1st Edition
</h2>
<span class="meta-auth">
<b>
Stuart Russell, 1995
</b>
</span>
<div class="meta-auth-ttl">
</div>
<p>
Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.
</p>
<div>
<a class="btn" href="http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf" rel="nofollow">
View Free Book
</a>
<a class="btn" href="http://www.amazon.com/gp/product/0136042597/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0136042597&linkCode=as2&tag=learnds-20&linkId=3FRORB7P56CEWSK5" rel="nofollow">
See Reviews
</a>
</div>
</div>
</div>
</section>
Calling find_all on the section tag and getting the first occurrence (the first book), you can see it’s exactly what it looked like on the inspect page.
Notice I set the attributes (attrs) to an empty class because one of the section tags contains an ad that looks like this → <section class="ad-block">. Since none of the other section tags has a class name, doing so prevents the find_all call from picking up the ad.
Searching with bs4
Note that this isn’t a comprehensive tour of bs4; I’m only doing basic web scraping here, so I only touched on some of its searching functionality.
Here are the methods I used:
- soup.find() — first occurrence of a class/tag
- soup.find_all() — all occurrences of a class/tag
- soup.find().find() — search within a class/tag
- .get_text() — returns the text of an HTML tag
- .prettify() — pretty-prints the HTML
You can find more functions for searching on the Beautiful Soup documentation.
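To make these concrete, here is a tiny demonstration on our soup object (my own example, not from the original notebook; the h2 tags happen to be the book titles on this page):
first_title = soup.find('h2')      # first <h2> tag on the page
all_titles = soup.find_all('h2')   # every <h2> tag on the page
print(first_title.get_text())      # text inside the tag
print(first_title.prettify())      # nicely indented HTML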
Getting all the information we need
We already observed which tags hold the information we need:
- book rating
- total number of ratings
- book category
- book title
- author name
- book description
- book link
- Amazon review link
So let’s write the code to get each piece of information.
rating = book1.find(class_='star-ratings').find('b').get_text()
total_ratings = book1.find(class_='star-ratings').find('span').get_text()
total_ratings = re.search(r'\d+', total_ratings).group() # get numbers only
book_cat = book1.find(class_='book-cats').get_text()
title = book1.find('h2').get_text()
author, year = book1.find(class_='meta-auth').find('b').get_text().split(', ')
desc = book1.find('p').get_text()
links = book1.find_all('a')
book_link = links[0].get('href')
review_link = links[1].get('href')
print(f"title: {title}")
print(f"category: {book_cat}")
print(f"author: {author}")
print(f"year: {year}")
print(f"rating: {rating}")
print(f"total_ratings: {total_ratings}")
print(f"description: {desc}")
print(f"link: {book_link}")
print(f"review link: {review_link}")
title: Artificial Intelligence A Modern Approach, 1st Edition
category: Artificial Intelligence
author: Stuart Russell
year: 1995
rating: 4.2
total_ratings: 342
description: Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.
link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
review link: http://www.amazon.com/gp/product/0136042597/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0136042597&linkCode=as2&tag=learnds-20&linkId=3FRORB7P56CEWSK5
Most of the information was easy to obtain using find() and get_text(), but some required extra extraction in Python to get exactly what we want.
- For total_ratings, the text looks like this → (342 Ratings), but we only want the number. Using re, we can pass in the regex pattern \d+, which means “match one or more digits (\d repeated with +)”. re.search returns a match object, so we call its group function to get our number, 342.
- For author and year, the text gives us the author and year separated by a comma, like this → Stuart Russell, 1995. We can split the text on the comma using split(', ') and use Python’s tuple unpacking to assign the results to author and year. However, this is a naive method; later on, there will be cases we didn’t consider that break it.
- For the links, we only want the URL itself, and get('href') easily does that for us.
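Here is a tiny standalone sketch of the re.search and tuple-unpacking steps described above (the inputs are hard-coded purely for illustration):
match = re.search(r'\d+', '(342 Ratings)')
print(match.group())  # '342' — group() returns the matched substring
author, year = 'Stuart Russell, 1995'.split(', ')  # tuple unpacking
print(author)  # Stuart Russell
print(year)    # 1995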
As is often the case with coding, running our methods on more data surfaces unexpected problems we didn’t account for, and we need to revise our code to handle them.
Dealing with missing components
The problems, in this case, are missing components from our book information. The issues are:
- books without year & multiple authors
- books without rating
- books without review links
- books without description
Note there are different ways you can deal with these issues. Below is my way of dealing with them.
Books without year & multiple authors
For the first issue, here are three books that will cause problems with our initial code.
book7 = books[7] # book without year
book35 = books[35] # book without year but multiple author
book17 = books[17] # book with multiple authors
print(book1.find(class_='meta-auth').find('b').get_text())
print(book7.find(class_='meta-auth').find('b').get_text())
print(book35.find(class_='meta-auth').find('b').get_text())
print(book17.find(class_='meta-auth').find('b').get_text())
Stuart Russell, 1995
Jeff Leek
Jeffrey Stanton, Syracuse University
Yoshua Bengio, Ian J. Goodfellow, & Aaron Courville, 2015
As you can see, if we split by comma only, we won’t be able to get the year for books with multiple authors, and in the case of book35, we’d get “Syracuse University”, which is definitely not a year.
To resolve this, we first check whether the text contains any digits, and only then perform the split. We still need to account for multiple authors: after splitting, the year is the last element of the list, so we grab it with [-1] in Python.
# author = book1.find(class_='meta-auth').find('b').get_text()
# author = book7.find(class_='meta-auth').find('b').get_text()
author = book17.find(class_='meta-auth').find('b').get_text()
author = book35.find(class_='meta-auth').find('b').get_text()
# some books don't have a year, and some have multiple authors
if re.search(r'\d+', author) is not None:
    author_year = author.split(", ")
    author = ", ".join(author_year[:-1])
    year = author_year[-1]
else:
    year = None
print(author)
print(year)
Jeffrey Stanton, Syracuse University
None
If we don’t get a digit, we simply set the year to None.
Books without rating
Moving on to books without a rating, we pick book23, which has none.
book23 = books[23] # book without rating
print(book1.find(class_='star-ratings').prettify())
print()
print(book23.find(class_='star-ratings').prettify())
<div class="star-ratings">
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/star-outline.svg"/>
<b>
4.2
</b>
<span>
(342 Ratings)
</span>
<button data-tooltip="Good Reads: 4.2">
?
</button>
</div>
<div class="star-ratings">
</div>
You can observe that calling find on that particular book shows there is no information within the div tag.
Since bs4’s find already returns None when nothing matches, we can simply add a condition for searches that don’t return None and reuse the code we had before.
# rating = book1.find(class_='star-ratings').find('b')
# total_ratings = book1.find(class_='star-ratings').find('span')
rating = book23.find(class_='star-ratings').find('b')
total_ratings = book23.find(class_='star-ratings').find('span')
# some books don't have ratings
if rating is not None and total_ratings is not None:
    rating = rating.get_text()
    total_ratings = total_ratings.get_text()
    total_ratings = re.search(r'\d+', total_ratings).group()
print(rating)
print(total_ratings)
None
None
Printing rating and total_ratings, we see that both are None now.
Books without review link
For books with both a book link and a review link, find_all('a') returns two links. book8 doesn’t have a review link, so it returns only one.
book8 = books[8] # book without review link
print(len(book1.find_all('a')))
print(len(book8.find_all('a')))
2
1
Since all books have a book link, we only have to check whether the length is 2. If it is, we get review_link; else, we set it to None.
links = book8.find_all('a')
book_link = links[0].get('href')
if len(links) == 2:
    review_link = links[1].get('href')
else:
    review_link = None
print(book_link)
print(review_link)
http://ciml.info/dl/v0_9/ciml-v0_9-all.pdf
None
For book8, which has no review link, you can see it now returns None.
Books without description
book13 = books[13] # book without desc
print(book1.find('p'))
print(book13.find('p'))
<p>Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.</p>
None
For books without a description, the find function already handles this for us, since it returns None if the tag doesn’t exist.
After dealing with all those issues, we can start storing our data and building a pandas data frame.
Storing and building our data frame
First, we create a list for each piece of information so we can append to it later on.
title_list = []
book_cat_list = []
author_list = []
year_list = []
rating_list = []
total_ratings_list = []
description_list = []
book_link_list = []
review_link_list = []
Get book info function
To get the information from each book, I created a function that runs the extraction code for each piece of information and appends the results to their respective lists.
def getInfo(book):
    # get and add title data
    title = book.find('h2')
    title_list.append(title.get_text())

    # get book category
    book_cat = book.find(class_='book-cats')
    if book_cat is not None:
        book_cat = book_cat.get_text()
    book_cat_list.append(book_cat)

    # get author and year data
    author = book.find(class_='meta-auth').find('b').get_text()
    # some books don't have a year, and some have multiple authors
    if re.search(r'\d+', author) is not None:
        author_year = author.split(", ")
        author = ", ".join(author_year[:-1])
        year = author_year[-1]
    else:
        year = None
    author_list.append(author)
    year_list.append(year)

    # get rating and total number of ratings
    rating = book.find(class_='star-ratings').find('b')
    total_ratings = book.find(class_='star-ratings').find('span')
    # some books don't have ratings
    if rating is not None and total_ratings is not None:
        rating = rating.get_text()
        total_ratings = total_ratings.get_text()
        total_ratings = re.search(r'\d+', total_ratings).group()  # get numbers only
    rating_list.append(rating)
    total_ratings_list.append(total_ratings)

    # get description
    desc = book.find('p')
    # some books don't have a description
    if desc is not None:
        desc = desc.get_text()
    description_list.append(desc)

    # get book link and review link
    links = book.find_all('a')
    book_link = links[0].get('href')
    book_link_list.append(book_link)
    # some books don't have a review link
    if len(links) == 2:
        review_link = links[1].get('href')
    else:
        review_link = None
    review_link_list.append(review_link)
Note this was a quick-and-dirty way to get my data. There are better ways to structure the code and make it cleaner, but it works, and that’s what matters for now.
Using our function, we iterate over each book in books, the bs object that contains all the book information.
for book in books:
    getInfo(book)
Building our Pandas data frame
With all our information in lists, we can build our Pandas data frame!
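The construction itself isn’t shown in the snippets above; here’s a minimal sketch, assuming column names that match the info() output below:
df_books = pd.DataFrame({
    'title': title_list,
    'book_cat': book_cat_list,
    'author': author_list,
    'year': year_list,
    'rating': rating_list,
    'total_ratings': total_ratings_list,
    'description': description_list,
    'book_link': book_link_list,
    'review_link': review_link_list,
})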
Calling info on our data frame:
df_books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 97 non-null object
1 book_cat 96 non-null object
2 author 97 non-null object
3 year 80 non-null object
4 rating 58 non-null object
5 total_ratings 58 non-null object
6 description 79 non-null object
7 book_link 97 non-null object
8 review_link 55 non-null object
dtypes: object(9)
memory usage: 6.9+ KB
We can see that we only have 97 books, which means either the article’s “100+” title overcounts or our scraping missed a few, but no worries. We also see every column has the object dtype, which we’ll have to fix later.
But first, let’s clean the data.
Data Cleaning
Remember the values we set to None earlier in our code; they are the missing values in our data frame.
df_books.isnull().sum()
title 0
book_cat 1
author 0
year 17
rating 39
total_ratings 39
description 18
book_link 0
review_link 42
dtype: int64
What can we do to replace these missing values? Here’s what I thought of:
- book_cat — check the book and impute it manually ourselves, since it’s only one book
- year — leave it empty for now
- rating — replace with 0.0
- total_ratings — replace with 0
- description & review_link — replace with "None"
If you want to go a step further, you could write a script that iterates over the books, queries each title on sites like Amazon or Goodreads, and grabs the missing information. I won’t do that here, but you’re welcome to try it out!
If you want to brush up your data cleaning skills, check out our Data Cleaning using Python article.
Let’s start with the book with the missing category!
We can pull up the specific row where book_cat is null. We can also bring up the existing categories and figure out which one suits this particular book.
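A minimal sketch of how you might do that (these exact calls are my own, not from the original notebook):
df_books[df_books['book_cat'].isnull()]  # the one book with a missing category
df_books['book_cat'].unique()            # existing categories to choose from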
Since it’s under Artificial Intelligence, I chose to replace it with that.
df_books.fillna({'rating': '0.0'}, inplace=True)
df_books.fillna({'total_ratings':'0'}, inplace=True)
df_books.fillna({'book_cat': 'Artificial Intelligence'}, inplace=True)
df_books.fillna({'description':'None'}, inplace=True)
df_books.fillna({'review_link':'None'}, inplace=True)
df_books.isnull().sum()
title 0
book_cat 0
author 0
year 17
rating 0
total_ratings 0
description 0
book_link 0
review_link 0
dtype: int64
After replacing the missing values in the other columns, we only have the year column with missing values, which I’ll leave empty.
Data Transformation
Next up, we transform the data type of some of our columns.
Columns to convert:
- year → Int64 (a nullable integer)
- rating → float
- total_ratings → integer
Pandas has a useful function, convert_dtypes(), which converts columns to the best possible dtypes. It’s not very helpful in our case, since all our columns are objects, but it does convert them all to strings.
# data transformation
df_books = df_books.convert_dtypes()  # convert all columns to string
# convert to numeric types
df_books['year'] = df_books['year'].astype('Int64')
df_books['rating'] = df_books['rating'].astype('float64')
df_books['total_ratings'] = df_books['total_ratings'].astype('Int64')
df_books.dtypes
title string
book_cat string
author string
year Int64
rating float64
total_ratings Int64
description string
book_link string
review_link string
dtype: object
For year, we convert it to Int64, which supports NA values. We do the same for total_ratings, while rating becomes float64.
Now our data is ready, and it’s time to visualize it!
Exploratory Data Analysis
Let’s visualize our data and see if we can find anything interesting from these 100 books.
For text data, I decided to build a word cloud plotting function.
def plot_wordcloud(text, file_name, stopwords_list=[], max_words=500):
    # create stopword list
    stopwords = set(STOPWORDS)
    stopwords.update(stopwords_list)

    # generate word cloud
    wordcloud = WordCloud(width=1000, height=600,
                          stopwords=stopwords,
                          max_words=max_words,
                          background_color="white").generate(text)

    # save the image to a file
    wordcloud.to_file(file_name + ".png")

    # display the generated image
    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off");
Word cloud of book titles
By joining all the texts in the title column, we can count how many unique words there are.
text = " ".join(title for title in df_books.title)
print("There are {} words in the combination of all titles.".format(len(set(text.split(" ")))))
There are 211 words in the combination of all titles.
With the text ready, we can plot our word cloud.
plot_wordcloud(text, "100ds_titles")
We see that the words Python and Data are predominant, with Machine Learning and Learning coming in close. This makes sense since Python is the most popular language for Data Science.
Word cloud of book descriptions
text = " ".join(desc for desc in df_books.description)
print("There are {} words in the combination of all description.".format(len(set(text.split(" ")))))
There are 990 words in the combination of all description.
There are over 990 individual words in the book descriptions.
Let’s see what we get when we plot the word cloud.
This column had “None” imputed, so let’s add it to our list of stop words.
plot_wordcloud(text, "100ds_book_descriptions", ['None'], 1000) # add None to stopwords
We see the word data is highly prevalent, along with book and programming. The words Python and introduction come in close behind. This suggests the books we scraped are mostly introductory programming books related to data, many of them using Python.
Book category
Our books had many categories; which one was the most common?
sns.histplot(data=df_books, y='book_cat', discrete=True);
From our plot, we see Data Mining and Machine Learning were the most common, with Learning Languages coming in second.
Book year
We can also plot the year counts and find out in which year the most books were released.
sns.histplot(data=df_books, x='year', discrete=True);
From our plot, we see 2015 was the most common year in our list of books, with exactly 18 releases.
Book Rating and Total Ratings
Since the rating and total_ratings columns had a lot of missing data, 39 out of 97 rows (about 40%), we can expect the data to be quite skewed.
Calling describe on the columns, we see 4.6 is the highest book rating, and the maximum number of ratings is 1659.
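For reference, a sketch of the call that produces those summary statistics:
df_books[['rating', 'total_ratings']].describe()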
Plotting a histogram of the columns:
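A minimal sketch of the histogram call (the exact styling may differ from the original):
df_books[['rating', 'total_ratings']].hist(bins=20);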
Considering we have quite a lot of missing data and the fact we have very little data, we see that the distribution is pretty skewed.
This is more evident when we plot a boxplot.
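Something like the following produces the boxplot for total_ratings (my own sketch):
sns.boxplot(x=df_books['total_ratings']);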
For total_ratings, notice how three points are extreme outliers.
We can also plot both these columns together with a strip plot.
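A sketch of the strip plot, assuming rating on the x-axis and total_ratings on the y-axis:
sns.stripplot(data=df_books, x='rating', y='total_ratings');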
We see that there are three books with a high rating and total number of ratings.
Which book to read?
Let’s say you stumble upon these 100 books and have no idea what to read; why not look at the rating and total ratings to help you decide? Just like with movie ratings, we can usually trust the consensus on whether a book is good.
By scraping the data and visualizing it, we gain insights that help us make decisions, which is essentially what data science is all about.
Which are the three books with high total ratings and ratings from our plot?
Based on the last plot, we saw three data points with total ratings above 1500 and ratings above 4.0 (4.2, to be exact).
df_books[(df_books['total_ratings'] > 1500) & (df_books['rating'] > 4.0)].iloc[:, :6]
Voila! The three books are:
- Automate the Boring Stuff with Python
- An Introduction to Statistical Learning
- Pattern Recognition and Machine Learning
What are the top 10 books in total ratings?
We can sort our data frame by any column, so let’s see what the top 10 books are by total ratings.
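Mirroring the sort_values call used in the next section, a sketch of the sort by total_ratings:
df_books.sort_values(by='total_ratings', ascending=False)[:10].iloc[:, :6]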
Aside from the top three books above, we see more popular data science books, namely Natural Language Processing with Python, Python for Everybody, and Artificial Intelligence: A Modern Approach.
What are the top 10 books in terms of rating and total ratings?
How about sorting by rating and then total_ratings?
df_books.sort_values(by=['rating', 'total_ratings'], ascending=False)[:10].iloc[:, :6]
We see that some new books popped up on the list. However, some of them have very few ratings; for example, Elementary Differential Equations has only 5 ratings, so it’s hard to say whether we can trust that it’s good.
Can you trust the results?
To be honest, our data set is tiny. With only around 100 books, and roughly 40 of the rating and total-ratings values missing, our results will be biased. To make a truly data-driven decision, we should increase our sample size and fill in the missing data with more scraping.
If you want to take this even further, you can also calculate the weighted rating of the books the same way IMDB ranks their top films.
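A minimal sketch of that weighted rating, using the common formula WR = v/(v+m)·R + m/(v+m)·C, where R is a book’s rating, v its number of ratings, C the mean rating across all books, and m a minimum-ratings threshold (the 75th percentile here is my own choice):
C = df_books['rating'].mean()                 # mean rating across all books
m = df_books['total_ratings'].quantile(0.75)  # minimum-ratings threshold (assumption)
v = df_books['total_ratings']
R = df_books['rating']
df_books['weighted_rating'] = (v / (v + m)) * R + (m / (v + m)) * C
df_books.sort_values(by='weighted_rating', ascending=False)[:10].iloc[:, :6]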
Interactive Data Table
If you’re in Google Colab, you can run this command to get an interactive table and do more exploration, like sorting and filtering each column.
data_table.DataTable(df_books, include_index=False, num_rows_per_page=5)
You now have all 97 books in a data frame.
You can export it to a CSV file like this:
df_books.to_csv('100_DS_books.csv', index=False)
A cool thing you can do with this CSV file is iterate over the book_link column and download all the books to your computer.
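A hypothetical sketch of such a downloader (the filename handling is my own, and only direct PDF links are attempted):
from urllib.request import urlretrieve

books = pd.read_csv('100_DS_books.csv')
for _, row in books.iterrows():
    link = str(row['book_link'])
    if link.endswith('.pdf'):  # only grab direct PDF links
        filename = row['title'].replace('/', '-')[:50] + '.pdf'  # crude filename sanitizing
        try:
            urlretrieve(link, filename)
        except Exception as e:
            print(f"Couldn't download {row['title']}: {e}")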
Thanks for reading!
That’s all for this article, and I hope you got a glimpse of the power of web scraping!
Here are some resources and tutorials on web scraping:
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!