Scraping 100+ Free Data Science Books with Python
And using data science to decide which books to read.
Data is information, and having enough information is crucial for making the right decisions. How does one procure data easily from the web? The answer is web scraping.
Web scraping is an essential skill for data scientists, letting them procure the data they need with ease. Machine learning algorithms and experiments require enough data to learn and generalize well on a specific problem, so data scientists often need to gather more data to improve their models and experiments.
Popular use cases of web scraping include business intelligence, price monitoring, gauging customer satisfaction through sentiment analysis, and more.
It’s clear web scraping is a powerful tool. However, you should also adhere to best practices; here’s a great article on avoiding being blocked while web scraping.
Scraping 100+ Data Science Books
In this article, I will be scraping the article “100+ Free Data Science Books” from the website learndatasci.com, which contains many useful resources for learning data science.
I’ll be using the Beautiful Soup library, a popular library for web scraping. And the data science libraries to transform and visualize the data and gain insights from it.
Why scrape this article? The goal is to decide which book to read from the huge list of 100+ books, based on overall rating and total number of ratings.
As always, here’s where you can find the code for this article:
Before we dive into web scraping, a quick note: bitgrit’s latest competition, the Viral Tweets Prediction Challenge, is ending soon, on July 6, 2021!
If you want to apply your data science knowledge to a real-world problem and win cash prizes of up to $3,000 💵, sign up for free now! It’s a good learning experience, and you have nothing to lose by participating.
If you’re a beginner and don’t know how to get started, read our recent article — Using Data Science to Predict Viral Tweets — to guide you step-by-step to build a simple model for this competition.
Now let’s start scraping.
Observing the HTML of the books
When you want to scrape something from the internet, you always start by observing what you want to scrape.
In the article, here is how the books are presented.
And this is what it looks like in HTML.
From the inspect tool, we see all the books are within the BooksWrapper id, and each book sits inside a section tag, which holds the information we need in specific child tags:
- <div class="star-ratings"> — Goodreads rating and number of ratings
- <div class="book-cats"> — book category
- <h2> — book title
- <div class="meta-auth"> — author name and year
- <p> — book description
- <a class="btn" ...> — book link and Amazon review link
Now that we have an idea of which classes and tags to target, we can start coding!
Importing libraries
As always, we start by importing the libraries we need.
# web scraping libraries
from urllib.request import urlopen # open urls
from bs4 import BeautifulSoup # extract data from html files
# ds libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8) # default plot size
# word cloud
from wordcloud import WordCloud, STOPWORDS
# interactive tables
from google.colab import data_table
# regex
import re
- urllib.request — used to open the URL and return HTML data
- bs4 — the Beautiful Soup library, the star of the show, which helps us extract the right data from HTML
- wordcloud — creates word cloud plots for our text analysis
- re — Python’s regular expression library
Once we have our libraries, we can create our Beautiful Soup object.
Getting data with bs4
url = "https://www.learndatasci.com/free-data-science-books/"
# get html of page
html = urlopen(url)
# create bs object
soup = BeautifulSoup(html, 'lxml') # using lxml parser
Using the urlopen function and passing in the URL, we get our HTML data. After that, we create a BeautifulSoup object using the lxml parser. You can also use other parsers, as long as they work for your particular case; read here for the differences between parsers.
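If you don’t have lxml installed, Python’s built-in parser also works. A quick sketch of the alternative (note we re-open the URL, since creating the first soup object consumed the response):
html = urlopen(url)  # re-open the page; the earlier read consumed the stream
soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install needed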
Title of our website
# get title
title = soup.title
print(title.get_text())
...
100+ Free Data Science Books – LearnDataSci
Getting HTML of a single book
books = soup.find_all('section', attrs={"class": ""}) # to prevent getting ad section
book1 = books[0]
print(book1.prettify())
<section>
<div class="book row" isbn-data="0136042597">
<div class="col-lg-4">
<div style="width:100%;">
<img alt="Artificial Intelligence A Modern Approach, 1st Edition" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Artificial-Intelligence-A-Modern-Approach_2015.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-8">
<div class="star-ratings">
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/star-outline.svg"/>
<b>
4.2
</b>
<span>
(342 Ratings)
</span>
<button data-tooltip="Good Reads: 4.2">
?
</button>
</div>
<div class="star-ratings">
</div>
<div class="book-cats">
Artificial Intelligence
</div>
<h2>
Artificial Intelligence A Modern Approach, 1st Edition
</h2>
<span class="meta-auth">
<b>
Stuart Russell, 1995
</b>
</span>
<div class="meta-auth-ttl">
</div>
<p>
Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.
</p>
<div>
<a class="btn" href="http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf" rel="nofollow">
View Free Book
</a>
<a class="btn" href="http://www.amazon.com/gp/product/0136042597/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0136042597&linkCode=as2&tag=learnds-20&linkId=3FRORB7P56CEWSK5" rel="nofollow">
See Reviews
</a>
</div>
</div>
</div>
</section>
Calling find_all on the section tag and getting the first occurrence (the first book), you can see it’s exactly what it looked like on the inspect page.
Notice I set the attributes (attrs) to an empty class because one of the section tags contains an ad that looks like this → <section class="ad-block">. Since none of the other section tags has a class name, doing so prevents the find_all call from picking up the ad.
Searching with bs4
Note that this isn’t a comprehensive tour of bs4; I’m only doing basic web scraping here, so I only touched on some of its searching functionality.
Here are the methods I used:
- soup.find() — first occurrence of a class/tag
- soup.find_all() — all occurrences of a class/tag
- soup.find().find() — search within a class/tag
- .get_text() — returns the text of an HTML tag
- .prettify() — pretty-prints the HTML
You can find more functions for searching on the Beautiful Soup documentation.
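To make these concrete, here is a tiny demonstration on our soup object (my own example, not from the original notebook; the h2 tags happen to be the book titles on this page):
first_title = soup.find('h2')      # first <h2> tag on the page
all_titles = soup.find_all('h2')   # every <h2> tag on the page
print(first_title.get_text())      # text inside the tag
print(first_title.prettify())      # nicely indented HTML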
Getting all the information we need
We already observed which tags hold the information we need:
- book rating
- total number of ratings
- book category
- book title
- author name
- book description
- book link
- Amazon review link
So let’s write the code to get each piece of information.
rating = book1.find(class_='star-ratings').find('b').get_text()
total_ratings = book1.find(class_='star-ratings').find('span').get_text()
total_ratings = re.search(r'\d+', total_ratings).group() # get numbers only
book_cat = book1.find(class_='book-cats').get_text()
title = book1.find('h2').get_text()
author, year = book1.find(class_='meta-auth').find('b').get_text().split(', ')
desc = book1.find('p').get_text()
links = book1.find_all('a')
book_link = links[0].get('href')
review_link = links[1].get('href')
print(f"title: {title}")
print(f"category: {book_cat}")
print(f"author: {author}")
print(f"year: {year}")
print(f"rating: {rating}")
print(f"total_ratings: {total_ratings}")
print(f"description: {desc}")
print(f"link: {book_link}")
print(f"review link: {review_link}")
title: Artificial Intelligence A Modern Approach, 1st Edition
category: Artificial Intelligence
author: Stuart Russell
year: 1995
rating: 4.2
total_ratings: 342
description: Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.
link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
review link: http://www.amazon.com/gp/product/0136042597/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0136042597&linkCode=as2&tag=learnds-20&linkId=3FRORB7P56CEWSK5
Most of the information was easy to obtain using find() and get_text(), but some required extra extraction in Python to get exactly what we want.
- For total_ratings, the text looks like this → (342 Ratings), but we only want the number. Using re, we can pass in the regex pattern \d+, which means “match one or more digits (\d repeated with +)”. re.search returns a match object, so we call its group function to get our number, 342.
- For author and year, the text gives us the author and year separated by a comma, like this → Stuart Russell, 1995. We can split the text on the comma using split(', ') and use Python’s tuple unpacking to assign the results to author and year. However, this is a naive method; later on, there will be cases we didn’t consider that break it.
- For the links, we only want the URL itself, and get('href') easily does that for us.
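Here is a tiny standalone sketch of the re.search and tuple-unpacking steps described above (the inputs are hard-coded purely for illustration):
match = re.search(r'\d+', '(342 Ratings)')
print(match.group())  # '342' — group() returns the matched substring
author, year = 'Stuart Russell, 1995'.split(', ')  # tuple unpacking
print(author)  # Stuart Russell
print(year)    # 1995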
As is often the case with coding, running our methods on more data surfaces unexpected problems we didn’t account for, and we need to revise our code to handle them.
Dealing with missing components
The problems, in this case, are missing components from our book information. The issues are:
- books without year & multiple authors
- books without rating
- books without review links
- books without description
Note there are different ways you can deal with these issues. Below is my way of dealing with them.
Books without year & multiple authors
For the first issue, here are three books that will cause problems with our initial code.
book7 = books[7] # book without year
book35 = books[35] # book without year but multiple author
book17 = books[17] # book with multiple authors
print(book1.find(class_='meta-auth').find('b').get_text())
print(book7.find(class_='meta-auth').find('b').get_text())
print(book35.find(class_='meta-auth').find('b').get_text())
print(book17.find(class_='meta-auth').find('b').get_text())
Stuart Russell, 1995
Jeff Leek
Jeffrey Stanton, Syracuse University
Yoshua Bengio, Ian J. Goodfellow, & Aaron Courville, 2015
As you can see, if we split by comma only, we won’t be able to get the year for books with multiple authors, and in the case of book35, we’d get “Syracuse University”, which is definitely not a year.
To resolve this, we first check whether the text contains any digits, and only then perform the split. We still need to account for multiple authors: after splitting, the year is the last element of the list, so we grab it with [-1] in Python.
# author = book1.find(class_='meta-auth').find('b').get_text()
# author = book7.find(class_='meta-auth').find('b').get_text()
author = book17.find(class_='meta-auth').find('b').get_text()
author = book35.find(class_='meta-auth').find('b').get_text()
# some books don't have a year, and some have multiple authors
if re.search(r'\d+', author) is not None:
    author_year = author.split(", ")
    author = ", ".join(author_year[:-1])
    year = author_year[-1]
else:
    year = None
print(author)
print(year)
Jeffrey Stanton, Syracuse University
None
If we don’t get a digit, we simply set the year to None.
Books without rating
Moving on to books without a rating, we pick book23, which has none.
book23 = books[23] # book without rating
print(book1.find(class_='star-ratings').prettify())
print()
print(book23.find(class_='star-ratings').prettify())
<div class="star-ratings">
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/filled-star.svg"/>
<img src="https://storage.googleapis.com/lds-static/imgs/star-outline.svg"/>
<b>
4.2
</b>
<span>
(342 Ratings)
</span>
<button data-tooltip="Good Reads: 4.2">
?
</button>
</div>
<div class="star-ratings">
</div>
You can observe that calling find on that particular book shows there is no information within the div tag.
Since bs4’s find already returns None when nothing matches, we can simply add a condition for searches that don’t return None and reuse the code we had before.
# rating = book1.find(class_='star-ratings').find('b')
# total_ratings = book1.find(class_='star-ratings').find('span')
rating = book23.find(class_='star-ratings').find('b')
total_ratings = book23.find(class_='star-ratings').find('span')
# some books don't have ratings
if rating is not None and total_ratings is not None:
    rating = rating.get_text()
    total_ratings = total_ratings.get_text()
    total_ratings = re.search(r'\d+', total_ratings).group()
print(rating)
print(total_ratings)
None
None
Printing rating and total_ratings, we see that both are None now.
Books without review link
For books with both a book link and a review link, find_all('a') returns two links. book8 doesn’t have a review link, so it returns only one.
book8 = books[8] # book without review link
print(len(book1.find_all('a')))
print(len(book8.find_all('a')))
2
1
Since all books have a book link, we only have to check whether the length is 2. If it is, we get review_link; else, we set it to None.
links = book8.find_all('a')
book_link = links[0].get('href')
if len(links) == 2:
    review_link = links[1].get('href')
else:
    review_link = None
print(book_link)
print(review_link)
http://ciml.info/dl/v0_9/ciml-v0_9-all.pdf
None
For book8, which has no review link, you can see it now returns None.
Books without description
book13 = books[13] # book without desc
print(book1.find('p'))
print(book13.find('p'))
<p>Comprehensive, up-to-date introduction to the theory and practice of artificial intelligence. Number one in its field, this textbook is ideal for one or two-semester, undergraduate or graduate-level courses in Artificial Intelligence.</p>
None
For books without a description, the find function already handles this for us, since it returns None if the tag doesn’t exist.
After dealing with all those issues, we can start storing our data and building a pandas data frame.
Storing and building our data frame
First, we create a list for each piece of information so we can append to it later on.
title_list = []
book_cat_list = []
author_list = []
year_list = []
rating_list = []
total_ratings_list = []
description_list = []
book_link_list = []
review_link_list = []
Get book info function
To get the information from each book, I created a function that runs the extraction code for each piece of information and appends the results to their respective lists.
def getInfo(book):
    # get and add title data
    title = book.find('h2')
    title_list.append(title.get_text())

    # get book category
    book_cat = book.find(class_='book-cats')
    if book_cat is not None:
        book_cat = book_cat.get_text()
    book_cat_list.append(book_cat)

    # get author and year data
    author = book.find(class_='meta-auth').find('b').get_text()
    # some books don't have a year, and some have multiple authors
    if re.search(r'\d+', author) is not None:
        author_year = author.split(", ")
        author = ", ".join(author_year[:-1])
        year = author_year[-1]
    else:
        year = None
    author_list.append(author)
    year_list.append(year)

    # get rating and total number of ratings
    rating = book.find(class_='star-ratings').find('b')
    total_ratings = book.find(class_='star-ratings').find('span')
    # some books don't have ratings
    if rating is not None and total_ratings is not None:
        rating = rating.get_text()
        total_ratings = total_ratings.get_text()
        total_ratings = re.search(r'\d+', total_ratings).group()  # get numbers only
    rating_list.append(rating)
    total_ratings_list.append(total_ratings)

    # get description
    desc = book.find('p')
    # some books don't have a description
    if desc is not None:
        desc = desc.get_text()
    description_list.append(desc)

    # get book link and review link
    links = book.find_all('a')
    book_link = links[0].get('href')
    book_link_list.append(book_link)
    # some books don't have a review link
    if len(links) == 2:
        review_link = links[1].get('href')
    else:
        review_link = None
    review_link_list.append(review_link)
Note this was a quick-and-dirty way to get my data. There are better ways to structure the code and make it cleaner, but it works, and that’s what matters for now.
Using our function, we iterate over each book in books, the bs object that contains all the book information.
for book in books:
    getInfo(book)
Building our Pandas data frame
With all our information in lists, we can build our Pandas data frame!
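The construction itself isn’t shown in the snippets above; here’s a minimal sketch, assuming column names that match the info() output below:
df_books = pd.DataFrame({
    'title': title_list,
    'book_cat': book_cat_list,
    'author': author_list,
    'year': year_list,
    'rating': rating_list,
    'total_ratings': total_ratings_list,
    'description': description_list,
    'book_link': book_link_list,
    'review_link': review_link_list,
})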
Calling info on our data frame:
df_books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 97 non-null object
1 book_cat 96 non-null object
2 author 97 non-null object
3 year 80 non-null object
4 rating 58 non-null object
5 total_ratings 58 non-null object
6 description 79 non-null object
7 book_link 97 non-null object
8 review_link 55 non-null object
dtypes: object(9)
memory usage: 6.9+ KB
We can see that we only have 97 books, which means either the article’s “100+” title overcounts or our scraping missed a few, but no worries. We also see every column has the object dtype, which we’ll have to fix later.
But first, let’s clean the data.
Data Cleaning
Remember the values we set to None earlier in our code; they are the missing values in our data frame.
df_books.isnull().sum()
title 0
book_cat 1
author 0
year 17
rating 39
total_ratings 39
description 18
book_link 0
review_link 42
dtype: int64
What can we do to replace these missing values? Here’s what I thought of:
- book_cat — check the book and impute it manually ourselves, since it’s only one book
- year — leave it empty for now
- rating — replace with 0.0
- total_ratings — replace with 0
- description & review_link — replace with "None"
If you want to go a step further, you could write a script that iterates over the books, queries each title on sites like Amazon or Goodreads, and grabs the missing information. I won’t do that here, but you’re welcome to try it out!
If you want to brush up your data cleaning skills, check out our Data Cleaning using Python article.
Let’s start with the book with the missing category!
We can pull up the specific row where book_cat is null. We can also bring up the existing categories and figure out which one suits this particular book.
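A minimal sketch of how you might do that (these exact calls are my own, not from the original notebook):
df_books[df_books['book_cat'].isnull()]  # the one book with a missing category
df_books['book_cat'].unique()            # existing categories to choose from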
Since it’s under Artificial Intelligence, I chose to replace it with that.
df_books.fillna({'rating': '0.0'}, inplace=True)
df_books.fillna({'total_ratings':'0'}, inplace=True)
df_books.fillna({'book_cat': 'Artificial Intelligence'}, inplace=True)
df_books.fillna({'description':'None'}, inplace=True)
df_books.fillna({'review_link':'None'}, inplace=True)
df_books.isnull().sum()
title 0
book_cat 0
author 0
year 17
rating 0
total_ratings 0
description 0
book_link 0
review_link 0
dtype: int64
After replacing the missing values in the other columns, we only have the year column with missing values, which I’ll leave empty.
Data Transformation
Next up, we transform the data type of some of our columns.
Columns to convert:
- year → Int64 (a nullable integer)
- rating → float
- total_ratings → integer
Pandas has a useful function, convert_dtypes(), which converts columns to the best possible dtypes. It’s not very helpful in our case, since all our columns are objects, but it does convert them all to strings.
# data transformation
df_books = df_books.convert_dtypes()  # convert all columns to string
# convert to numeric types
df_books['year'] = df_books['year'].astype('Int64')
df_books['rating'] = df_books['rating'].astype('float64')
df_books['total_ratings'] = df_books['total_ratings'].astype('Int64')
df_books.dtypes
title string
book_cat string
author string
year Int64
rating float64
total_ratings Int64
description string
book_link string
review_link string
dtype: object
For year, we convert it to Int64, which supports NA values. We do the same for total_ratings, while rating becomes float64.
Now our data is ready, and it’s time to visualize it!
Exploratory Data Analysis
Let’s visualize our data and see if we can find anything interesting from these 100 books.
For text data, I decided to build a word cloud plotting function.
def plot_wordcloud(text, file_name, stopwords_list=[], max_words=500):
    # create stopword list
    stopwords = set(STOPWORDS)
    stopwords.update(stopwords_list)

    # generate word cloud
    wordcloud = WordCloud(width=1000, height=600,
                          stopwords=stopwords,
                          max_words=max_words,
                          background_color="white").generate(text)

    # save the image to a file
    wordcloud.to_file(file_name + ".png")

    # display the generated image
    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off");
Word cloud of book titles
By joining all the texts in the title column, we can count how many unique words there are.
text = " ".join(title for title in df_books.title)
print("There are {} words in the combination of all titles.".format(len(set(text.split(" ")))))
There are 211 words in the combination of all titles.
With the text ready, we can plot our word cloud.
plot_wordcloud(text, "100ds_titles")
We see that the words Python and Data are predominant, with Machine Learning and Learning coming in close. This makes sense since Python is the most popular language for Data Science.
Word cloud of book descriptions
text = " ".join(desc for desc in df_books.description)
print("There are {} words in the combination of all description.".format(len(set(text.split(" ")))))
There are 990 words in the combination of all description.
There are over 990 individual words in the book descriptions.
Let’s see what we get when we plot the word cloud.
This column had “None” imputed, so let’s add it to our list of stop words.
plot_wordcloud(text, "100ds_book_descriptions", ['None'], 1000) # add None to stopwords
We see the word data is highly prevalent, along with book and programming. The words Python and introduction come in close behind. This suggests the books we scraped are mostly introductory programming books related to data, many of them using Python.
Book category
Our books had many categories; which one was the most common?
sns.histplot(data=df_books, y='book_cat', discrete=True);
From our plot, we see Data Mining and Machine Learning were the most common, with Learning Languages coming in second.
Book year
We can also plot the year counts and find out in which year the most books were released.
sns.histplot(data=df_books, x='year', discrete=True);
From our plot, we see 2015 was the most common year in our list of books, with exactly 18 releases.
Book Rating and Total Ratings
Since the rating and total_ratings columns had a lot of missing data, 39 out of 97 rows (about 40%), we can expect the data to be quite skewed.
Calling describe on the columns, we see 4.6 is the highest book rating, and the maximum number of ratings is 1659.
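For reference, a sketch of the call that produces those summary statistics:
df_books[['rating', 'total_ratings']].describe()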
Plotting a histogram of the columns:
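A minimal sketch of the histogram call (the exact styling may differ from the original):
df_books[['rating', 'total_ratings']].hist(bins=20);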
Considering we have quite a lot of missing data and the fact we have very little data, we see that the distribution is pretty skewed.
This is more evident when we plot a boxplot.
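Something like the following produces the boxplot for total_ratings (my own sketch):
sns.boxplot(x=df_books['total_ratings']);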
For total_ratings, notice how three points are extreme outliers.
We can also plot both these columns together with a strip plot.
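A sketch of the strip plot, assuming rating on the x-axis and total_ratings on the y-axis:
sns.stripplot(data=df_books, x='rating', y='total_ratings');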
We see that there are three books with a high rating and total number of ratings.
Which book to read?
Let’s say you stumble upon these 100 books and have no idea what to read; why not look at the rating and total ratings to help you decide? Just like with movie ratings, we can usually trust the consensus on whether a book is good.
By scraping the data and visualizing it, we gain insights that help us make decisions, which is essentially what data science is all about.
Which are the three books with high total ratings and ratings from our plot?
Based on the last plot, we saw three data points with total ratings above 1500 and ratings above 4.0 (4.2, to be exact).
df_books[(df_books['total_ratings'] > 1500) & (df_books['rating'] > 4.0)].iloc[:, :6]
Voila! The three books are:
- Automate the Boring Stuff with Python
- An Introduction to Statistical Learning
- Pattern Recognition and Machine Learning
What are the top 10 books in total ratings?
We can sort our data frame by any column, so let’s see what the top 10 books are by total ratings.
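Mirroring the sort_values call used in the next section, a sketch of the sort by total_ratings:
df_books.sort_values(by='total_ratings', ascending=False)[:10].iloc[:, :6]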
Aside from the top three books above, we see more popular data science books, namely Natural Language Processing with Python, Python for Everybody, and Artificial Intelligence: A Modern Approach.
What are the top 10 books in terms of rating and total ratings?
How about sorting by rating and then total_ratings?
df_books.sort_values(by=['rating', 'total_ratings'], ascending=False)[:10].iloc[:, :6]
We see that some new books popped up on the list. However, some of them have very few ratings; for example, Elementary Differential Equations has only 5 ratings, so it’s hard to say whether we can trust that it’s good.
Can you trust the results?
To be honest, our data set is tiny. With only around 100 books, and roughly 40 of the rating and total-ratings values missing, our results will be biased. To make a truly data-driven decision, we should increase our sample size and fill in the missing data with more scraping.
If you want to take this even further, you can also calculate the weighted rating of the books the same way IMDB ranks their top films.
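A minimal sketch of that weighted rating, using the common formula WR = v/(v+m)·R + m/(v+m)·C, where R is a book’s rating, v its number of ratings, C the mean rating across all books, and m a minimum-ratings threshold (the 75th percentile here is my own choice):
C = df_books['rating'].mean()                 # mean rating across all books
m = df_books['total_ratings'].quantile(0.75)  # minimum-ratings threshold (assumption)
v = df_books['total_ratings']
R = df_books['rating']
df_books['weighted_rating'] = (v / (v + m)) * R + (m / (v + m)) * C
df_books.sort_values(by='weighted_rating', ascending=False)[:10].iloc[:, :6]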
Interactive Data Table
If you’re in Google Colab, you can run this command to get an interactive table and do more exploration, like sorting and filtering each column.
data_table.DataTable(df_books, include_index=False, num_rows_per_page=5)
You now have all 97 books in a data frame.
You can export it to a CSV file like this:
df_books.to_csv('100_DS_books.csv', index=False)
A cool thing you can do with this CSV file is iterate over the book_link column and download all the books to your computer.
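A hypothetical sketch of such a downloader (the filename handling is my own, and only direct PDF links are attempted):
from urllib.request import urlretrieve

books = pd.read_csv('100_DS_books.csv')
for _, row in books.iterrows():
    link = str(row['book_link'])
    if link.endswith('.pdf'):  # only grab direct PDF links
        filename = row['title'].replace('/', '-')[:50] + '.pdf'  # crude filename sanitizing
        try:
            urlretrieve(link, filename)
        except Exception as e:
            print(f"Couldn't download {row['title']}: {e}")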
Thanks for reading!
That’s all for this article, and I hope you got a glimpse of the power of web scraping!
Here are some resources and tutorials on web scraping:
Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!