Data Science
Data Cleaning with Python
A guide to data cleaning using the Airbnb NY data set
It is widely known that data scientists spend a lot of their time cleaning data. You might even have heard that data scientists spend 80% of their time finding, cleaning, and reorganizing data, and only 20% analyzing it and producing insights.
What is Data Cleaning?
According to Wikipedia, data cleaning is:
the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
From this definition, we can say that the aspects of clean data are:
- Complete
- Correct
- Accurate
- Relevant
Why is it important?
Data cleaning is a crucial step in the data science pipeline, as the insights and results you produce are only as good as the data you have. As the old adage goes: garbage in, garbage out.
Using dirty data for analysis will result in erroneous predictions that engender bad decisions and dangerous outcomes. On top of that, most machine learning algorithms only work when your data is properly cleaned and fit for modeling.
How to clean data
With all that said, in this article I will share a few common tasks you will perform in data cleaning, using the New York City Airbnb Open Data from Kaggle as an example. Note that the copy of the dataset I used might differ slightly from the one on Kaggle.
In particular, this article is centered around these three topics:
- Missing data
- Inconsistent data/Irrelevant features
- Outliers
I hope you find this article practical and useful in your learning, or your work.
You can find the notebook with the code I use in this article hosted on Jovian.ai here.
I recommend downloading that notebook and experimenting with it yourself so you can get hands-on with this article.
Now let’s dive in!
Importing Libraries
We start by importing the necessary libraries for data cleaning.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
Load the data
Then we load the data. In my case, I loaded it from a CSV file hosted on GitHub, but you can also upload the CSV file and import the data using pd.read_csv().
Notice that I copy the original dataset using .copy(). This is for a data cleaning example later in the article. It's also good practice to copy the dataset when you want to test something out, so you won't have to re-run the entire notebook if you make a mistake somewhere.
airbnb_url = 'https://raw.githubusercontent.com/ManarOmar/New-York-Airbnb-2019/master/AB_NYC_2019.csv'
airbnb_ori = pd.read_csv(airbnb_url)
airbnb = airbnb_ori.copy()
Exploratory Data Analysis
Running head() gives us a peek at our dataset. It doesn't tell us much beyond what the data looks like, but it's a useful first check.
airbnb.head()
 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Data info
Calling info() on our dataset tells us a lot about the data frame, like its shape (rows, columns), the data types of our features, and the memory usage.
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 48895 non-null int64
1 name 48879 non-null object
2 host_id 48895 non-null int64
3 host_name 48874 non-null object
4 neighbourhood_group 48895 non-null object
5 neighbourhood 48895 non-null object
6 latitude 48895 non-null float64
7 longitude 48895 non-null float64
8 room_type 48895 non-null object
9 price 48895 non-null int64
10 minimum_nights 48895 non-null int64
11 number_of_reviews 48895 non-null int64
12 last_review 38843 non-null object
13 reviews_per_month 38843 non-null float64
14 calculated_host_listings_count 48895 non-null int64
15 availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
Data Type
Separating our features into numerical and categorical early on is useful, so here is a helper function that does that. It uses select_dtypes() to treat columns with the 'object' data type as categorical and the rest as numerical.
cat_df = airbnb.select_dtypes(include=['object'])
num_df = airbnb.select_dtypes(exclude=['object'])
def printColumnTypes(non_numeric_df, numeric_df):
    '''separates non-numeric and numeric columns'''
    print("Non-Numeric columns:")
    for col in non_numeric_df:
        print(f"{col}")
    print("")
    print("Numeric columns:")
    for col in numeric_df:
        print(f"{col}")
printColumnTypes(cat_df, num_df)
Non-Numeric columns:
name
host_name
neighbourhood_group
neighbourhood
room_type
last_review
Numeric columns:
id
host_id
latitude
longitude
price
minimum_nights
number_of_reviews
reviews_per_month
calculated_host_listings_count
availability_365
Now that we have some idea of our data, we can move on to cleaning it, starting with missing values.
Missing Data
Missing data is common in all kinds of data and is tricky to deal with. Most machine learning techniques do not work with missing values, so they have to be addressed early on.
The two common methods to deal with missing values are
- dropping the rows / columns
- imputing them based on other observations, e.g. with the mean or median
There are a few problems with these approaches.
By dropping rows or columns, you're essentially losing information that might be useful for prediction.
On the other hand, imputing values will introduce bias into your data, but it still might be better than removing features entirely.
Here is a great analogy for this dilemma in this article by Elite Data Science.
Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.
How to properly deal with missing data depends on the context of the problem you're trying to solve, and it's recommended to experiment with different methods to effectively utilize the data you have in hand.
Here are a couple useful helper functions you can use to visualize missing values.
Columns with missing values
This function prints out only the columns that have missing values, along with how many values are missing in each.
If you want to see missing values for all columns, use this command:
df.isnull().sum()
def missing_cols(df):
    '''prints out columns with its amount of missing values'''
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        total += missing_vals
        if missing_vals != 0:
            print(f"{col} => {missing_vals}")
    if total == 0:
        print("no missing values left")
missing_cols(airbnb)
name => 16
host_name => 21
last_review => 10052
reviews_per_month => 10052
Percentage missing
This gives you the percentage of missing values in each column. Knowing the percentage can be useful in determining whether you should drop the column.
The percentage is calculated by taking the mean of the boolean mask returned by isna().
def perc_missing(df):
    '''prints out columns with missing values with its %'''
    for col in df.columns:
        pct = df[col].isna().mean() * 100
        if pct != 0:
            print('{} => {}%'.format(col, round(pct, 2)))
perc_missing(airbnb)
name => 0.03%
host_name => 0.04%
last_review => 20.56%
reviews_per_month => 20.56%
Heatmap of missing values
Heatmaps are also useful for visualizing missing values, in particular where in the dataset the missing values occur.
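Here is a minimal sketch of such a heatmap, assuming the airbnb data frame and the seaborn/matplotlib imports from earlier (the figure size and title are just illustrative choices):

# heatmap of the missing-value mask: each column is a feature,
# and highlighted cells mark rows where that feature is missing
plt.figure(figsize=(12, 6))
sns.heatmap(airbnb.isnull(), cbar=False, yticklabels=False)
plt.title('Missing values in the Airbnb dataset')
plt.show()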
Now that you know which columns have missing values, it’s time to deal with them.
Techniques to deal with missing values
The common techniques I will be sharing are:
1. Drop the feature
2. Drop the row
3. Impute the missing value
4. Replace it
Dropping the feature
Dropping a feature usually isn't recommended because you're losing information. But if you're sure that the column isn't important, or it simply has too many missing values, you can choose to drop it. For example, for this dataset, the host_name column was removed for ethical reasons, and id was removed because it was unnecessary.
To drop features, use drop with axis set to 1 and inplace set to True. The axis is 1 because we want to drop columns (0 means rows), and inplace=True applies the change directly to the dataset.
# drop columns that aren't needed for this analysis
colsToDrop = ['id','host_name','last_review']
airbnb.drop(colsToDrop, axis=1, inplace=True)
missing_cols(airbnb)
name => 16
reviews_per_month => 10052
Dropping the row
If you want to remove rows with missing values, you can use dropna. I'm not going to run it here, because there are no missing values in price, but the example below shows how.
# remove rows with missing values in price (a no-op here, since price has no missing values)
airbnb.dropna(subset=['price'], inplace=True)
Imputing
For imputing, there are 3 main techniques shown below.
- fillna: fills in null values with a given value (the mean, median, mode, or a specified value)
- bfill/ffill: backward fill and forward fill (fill a missing value with the value after or before it in the same column)
- SimpleImputer: scikit-learn's built-in imputer for missing values (commonly used alongside a pipeline when building ML models)
Below you can find examples of applying these methods to the price
column if it had missing values.
# imputing price with mean
price_mean_value = round(airbnb['price'].mean(), 2)
airbnb['price'].fillna(price_mean_value, inplace=True)
# imputing price with median
price_median_value = round(airbnb['price'].median(), 2)
airbnb['price'].fillna(price_median_value, inplace=True)
# imputing with bfill or ffill
airbnb['price'].bfill(inplace=True)
airbnb['price'].ffill(inplace=True)
# imputing with SimpleImputer from the sklearn library
from sklearn.impute import SimpleImputer
# define the imputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean') # or median
airbnb[['price']] = imr.fit_transform(airbnb[['price']])
# use strategy = 'most_frequent' for categorical data
Replace
To replace missing values, the fillna function is also used, this time with a dictionary.
The key is the column name and the value is the replacement for the missing entries: {column_name: replacement_for_NA}
Here are examples for replacing null values in the reviews_per_month and name columns.
# replace null values in reviews_per_month with 0
airbnb.fillna({'reviews_per_month':0}, inplace=True)
missing_cols(airbnb)
name => 16
# replace null values in name with 'None'
airbnb.fillna({'name':'None'}, inplace=True)
missing_cols(airbnb)
no missing values left
Hooray we have no missing values left!
Now let’s move on to dealing with inconsistent or irrelevant features.
Inconsistent data/Irrelevant features
Inconsistent data refers to things like spelling errors in your data, column names that are not relevant to the data, the wrong data type, etc.
Here are a couple examples for dealing with these issues.
Remove rows based on regex
Let's say you want to remove rows that contain a certain word. For my example, I chose the word noisy/Noisy as my target, and I used str.contains() to find the rows whose name contains that word.
Then, using the drop function with axis set to 'index', I can supply those indexes and drop the rows.
Printing out the number of rows, you can see it reduced by three.
# example: remove rows that contain the target word
target = '[Nn]oisy'
noisy_airbnb = airbnb[airbnb['name'].str.contains(target, regex=True)]
# show rows whose name contains the word noisy
print(noisy_airbnb['name'])
# get the index that contains the word noisy
index_to_drop = noisy_airbnb['name'].index
# print(index_to_drop)
19733 Noisy room next to Prospect Park
23857 Small Homey Studio in a Big Noisy City
35343 My wife is not used to the noisy environment. How
Name: name, dtype: object
# drop rows based on index
airbnb.drop(index_to_drop, axis='index', inplace=True)
print(len(airbnb_ori))
print(len(airbnb))
48895
48892
Spelling errors in categorical data
Sometimes your categorical data might have spelling errors or different capitalization that can mess up your categorization.
I will be using the neighbourhood_group
column as an example.
airbnb['neighbourhood_group'].value_counts()
Manhattan 21661
Brooklyn 20104
Queens 5666
Bronx 1091
Staten Island 373
Name: neighbourhood_group, dtype: int64
You can see the different types of neighborhoods are already well categorized. But what if it wasn’t?
To simulate a scenario where some of the data had capitalization or spelling issues, I sampled 2 rows from the data, and replaced them with the wrong spelling.
You can see now how the categorization is messed up. “Manhattan” and “manhatann” refer to the same borough, but they aren't in the same category because of the misspelling and capitalization. The same goes for “brookln”.
We can fix this by using the replace function in pandas. We first give the values that are wrong, then supply the right ones. Notice the values have to match positionally across the two lists, i.e. “manhatann” → “Manhattan”.
random_index = airbnb.sample(2, random_state = 10).index
# airbnb['neighbourhood_group'].loc[random_index]
## we randomly selected Manhattan and Brooklyn
wrong_spelling = ['manhatann', 'brookln']
# replace them with the wrong spelling
airbnb.loc[random_index,'neighbourhood_group'] = wrong_spelling
airbnb['neighbourhood_group'].value_counts()
Manhattan 21660
Brooklyn 20103
Queens 5666
Bronx 1091
Staten Island 373
manhatann 1
brookln 1
Name: neighbourhood_group, dtype: int64
airbnb['neighbourhood_group'].replace(['manhatann', 'brookln'],
['Manhattan', 'Brooklyn'], inplace=True)
airbnb['neighbourhood_group'].value_counts()
Manhattan 21661
Brooklyn 20104
Queens 5666
Bronx 1091
Staten Island 373
Name: neighbourhood_group, dtype: int64
Now our groups are fixed again!
Renaming columns
There are cases where you want to rename your columns as well.
You can do this with a dictionary, setting the key as the original column name and the value as the new column name.
Then we pass our dictionary to the rename function and voila, the column names have changed.
new_names = {'name':'listing_name', 'latitude':'lat', 'longitude':'long'}
airbnb.rename(columns=new_names, inplace=True)
airbnb.head()
 | listing_name | host_id | neighbourhood_group | neighbourhood | lat | long | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Clean & quiet apt home by the park | 2787 | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 |
1 | Skylit Midtown Castle | 2845 | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 |
2 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | 0.00 | 1 | 365 |
3 | Cozy Entire Floor of Brownstone | 4869 | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | 1 | 194 |
4 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.10 | 1 | 0 |
Converting to DateTime
If you have data that should be datetime objects but is stored as strings, you can use pd.to_datetime and pass it the format that matches your data.
Just like that, the column is converted into a datetime data type.
airbnb_ori['last_review'] = pd.to_datetime(airbnb_ori['last_review'], format='%Y-%m-%d')
airbnb_ori['last_review'].dtype.type
numpy.datetime64
Duplicates
There are cases where your rows have duplicate values; this could have happened due to some mishaps in data collection.
To find out if you have duplicated values, call duplicated().any() on your data frame, and if it returns True, use the drop_duplicates function.
You can also restrict the duplicate check to specific columns with the subset parameter, as shown below.
airbnb.duplicated().any()
## if true
# airbnb.drop_duplicates()
## to drop duplicates considering only specific columns
# airbnb.drop_duplicates(subset=['col_name'], inplace=True)
False
Change data type to reduce memory
Changing data types is common if you want to reduce memory usage.
To do so, you can use the astype('dtype') function, where you specify the dtype you want.
In my example, I changed the data type of the host_id column from int64 to int32.
Observe the memory usage before changing the data type.
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 48894
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_name 48892 non-null object
1 host_id 48892 non-null int64
2 neighbourhood_group 48892 non-null object
3 neighbourhood 48892 non-null object
4 lat 48892 non-null float64
5 long 48892 non-null float64
6 room_type 48892 non-null object
7 price 48892 non-null int64
8 minimum_nights 48892 non-null int64
9 number_of_reviews 48892 non-null int64
10 reviews_per_month 48892 non-null float64
11 calculated_host_listings_count 48892 non-null int64
12 availability_365 48892 non-null int64
dtypes: float64(3), int64(6), object(4)
memory usage: 6.5+ MB
And after.
airbnb['host_id'] = airbnb['host_id'].astype('int32')
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 48894
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_name 48892 non-null object
1 host_id 48892 non-null int32
2 neighbourhood_group 48892 non-null object
3 neighbourhood 48892 non-null object
4 lat 48892 non-null float64
5 long 48892 non-null float64
6 room_type 48892 non-null object
7 price 48892 non-null int64
8 minimum_nights 48892 non-null int64
9 number_of_reviews 48892 non-null int64
10 reviews_per_month 48892 non-null float64
11 calculated_host_listings_count 48892 non-null int64
12 availability_365 48892 non-null int64
dtypes: float64(3), int32(1), int64(5), object(4)
memory usage: 6.3+ MB
You can see the memory reduced from 6.5+ to 6.3+ MB.
Here is more information on changing data types.
That’s all I have on inconsistent data, let’s move on to outliers.
Outliers
Outliers can be dangerous as they can skew your model and give you predictions that are biased and erroneous.
A quick way to find outliers is to use the describe function and look at statistics such as the maximum, mean, and quartiles.
airbnb['price'].describe()
count 48892.000000
mean 152.714023
std 240.156106
min 0.000000
25% 69.000000
50% 106.000000
75% 175.000000
max 10000.000000
Name: price, dtype: float64
You can also plot a histogram and look at the distribution of your data. In the histogram of price, you can see that most of the data falls roughly between 0 and 5,000.
A boxplot is also useful in detecting outliers. As you can see, the price column has multiple data points that are outliers (above the upper whisker of the boxplot).
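Here is a minimal sketch of both plots, assuming the same airbnb data frame and the matplotlib/seaborn imports from the start of the article (the bin count is an arbitrary choice):

# histogram of price to inspect the distribution
airbnb['price'].hist(bins=50)
plt.xlabel('price')
plt.ylabel('count')
plt.show()

# boxplot of price: points beyond the upper whisker are potential outliers
sns.boxplot(x=airbnb['price'])
plt.show()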
For categorical data, you can plot a bar chart to view the count of each category.
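As a quick sketch, using the neighbourhood_group column and the same imports as above:

# bar chart of category counts for a categorical column
airbnb['neighbourhood_group'].value_counts().plot(kind='bar')
plt.ylabel('count')
plt.show()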
Outliers in categorical data are tricky, because you have to determine whether it's appropriate to call something an outlier based on context.
Some outliers are more obvious. Let’s say there’s an experiment done where 1000 people choose between a glass of water and a glass of milk. If the final result is 1 person who chose a glass of water, and 999 people choosing a glass of milk, that 1 person can be considered an outlier.
However, in some cases, outliers depend on context. In my example, you can see that Manhattan and Brooklyn have significantly more data than Staten Island. This doesn't count as an outlier, since Manhattan and Brooklyn have a much higher housing density than Staten Island.
Dealing with outliers
Dealing with outliers is similar to dealing with missing values; the main difference is the way you find the outliers.
To categorize numerical values as outliers, there are statistical techniques such as the standard deviation rule and the interquartile range (IQR). You can refer to this article for code examples.
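As an illustration, here is a minimal sketch of flagging price outliers with the IQR rule; the 1.5 multiplier is the common convention, not something specific to this dataset:

# flag price outliers using the interquartile range (IQR) rule
q1 = airbnb['price'].quantile(0.25)
q3 = airbnb['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
price_outliers = airbnb[(airbnb['price'] < lower) | (airbnb['price'] > upper)]
print(f"{len(price_outliers)} rows flagged as price outliers")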
For categorical values with very low frequency (like Staten Island in the example above), the imbalance may still become a problem for your model. Depending on the context and nature of the data, you can choose to group them into one category, for example “Others”. This way your model will be less biased and you're not losing any information.
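Here is one way you might do that, sketched on the neighbourhood_group column with a hypothetical cutoff of 1,000 listings (both the column and the threshold are just for illustration):

# group rare categories into a single 'Others' bucket
counts = airbnb['neighbourhood_group'].value_counts()
rare = counts[counts < 1000].index
grouped = airbnb['neighbourhood_group'].where(
    ~airbnb['neighbourhood_group'].isin(rare), 'Others')
print(grouped.value_counts())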
The importance of documenting
For all the data cleaning tasks you saw above, it's important to document your process, i.e. what tools you used, what functions you created, and your overall approach.
This is so that others can understand what you did, and it can eventually become the culture or process for how your organization deals with dirty data.
Learn by doing
The best way to learn is through hands-on practice. This definitely applies to data cleaning, because different datasets require different methods to clean.
That said, you’re in luck! We have a new competition released — the Viral Tweets Prediction Challenge.
This is the perfect opportunity for you to practice data cleaning, as the dataset consists of tweets and will be an interesting challenge to tackle. This competition ends on July 6, 2021, so sign up now and test your skills today!
That’s all for this article, thanks for reading and I hope you learned something new from it!
Resources
Here are more articles on the topic of data cleaning
Join our ✨ discord server and hang out with other data scientists around the world!
Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!