Data Science
Data Cleaning with Python
A guide to data cleaning using the Airbnb NY data set
It is widely known that data scientists spend a lot of their time cleaning data. You might even have heard that data scientists spend 80% of their time finding, cleaning, and reorganizing data, and only 20% analyzing it and producing insights.
What is Data Cleaning?
According to Wikipedia, data cleaning is:
the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
From this definition, we can say that the aspects of clean data are:
- Complete
- Correct
- Accurate
- Relevant
Why is it important?
Data cleaning is a crucial step in the data science pipeline, as the insights and results you produce are only as good as the data you have. As the old adage goes: garbage in, garbage out.
Using dirty data for analysis will result in erroneous predictions that engender bad decisions and dangerous outcomes. On top of that, most machine learning algorithms only work when your data is properly cleaned and fit for modeling.
How to clean data
With all that said, in this article I will share a few common tasks you will perform in data cleaning, using the New York City Airbnb Open Data from Kaggle as an example. Note that the copy of the dataset I used might differ slightly from the one on Kaggle.
In particular, this article is centered around these three topics:
- Missing data
- Inconsistent data/Irrelevant features
- Outliers
I hope you find this article practical and useful in your learning, or your work.
You can find the notebook with the code I use in this article hosted on Jovian.ai here.
I recommend downloading that notebook and experimenting with it yourself so you can get hands-on with this article.
Now let’s dive in!
Importing Libraries
We start by importing the necessary libraries for data cleaning.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
Load the data
Then we load the data. In my case, I loaded it from a CSV file hosted on GitHub, but you can also upload the CSV file and import the data using pd.read_csv().
Notice that I copy the original dataset using .copy(). This is for a data cleaning example later in the article. It's also good practice to copy the dataset when you want to test something out, so you won't have to re-run the entire notebook if you make a mistake somewhere.
airbnb_url = 'https://raw.githubusercontent.com/ManarOmar/New-York-Airbnb-2019/master/AB_NYC_2019.csv'
airbnb_ori = pd.read_csv(airbnb_url)
airbnb = airbnb_ori.copy()
Exploratory Data Analysis
Running head() gives us a peek at our dataset. It doesn't tell us much beyond what the data looks like, but it's a useful first check.
airbnb.head()
 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Data info
Calling info() on our dataset tells us a lot about the data frame, like its shape (rows, columns), the data types of our features, and the memory usage.
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 48895 non-null int64
1 name 48879 non-null object
2 host_id 48895 non-null int64
3 host_name 48874 non-null object
4 neighbourhood_group 48895 non-null object
5 neighbourhood 48895 non-null object
6 latitude 48895 non-null float64
7 longitude 48895 non-null float64
8 room_type 48895 non-null object
9 price 48895 non-null int64
10 minimum_nights 48895 non-null int64
11 number_of_reviews 48895 non-null int64
12 last_review 38843 non-null object
13 reviews_per_month 38843 non-null float64
14 calculated_host_listings_count 48895 non-null int64
15 availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
Data Type
Separating our features into numerical and categorical early on is useful, so here is a helper function that does that. It uses select_dtypes() to treat columns with the 'object' data type as categorical and the rest as numerical.
cat_df = airbnb.select_dtypes(include=['object'])
num_df = airbnb.select_dtypes(exclude=['object'])
def printColumnTypes(non_numeric_df, numeric_df):
    '''separates non-numeric and numeric columns'''
    print("Non-Numeric columns:")
    for col in non_numeric_df:
        print(f"{col}")
    print("")
    print("Numeric columns:")
    for col in numeric_df:
        print(f"{col}")
printColumnTypes(cat_df, num_df)
Non-Numeric columns:
name
host_name
neighbourhood_group
neighbourhood
room_type
last_review
Numeric columns:
id
host_id
latitude
longitude
price
minimum_nights
number_of_reviews
reviews_per_month
calculated_host_listings_count
availability_365
Now that we have some idea of our data, we can move on to cleaning it, starting with missing values.
Missing Data
Missing data is common in all kinds of data and is tricky to deal with. Most machine learning techniques do not work with missing values, so they have to be addressed early on.
The two common methods to deal with missing values are
- dropping the rows / columns
- imputing them based on other observations, e.g. with the mean or median
There are a few problems with these approaches.
By dropping rows or columns, you're essentially losing information that might be useful for prediction.
On the other hand, imputing values will introduce bias into your data, but it still might be better than removing features entirely.
Here is a great analogy for this dilemma in this article by Elite Data Science.
Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.
How to properly deal with missing data depends on the context of the problem you're trying to solve, and it's recommended to experiment with different methods to effectively utilize the data you have in hand.
Here are a couple useful helper functions you can use to visualize missing values.
Columns with missing values
This function prints out only the columns that have missing values, along with how many values are missing in each.
If you want to see missing values for all columns, use this command:
df.isnull().sum()
def missing_cols(df):
    '''prints out columns with its amount of missing values'''
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        total += missing_vals
        if missing_vals != 0:
            print(f"{col} => {missing_vals}")
    if total == 0:
        print("no missing values left")
missing_cols(airbnb)
name => 16
host_name => 21
last_review => 10052
reviews_per_month => 10052
Percentage missing
This gives you the percentage of missing values in each column. Knowing the percentage can be useful in determining whether you should drop the column.
The percentage is calculated by taking the mean of the boolean mask returned by isna().
def perc_missing(df):
    '''prints out columns with missing values with its %'''
    for col in df.columns:
        pct = df[col].isna().mean() * 100
        if pct != 0:
            print('{} => {}%'.format(col, round(pct, 2)))
perc_missing(airbnb)
name => 0.03%
host_name => 0.04%
last_review => 20.56%
reviews_per_month => 20.56%
Heatmap of missing values
Heatmaps are also useful for visualizing missing values, in particular where in the dataset the missing values occur.
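Here is a minimal sketch of such a heatmap, assuming the airbnb data frame and the seaborn/matplotlib imports from earlier (the figure size and title are just illustrative choices):

# heatmap of the missing-value mask: each column is a feature,
# and highlighted cells mark rows where that feature is missing
plt.figure(figsize=(12, 6))
sns.heatmap(airbnb.isnull(), cbar=False, yticklabels=False)
plt.title('Missing values in the Airbnb dataset')
plt.show()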
Now that you know which columns have missing values, it’s time to deal with them.
Techniques to deal with missing values
The common techniques I will be sharing are:
1. Drop the feature
2. Drop the row
3. Impute the missing value
4. Replace it
Dropping the feature
Dropping a feature usually isn't recommended because you're losing information. But if you're sure that the column isn't important, or it simply has too many missing values, you can choose to drop it. For example, for this dataset, the host_name column was removed for ethical reasons, and id was removed because it was unnecessary.
To drop features, use drop with axis set to 1 and inplace set to True. The axis is 1 because we want to drop columns (0 means rows), and inplace=True applies the change directly to the dataset.
# drop columns that aren't needed for this analysis
colsToDrop = ['id','host_name','last_review']
airbnb.drop(colsToDrop, axis=1, inplace=True)
missing_cols(airbnb)
name => 16
reviews_per_month => 10052
Dropping the row
If you want to remove rows with missing values, you can use dropna. I'm not going to run it here, because there are no missing values in price, but the example below shows how.
# remove rows with missing values in price (a no-op here, since price has no missing values)
airbnb.dropna(subset=['price'], inplace=True)
Imputing
For imputing, there are 3 main techniques shown below.
- fillna: fills in null values with a given value (the mean, median, mode, or a specified value)
- bfill/ffill: backward fill and forward fill (fill a missing value with the value after or before it in the same column)
- SimpleImputer: scikit-learn's built-in imputer for missing values (commonly used alongside a pipeline when building ML models)
Below you can find examples of applying these methods to the price
column if it had missing values.
# imputing price with mean
price_mean_value = round(airbnb['price'].mean(), 2)
airbnb['price'].fillna(price_mean_value, inplace=True)
# imputing price with median
price_median_value = round(airbnb['price'].median(), 2)
airbnb['price'].fillna(price_median_value, inplace=True)
# imputing with bfill or ffill
airbnb['price'].bfill(inplace=True)
airbnb['price'].ffill(inplace=True)
# imputing with SimpleImputer from the sklearn library
from sklearn.impute import SimpleImputer
# define the imputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean') # or median
airbnb[['price']] = imr.fit_transform(airbnb[['price']])
# use strategy = 'most_frequent' for categorical data
Replace
To replace missing values, the fillna function is also used, this time with a dictionary.
The key is the column name and the value is the replacement for the missing entries: {column_name: replacement_for_NA}
Here are examples for replacing null values in the reviews_per_month and name columns.
# replace null values in reviews_per_month with 0
airbnb.fillna({'reviews_per_month':0}, inplace=True)
missing_cols(airbnb)
name => 16
# replace null values in name with 'None'
airbnb.fillna({'name':'None'}, inplace=True)
missing_cols(airbnb)
no missing values left
Hooray we have no missing values left!
Now let’s move on to dealing with inconsistent or irrelevant features.
Inconsistent data/Irrelevant features
Inconsistent data refers to things like spelling errors in your data, column names that are not relevant to the data, the wrong data type, etc.
Here are a couple examples for dealing with these issues.
Remove rows based on regex
Let's say you want to remove rows that contain a certain word. For my example, I chose the word noisy/Noisy as my target, and I used str.contains() to find the rows whose name contains that word.
Then, using the drop function with axis set to 'index', I can supply those indexes and drop the rows.
Printing out the number of rows, you can see it reduced by three.
# example: remove rows that contain the target word
target = '[Nn]oisy'
noisy_airbnb = airbnb[airbnb['name'].str.contains(target, regex=True)]
# show rows whose name contains the word noisy
print(noisy_airbnb['name'])
# get the index that contains the word noisy
index_to_drop = noisy_airbnb['name'].index
# print(index_to_drop)
19733 Noisy room next to Prospect Park
23857 Small Homey Studio in a Big Noisy City
35343 My wife is not used to the noisy environment. How
Name: name, dtype: object
# drop rows based on index
airbnb.drop(index_to_drop, axis='index', inplace=True)
print(len(airbnb_ori))
print(len(airbnb))
48895
48892
Spelling errors in categorical data
Sometimes your categorical data might have spelling errors or different capitalization that can mess up your categorization.
I will be using the neighbourhood_group
column as an example.
airbnb['neighbourhood_group'].value_counts()
Manhattan 21661
Brooklyn 20104
Queens 5666
Bronx 1091
Staten Island 373
Name: neighbourhood_group, dtype: int64
You can see the different types of neighborhoods are already well categorized. But what if it wasn’t?
To simulate a scenario where some of the data had capitalization or spelling issues, I sampled 2 rows from the data, and replaced them with the wrong spelling.
You can see now how the categorization is messed up. “Manhattan” and “manhatann” refer to the same borough, but they aren't in the same category because of the misspelling and capitalization. The same goes for “brookln”.
We can fix this by using the replace function in pandas. We first give the values that are wrong, then supply the right ones. Notice the values have to match positionally across the two lists, i.e. “manhatann” → “Manhattan”.
random_index = airbnb.sample(2, random_state = 10).index
# airbnb['neighbourhood_group'].loc[random_index]
## we randomly selected Manhattan and Brooklyn
wrong_spelling = ['manhatann', 'brookln']
# replace them with the wrong spelling
airbnb.loc[random_index,'neighbourhood_group'] = wrong_spelling
airbnb['neighbourhood_group'].value_counts()
Manhattan 21660
Brooklyn 20103
Queens 5666
Bronx 1091
Staten Island 373
manhatann 1
brookln 1
Name: neighbourhood_group, dtype: int64
airbnb['neighbourhood_group'].replace(['manhatann', 'brookln'],
['Manhattan', 'Brooklyn'], inplace=True)
airbnb['neighbourhood_group'].value_counts()
Manhattan 21661
Brooklyn 20104
Queens 5666
Bronx 1091
Staten Island 373
Name: neighbourhood_group, dtype: int64
Now our groups are fixed again!
Renaming columns
There are cases where you want to rename your columns as well.
You can do this with a dictionary, setting the key as the original column name and the value as the new column name.
Then we pass our dictionary to the rename function and voila, the column names have changed.
new_names = {'name':'listing_name', 'latitude':'lat', 'longitude':'long'}
airbnb.rename(columns=new_names, inplace=True)
airbnb.head()
 | listing_name | host_id | neighbourhood_group | neighbourhood | lat | long | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Clean & quiet apt home by the park | 2787 | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 |
1 | Skylit Midtown Castle | 2845 | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 |
2 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | 0.00 | 1 | 365 |
3 | Cozy Entire Floor of Brownstone | 4869 | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | 1 | 194 |
4 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.10 | 1 | 0 |
Converting to DateTime
If you have data that should be datetime objects but is stored as strings, you can use pd.to_datetime and pass it the format that matches your data.
Just like that, the column is converted into a datetime data type.
airbnb_ori['last_review'] = pd.to_datetime(airbnb_ori['last_review'], format='%Y-%m-%d')
airbnb_ori['last_review'].dtype.type
numpy.datetime64
Duplicates
There are cases where your rows have duplicate values; this could have happened due to some mishaps in data collection.
To find out if you have duplicated values, call duplicated().any() on your data frame, and if it returns True, use the drop_duplicates function.
You can also restrict the duplicate check to specific columns with the subset parameter, as shown below.
airbnb.duplicated().any()
## if true
# airbnb.drop_duplicates()
## to drop duplicates considering only specific columns
# airbnb.drop_duplicates(subset=['col_name'], inplace=True)
False
Change data type to reduce memory
Changing data types is common if you want to reduce memory usage.
To do so, you can use the astype('dtype') function, where you specify the dtype you want.
In my example, I changed the data type of the host_id column from int64 to int32.
Observe the memory usage before changing the data type.
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 48894
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_name 48892 non-null object
1 host_id 48892 non-null int64
2 neighbourhood_group 48892 non-null object
3 neighbourhood 48892 non-null object
4 lat 48892 non-null float64
5 long 48892 non-null float64
6 room_type 48892 non-null object
7 price 48892 non-null int64
8 minimum_nights 48892 non-null int64
9 number_of_reviews 48892 non-null int64
10 reviews_per_month 48892 non-null float64
11 calculated_host_listings_count 48892 non-null int64
12 availability_365 48892 non-null int64
dtypes: float64(3), int64(6), object(4)
memory usage: 6.5+ MB
And after.
airbnb['host_id'] = airbnb['host_id'].astype('int32')
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 48894
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_name 48892 non-null object
1 host_id 48892 non-null int32
2 neighbourhood_group 48892 non-null object
3 neighbourhood 48892 non-null object
4 lat 48892 non-null float64
5 long 48892 non-null float64
6 room_type 48892 non-null object
7 price 48892 non-null int64
8 minimum_nights 48892 non-null int64
9 number_of_reviews 48892 non-null int64
10 reviews_per_month 48892 non-null float64
11 calculated_host_listings_count 48892 non-null int64
12 availability_365 48892 non-null int64
dtypes: float64(3), int32(1), int64(5), object(4)
memory usage: 6.3+ MB
You can see the memory reduced from 6.5+ to 6.3+ MB.
Here is more information on changing data types.
That’s all I have on inconsistent data, let’s move on to outliers.
Outliers
Outliers can be dangerous as they can skew your model and give you predictions that are biased and erroneous.
A quick way to find outliers is to use the describe function and look at statistics such as the maximum, mean, and quartiles.
airbnb['price'].describe()
count 48892.000000
mean 152.714023
std 240.156106
min 0.000000
25% 69.000000
50% 106.000000
75% 175.000000
max 10000.000000
Name: price, dtype: float64
You can also plot a histogram and look at the distribution of your data. In the histogram of price, you can see that most of the data falls roughly between 0 and 5,000.
A boxplot is also useful in detecting outliers. As you can see, the price column has multiple data points that are outliers (above the upper whisker of the boxplot).
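Here is a minimal sketch of both plots, assuming the same airbnb data frame and the matplotlib/seaborn imports from the start of the article (the bin count is an arbitrary choice):

# histogram of price to inspect the distribution
airbnb['price'].hist(bins=50)
plt.xlabel('price')
plt.ylabel('count')
plt.show()

# boxplot of price: points beyond the upper whisker are potential outliers
sns.boxplot(x=airbnb['price'])
plt.show()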
For categorical data, you can plot a bar chart to view the count of each category.
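As a quick sketch, using the neighbourhood_group column and the same imports as above:

# bar chart of category counts for a categorical column
airbnb['neighbourhood_group'].value_counts().plot(kind='bar')
plt.ylabel('count')
plt.show()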
Outliers in categorical data are tricky, because you have to determine whether it's appropriate to call something an outlier based on context.
Some outliers are more obvious. Let’s say there’s an experiment done where 1000 people choose between a glass of water and a glass of milk. If the final result is 1 person who chose a glass of water, and 999 people choosing a glass of milk, that 1 person can be considered an outlier.
However, in some cases, outliers depend on context. In my example, you can see that Manhattan and Brooklyn have significantly more data than Staten Island. This doesn't count as an outlier, since Manhattan and Brooklyn have a much higher housing density than Staten Island.
Dealing with outliers
Dealing with outliers is similar to dealing with missing values; the main difference is the way you find the outliers.
To categorize numerical values as outliers, there are statistical techniques such as the standard deviation rule and the interquartile range (IQR). You can refer to this article for code examples.
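As an illustration, here is a minimal sketch of flagging price outliers with the IQR rule; the 1.5 multiplier is the common convention, not something specific to this dataset:

# flag price outliers using the interquartile range (IQR) rule
q1 = airbnb['price'].quantile(0.25)
q3 = airbnb['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
price_outliers = airbnb[(airbnb['price'] < lower) | (airbnb['price'] > upper)]
print(f"{len(price_outliers)} rows flagged as price outliers")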
For categorical values with very low frequency (like Staten Island in the example above), the imbalance may still become a problem for your model. Depending on the context and nature of the data, you can choose to group them into one category, for example “Others”. This way your model will be less biased and you're not losing any information.
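Here is one way you might do that, sketched on the neighbourhood_group column with a hypothetical cutoff of 1,000 listings (both the column and the threshold are just for illustration):

# group rare categories into a single 'Others' bucket
counts = airbnb['neighbourhood_group'].value_counts()
rare = counts[counts < 1000].index
grouped = airbnb['neighbourhood_group'].where(
    ~airbnb['neighbourhood_group'].isin(rare), 'Others')
print(grouped.value_counts())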
The importance of documenting
For all the data cleaning tasks you saw above, it's important to document your process, i.e. what tools you used, what functions you created, and your overall approach.
This is so that others can understand what you did, and it can eventually become the culture or process for how your organization deals with dirty data.
Learn by doing
The best way to learn is through hands-on practice. This definitely applies to data cleaning, because different datasets require different methods to clean.
That said, you’re in luck! We have a new competition released — the Viral Tweets Prediction Challenge.
This is the perfect opportunity for you to practice data cleaning, as the dataset consists of tweets and will be an interesting challenge to tackle. This competition ends on July 6, 2021, so sign up now and test your skills today!
That’s all for this article, thanks for reading and I hope you learned something new from it!
Resources
Here are more articles on the topic of data cleaning
Join our ✨ discord server and hang out with other data scientists around the world!
Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!