Data Science
Pandas snippets that come in handy in data analysis work
Pandas is a versatile and powerful library for data science. It’s like a Swiss Army knife for data work, packing a huge range of functions for the different tasks that come up when dealing with data.
To be effective with this tool, you need to know some tricks of the trade. In this article, I detail 41 useful pandas snippets that I use regularly.
If you already know the library, the following snippets may still add a few tricks to your toolbox.
If you are new to Pandas, working through these examples is a good way to get to know it.
The dataset used throughout this article is available on Kaggle.
Code for this article → Deepnote
· Reading data
∘ 1. Filter columns
∘ 2. Parse dates on read
∘ 3. Specify Data Types
∘ 4. Set index
∘ 5. No. of rows to read
∘ 6. Skip rows
∘ 7. Specify NA values
∘ 8. Setting boolean values
∘ 9. Read from multiple files
∘ 10. Copy and Paste into Data Frames
∘ 11. Read tables from PDF files
· Exploratory Data Analysis (EDA)
∘ 12. EDA cheat
· Data Types (dtypes)
∘ 13. Filter columns by dtype
∘ 14. Infer dtype
∘ 15. Downcasting
∘ 16. Manual conversion
∘ 17. Convert all at once
· Column operations
∘ 18. Renaming columns
∘ 19. Add suffix and prefix
∘ 20. Create new columns (Mutate in dplyr terms)
∘ 21. Insert columns at specific positions
∘ 22. if-then-else
∘ 23. Dropping columns
· String operations
∘ 24. Column names
∘ 25. Contains
∘ 26. findall
· Missing values
∘ 27. Checking
∘ 28. Dealing with missing values
· Date operations
∘ 29. Get X hours/days/weeks from today / ago
∘ 30. Filter between two dates
∘ 31. Filter by day/month/year
· Styling data frames
∘ 32. Number format
∘ 33. Let there be colors
· Misc
∘ 34. Get the index of max and min in a column
∘ 35. Apply function to data frame
∘ 36. Randomly shuffle data
∘ 37. Percent change
∘ 38. Assign rank
∘ 39. Check memory usage of data frame
∘ 40. Explode list values to multiple rows
∘ 41. Convert smaller categories to “Others”
Reading data
read_csv can do much more than just read in your data.
pd.read_csv("data.csv")
Here’s a taste of what it can do. (More in the docs.) All snippets below assume import pandas as pd and import numpy as np.
1. Filter columns
Only need a couple of columns from the dataset? Use usecols
pd.read_csv("data.csv", usecols=["date", "price"])
2. Parse dates on read
No need to call pd.to_datetime afterwards; parse dates on read!
pd.read_csv("data.csv", parse_dates=["date"])
3. Specify Data Types
Setting category data types at read time can save a ton of memory for low-cardinality columns!
pd.read_csv("data.csv", dtype={"house_type": "category"})
4. Set index
Setting an index is especially useful for time series data.
pd.read_csv("data.csv", index_col="date")
5. No. of rows to read
Don’t want to read in a dataset with millions of rows before having a peek at it? Use nrows!
pd.read_csv("data.csv", nrows=100)
6. Skip rows
Does your data set have rows with faulty data? Skip them!
pd.read_csv("data.csv", skiprows=[1, 5]) # skips line 1 and 5
pd.read_csv("data.csv", skiprows=100) # skips the first 100 lines
pd.read_csv("data.csv", skiprows=lambda x: x > 0 and np.random.rand() > 0.1) # skip 90% of the rows
7. Specify NA values
If your data has values that are supposed to be NA (e.g. ?), set them at read time so you won’t have to convert them later.
pd.read_csv("data.csv", na_values=["?"])
8. Setting boolean values
Have a boolean column in the form of yes and no? Tell pandas about it!
pd.read_csv("data.csv", true_values=["yes"], false_values=["no"])
9. Read from multiple files
Is your data in multiple files? Read them all in with glob!
import glob
files = glob.glob("file_*.csv")
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
10. Copy and Paste into Data Frames
Looking at some data on Excel but don’t want to download it? Copy it! Pandas can read from your clipboard.
df = pd.read_clipboard()
11. Read tables from PDF files
Need to read tables from PDF files? tabula-py has your back!
# %pip install tabula-py
from tabula import read_pdf
# read_pdf returns a list of DataFrames, one per detected table
dfs = read_pdf('test.pdf', pages='all')
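Since read_pdf hands back a list of DataFrames (dfs above), you can stack them into a single frame when the tables share the same columns. A minimal sketch, assuming a common schema:
combined = pd.concat(dfs, ignore_index=True)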
Exploratory Data Analysis (EDA)
12. EDA cheat
Want to visualize your dataset but don’t want to write code for plots? With pandas-profiling (since renamed to ydata-profiling), you can do it with just a few lines of code.
# %pip install pandas-profiling
import pandas_profiling
df = pd.read_csv("data.csv")
profile = df.profile_report(title="Pandas Profiling Report")
profile.to_file(output_file="output.html")
Data Types (dtypes)
The full list of Pandas dtypes is in the docs; here’s how to work with them.
13. Filter columns by dtype
# selecting
df.select_dtypes(include="number")
df.select_dtypes(include=["category", "datetime"])
# excluding
df.select_dtypes(exclude="object")
14. Infer dtype
Are your numeric columns read in as objects? Let pandas do the work in converting them!
df.infer_objects().dtypes
15. Downcasting
Pandas’ to_numeric has a nifty downcast option that shrinks a column to the smallest dtype that can hold its values, reducing the data frame’s size.
pd.to_numeric(df.numeric_col, downcast="integer") # smallest signed int dtype
pd.to_numeric(df.numeric_col, downcast="float") # smallest float dtype
16. Manual conversion
If a column contains values that can’t be parsed as numbers, errors="coerce" converts them to NaN instead of raising an error. You can then fill those NaN values with something reasonable using .fillna
# apply to whole data frame
df = df.apply(pd.to_numeric, errors="coerce")
# apply to specific columns
pd.to_numeric(df.numeric_column, errors="coerce")
# filling NA values with zero
pd.to_numeric(df.numeric_column, errors="coerce").fillna(0)
17. Convert all at once
df = df.astype(
{
"date": "datetime64[ns]",
"price": "int",
"is_weekend": "bool",
"status": "category",
}
)
Column operations
18. Renaming columns
df = df.rename({"PRICE": "price", "Date (mm/dd/yyyy)": "date"}, axis=1)
19. Add suffix and prefix
df.add_prefix("pre_")
df.add_suffix("_suf")
20. Create new columns (Mutate in dplyr terms)
# create a new column of Fahrenheit values from Celsius
df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
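Since assign evaluates its keyword arguments in order, later columns can build on earlier ones in the same call. A sketch, where temp_k is a hypothetical extra column:
df = df.assign(
    temp_f=lambda x: x.temp_c * 9 / 5 + 32,
    temp_k=lambda x: (x.temp_f - 32) * 5 / 9 + 273.15,
)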
21. Insert columns at specific positions
random_col = np.random.randint(10, size=len(df))
df.insert(3, 'random_col', random_col) # inserts at position 3, i.e. as the fourth column
22. if-then-else
df["logic"] = np.where(df["price"] > 5, "high", "low")
23. Dropping columns
df.drop('col1', axis=1, inplace=True)         # drop in place
df = df.drop(['col1', 'col2'], axis=1)        # drop multiple, reassign
s = df.pop('col')                             # drop and keep it as a Series
del df['col']                                 # plain Python delete
df.drop(df.columns[0], axis=1, inplace=True)  # drop by position
String operations
24. Column names
# on column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
25. Contains
df['name'].str.contains("John")
df['phone_num'].str.contains('...-...-....', regex=True) # regex
df['email'].str.contains('gmail')
26. findall
import re
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
df['email'].str.findall(pattern, flags=re.IGNORECASE)
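On a made-up Series, findall returns one (user, domain, tld) tuple per match in each row:
emails = pd.Series(["john.doe@gmail.com", "not-an-email"])
emails.str.findall(pattern, flags=re.IGNORECASE)
# 0    [(john.doe, gmail, com)]
# 1                          []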
Missing values
27. Checking
def missing_vals(df):
    """prints out columns with their percentage of missing values"""
    missing = [
        (df.columns[idx], perc)
        for idx, perc in enumerate(df.isna().mean() * 100)
        if perc > 0
    ]
    if len(missing) == 0:
        return "no missing values"
    # sort desc by perc
    missing.sort(key=lambda x: x[1], reverse=True)
    print(f"There are a total of {len(missing)} variables with missing values\n")
    for name, perc in missing:
        print(f"{name:<20} => {round(perc, 3)}%")
missing_vals(df)
There are a total of 16 variables with missing values
PoolQC => 100.0%
Alley => 94.0%
MiscFeature => 91.0%
Fence => 77.0%
FireplaceQu => 54.0%
LotFrontage => 14.0%
GarageType => 6.0%
GarageYrBlt => 6.0%
GarageFinish => 6.0%
GarageQual => 6.0%
GarageCond => 6.0%
BsmtQual => 3.0%
BsmtCond => 3.0%
BsmtExposure => 3.0%
BsmtFinType1 => 3.0%
BsmtFinType2 => 3.0%
28. Dealing with missing values
# drop
df.dropna(axis=0)  # drop rows with any NaN
df.dropna(axis=1)  # drop columns with any NaN
# impute
df.fillna(0)
df.fillna(method="ffill")  # forward fill (df.ffill() on newer pandas)
df.fillna(method="bfill")  # backward fill (df.bfill() on newer pandas)
# replace
df.replace(-999, np.nan)
df.replace("?", np.nan)
# interpolate
ts.interpolate()  # time series
df.interpolate()  # fill all consecutive NaNs forward
df.interpolate(limit=1)  # fill at most one consecutive NaN forward
df.interpolate(limit=1, limit_direction="backward")
df.interpolate(limit_direction="both")
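A minimal sketch of the default (linear) interpolation:
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.interpolate()  # 1.0, 2.0, 3.0, 4.0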
More in the docs
Date operations
29. Get X hours/days/weeks from today / ago
from datetime import date, datetime, timedelta

# from now / today
datetime.now() + timedelta(hours=30)  # use a datetime here, since dates drop the hours
date.today() + timedelta(days=30)
date.today() + timedelta(weeks=30)
# ago
date.today() - timedelta(days=365)  # e.g. datetime.date(2021, 4, 18)
30. Filter between two dates
df[(df["Date"] > "2015-01-01") & (df["Date"] < "2017-01-01")]
31. Filter by day/month/year
# filter by single day
df[df["Date"].dt.strftime("%Y-%m-%d") == "2017-03-01"]
# filter by single month
df[df["Date"].dt.strftime("%m") == "12"]
# filter by single year
df[df["Date"].dt.strftime("%Y") == "2017"]
Styling data frames
32. Number format
format_dict = {
"Date": "{:%d/%m/%y}",
"Open": "${:.2f}",
"Close": "${:.2f}",
"Volume": "{:,}",
}
df.style.format(format_dict)
Date Open Close Volume
1681 03/03/17 $250.74 $251.57 2,919,400
1686 10/03/17 $246.21 $243.69 3,057,000
1667 10/02/17 $269.79 $269.23 3,619,700
1642 05/01/17 $226.42 $226.75 5,911,700
1645 10/01/17 $232.00 $229.87 3,660,000
1655 25/01/17 $257.31 $254.47 5,142,600
1643 06/01/17 $226.93 $229.01 5,527,900
1666 09/02/17 $266.25 $269.20 7,820,200
1662 03/02/17 $251.91 $251.33 2,186,700
1679 01/03/17 $254.18 $250.02 4,800,300
33. Let there be colors
(
df.style.format(format_dict)
.hide_index()  # .hide(axis="index") on pandas >= 1.4
.highlight_min(["Open"], color="red")
.highlight_max(["Open"], color="green")
.background_gradient(subset="Close", cmap="Greens")
.bar('Volume', color='lightblue', align='zero')
.set_caption('Tesla Stock Prices in 2017')
)
(Output: the same table rendered without the index, with the min/max Open highlighted in red/green, a green gradient on Close, data bars on Volume, and the caption “Tesla Stock Prices in 2017”.)
More styling options in the docs
Misc
34. Get the index of max and min in a column
df['col'].idxmin()  # index label of the minimum
df['col'].idxmax()  # index label of the maximum
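Combined with .loc, this pulls out the whole row at the extreme value:
df.loc[df['col'].idxmax()]  # the full row where col peaks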
35. Apply function to data frame
df.applymap(lambda x: np.log(x))  # element-wise (df.map on pandas >= 2.1)
36. Randomly shuffle data
df.sample(frac=1, random_state=7).reset_index(drop=True)
37. Percent change
Useful for time series data. For example, the price of BTC over 3 days [30000, 33000, 31000] becomes [NaN, 0.1, -0.06].
df['col_name'].pct_change()
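The BTC example above, as runnable code:
btc = pd.Series([30000, 33000, 31000])
btc.pct_change()  # NaN, 0.1, -0.0606...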
38. Assign rank
df['rank'] = df['column_to_rank'].rank()
39. Check memory usage of data frame
df.memory_usage().sum() / (1024**2)  # in MB; pass deep=True to also count object contents
40. Explode list values to multiple rows
df.explode("col_name").reset_index(drop=True)
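A tiny before-and-after with made-up data:
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})
df.explode("tags").reset_index(drop=True)
#    id tags
# 0   1    a
# 1   1    b
# 2   2    c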
41. Convert smaller categories to “Others”
subclass = df.MSSubClass
subclass.value_counts()
20 45
60 18
50 9
120 5
30 4
90 4
190 3
70 3
160 2
45 2
80 2
180 1
75 1
85 1
Name: MSSubClass, dtype: int64
top_five = subclass.value_counts().nlargest(5).index
mssubclass_new = subclass.where(subclass.isin(top_five), other="Other")
mssubclass_new.value_counts()
20 45
Other 19
60 18
50 9
120 5
30 4
Name: MSSubClass, dtype: int64
Hope you found these code snippets useful in your own data work!
If you want more, check out the resources below:
- Pandas Documentation in PDF
- Pandas Cheatsheet (Official)
- Master Python’s Pandas library with these 100 tricks
- Pandas cheat sheet for data science
Thanks for reading!
Like this article? Here are some articles you may enjoy:
- Data Analysis in 10 Easy Steps — The key steps in the process of turning data into insights
- Top Machine Learning Frameworks in 2021 — The Top ML frameworks used by 25,000 data scientists in 2021.
- Top 5 Machine Learning Algorithms Explained — Exploring the most popular data science methods and their applications.
Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!
Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!