
Bird Species Classification with Machine Learning

Data Science

Predict a bird’s species from genetic traits and location data using Machine Learning

Photo by Shannon Potter on Unsplash

Like birds? Like Machine Learning?

You’ll love this challenge!

Problem Statement

Scientists have determined that a known species of bird should be divided into 3 distinct and separate species. These species are endemic to a particular region of the country and their populations must be tracked and estimated with as much precision as possible.

As such, a non-profit conservation society has taken up the task. They need to be able to log which species they have encountered based on the characteristics that their field officers observe in the wild.

Using certain genetic traits and location data, can you predict the species of bird that has been observed?

This is a beginner-level practice competition, and your goal is to predict the bird species based on attributes or location.

Source

You now have a clear goal.

The goal 🥅

Predict the bird species (A, B, or C) based on attributes or location

Let’s now look at the data

The data 💾

Get the data by registering for this data science competition.

📂 train
├── training_target.csv
├── training_set.csv
└── solution_format.csv
📂 test
└── test_set.csv

The data has been conveniently split into train and test datasets.

In both the train and test sets, you’re given bird data for locations 1 to 3.

Here’s a look at the first five rows of training_set.csv

bill_depth   bill_length  wing_length  location  mass  sex  ID
14.3         48.2         210          loc_2     4600  0    284
14.4         48.4         203          loc_2     4625  0    101
18.4         NA           200          loc_3     3400  0    400
14.98211382  47.50487805  NA           NA        4800  0    98
18.98211382  38.25930705  217.1869919  loc_3     5200  0    103

The training_set and the training_target can be joined on the ‘ID’ column.

Below is a data dictionary for the given columns

species     : animal species (A, B, C)
bill_length : bill length (mm)
bill_depth  : bill depth (mm)
wing_length : wing length (mm)
mass        : body mass (g)
location    : island type (Location 1, 2, 3)
sex         : animal sex (0: Male; 1: Female; NA: Unknown)

Then, looking at solution_format.csv

ID 	species
2 	A
5 	C
7 	C
8 	B
9 	C

Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.

Code for this article → Deepnote

Load Libraries

First, we load up some essential libraries for visualizations and machine learning.

import pandas as pd

# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.dpi'] = 100
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="whitegrid")
%matplotlib inline

# ml
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

Missing data helper function

def missing_vals(df):
    """Prints out columns with their percentage of missing values."""
    missing = [
        (df.columns[idx], perc)
        for idx, perc in enumerate(df.isna().mean() * 100)
        if perc > 0
    ]

    if len(missing) == 0:
        return "no missing values"

    # sort descending by percentage
    missing.sort(key=lambda x: x[1], reverse=True)

    print(f"There are a total of {len(missing)} variables with missing values\n")

    for name, perc in missing:
        print(f"{name:<20} => {round(perc, 3)}%")

Load the data

train = pd.read_csv("dataset/training_set/training_set.csv")
labels = pd.read_csv("dataset/training_set/training_target.csv")

# join target variable to training set
train = train.merge(labels, on="ID")

test = pd.read_csv("dataset/test_set/test_set.csv")

First, we load the train and test data using the read_csv function.

We also merge training_set.csv (containing the features) with training_target.csv (containing the target variable) to form the train data.

target_cols = "species"
num_cols = ["bill_depth", "bill_length", "wing_length", "mass"]
cat_cols = ["location", "sex"]
all_cols = num_cols + cat_cols + [target_cols]

train = train[all_cols].copy()  # explicit copy to avoid SettingWithCopyWarning later

Here I manually save the numerical and categorical column names, along with the target column.

This lets me easily reference the columns I want later on.

Exploratory Data Analysis

It’s time for the fun part, visualizing the data.

train.head()
bill_depth  bill_length  wing_length  location  mass  sex  ID   species
14.3        48.2         210          loc_2     4600  0    284  C
14.4        48.4         203          loc_2     4625  0    101  C
18.4        NaN          200          loc_3     3400  0    400  B
14.982114   47.504878    NaN          NaN       4800  0    98   C
18.982114   38.259307    217.186992   loc_3     5200  0    103  C
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 434
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bill_depth   434 non-null    float64
 1   bill_length  295 non-null    float64
 2   wing_length  298 non-null    float64
 3   mass         433 non-null    float64
 4   location     405 non-null    object 
 5   sex          379 non-null    float64
 6   species      435 non-null    object 
dtypes: float64(5), object(2)
memory usage: 27.2+ KB

From the info function, we can see there are missing values, and that location and sex should be categorical, so we’ll have to do some data type conversion later on.

Numerical columns

train[num_cols].hist(figsize=(20, 14));

Plotting the histograms of the numerical variables, we see that:

  • bill_depth peaks around 15 and 19
  • bill_length peaks around 39 and 47
  • wing_length peaks around 190 and 216
  • mass is right-skewed

Categorical columns

Let’s first visualize our target class.
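
The original article shows these plots as images; here’s a minimal sketch that produces similar countplots (the exact figure layout is an assumption):

# countplots for the target class and the categorical features
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.countplot(x="species", data=train, ax=axes[0])
sns.countplot(x="sex", data=train, ax=axes[1])
sns.countplot(x="location", hue="species", data=train, ax=axes[2]);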

We see that location and species appear related: species C is concentrated at loc_2, while species A is concentrated at loc_3.

We also see there are slightly more female (1) birds than male (0) birds.

Based on the species plot, it appears we have an imbalanced dataset on our hands, as species B has considerably fewer samples than species A and C.

train.species.value_counts()

C    182
A    160
B     93
Name: species, dtype: int64

Why is this a problem?

The model will be biased towards the classes with a larger number of samples.

This happens because the classifier has more information on classes with more samples, so it learns to predict those classes better while remaining weak on the smaller classes.

In our case, species A and C will be predicted more often than species B.

Here’s a great article on how to deal with this issue.
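
One simple mitigation (not used in this walkthrough) is to weight classes inversely to their frequency, which scikit-learn supports out of the box:

# a hypothetical tweak: penalize mistakes on the rare class more heavily
weighted_tree = DecisionTreeClassifier(max_depth=2, class_weight="balanced")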

Missing values

Using the helper function, there seems to be a substantial amount of missing data for bill_length and wing_length

missing_vals(train)
There are a total of 6 variables with missing values

bill_length          => 32.184%
wing_length          => 31.494%
sex                  => 12.874%
location             => 6.897%
mass                 => 0.46%
bill_depth           => 0.23%

Let’s also use a heatmap to visualize where the missing data falls across the columns.

plt.figure(figsize=(10, 6))
sns.heatmap(train.isnull(), yticklabels=False, cmap='viridis', cbar=False);

Impute categorical values

Let’s first see how many missing values are in our categorical variables.

train.sex.value_counts(dropna=False)


1.0    195
0.0    184
NaN     56
Name: sex, dtype: int64

Let’s use the simple imputer to deal with them by replacing them with the most frequent value.

cat_imp = SimpleImputer(strategy="most_frequent")

train[cat_cols] = cat_imp.fit_transform(train[cat_cols])

As you can see, with the most_frequent strategy, the missing values were imputed with 1.0, the most frequent value.

train.sex.value_counts(dropna=False)

1.0    251
0.0    184
Name: sex, dtype: int64

Impute Numerical columns

num_imp = SimpleImputer(strategy="median")

train[num_cols] = num_imp.fit_transform(train[num_cols])
missing_vals(train)

'no missing values'

Feature Preprocessing & Engineering

We’ll have to convert the categorical features to a numerical format, including the target variable.

Let’s use scikit-learn’s Label Encoder to do that.

Here’s an example of using LabelEncoder() on the target label column.

le = LabelEncoder()
le.fit(train['species'])
le_name_map = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_map

{'A': 0, 'B': 1, 'C': 2}

By fitting it first, we can see what the mapping looks like.

Using fit_transform directly converts it for us

train['species'] = le.fit_transform(train['species'])

For other columns with string variables (non-numeric), we do the same encoding, keeping each fitted encoder around so we can apply the same mapping to the test set later.

encoders = {}
for col in cat_cols:
    if train[col].dtype == "object":
        encoders[col] = LabelEncoder()
        train[col] = encoders[col].fit_transform(train[col])

We also convert categorical features into the pd.Categorical dtype

# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    train[col] = pd.Categorical(train[col])

Here’s the current data type of the variables.

train.dtypes

bill_depth      float64
bill_length     float64
wing_length     float64
mass            float64
location       category
sex            category
species           int64
dtype: object

Now we create some additional features by dividing some variables by others to form ratios.

We don’t know if they would help increase the predictive power of the model, but it doesn’t hurt to try.

train['b_depth_length_ratio'] = train['bill_depth'] / train['bill_length']
train['b_length_depth_ratio'] = train['bill_length'] / train['bill_depth']
train['w_length_mass_ratio'] = train['wing_length'] / train['mass']

Here’s what the train set looks like so far

train.head()

Building the model

Train test split

Now it’s time to build the model. We first split the data into X (features) and y (target variable), and then split those into training and evaluation sets.

Training is where we train the model; evaluation is where we test it before predicting on the test set.

X, y = train.drop(["species"], axis=1), train[["species"]].values.flatten()

We use train_test_split to split our data into the training and evaluation sets.

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.25, random_state=0)
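
Given the imbalance we saw earlier, a stratified split is worth considering as an alternative (an option, not what this walkthrough uses), so both sets keep the same class proportions:

# hypothetical alternative: preserve class proportions in both splits
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)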

Decision Tree Classifier

For this article, we choose a simple baseline model, the DecisionTreeClassifier.

dtree_model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

Once we fit the training set, we can predict on the evaluation data.

dtree_pred = dtree_model.predict(X_eval)

Model Performance

Let’s see how our simple decision tree classifier did.

A 99% accuracy can be meaningless for an imbalanced dataset, so we need more suitable metrics like precision, recall, and a confusion matrix.

Confusion matrix

Let’s create a confusion matrix for our model predictions.

First, we need to get the class names and the labels that the label encoder gave so our plot can show the label names.

We then plot a non-normalized and normalized confusion matrix.

# save the target variable classes
class_names = list(le_name_map.keys())

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    fig, ax = plt.subplots(figsize=(8, 8))

    disp = ConfusionMatrixDisplay.from_estimator(
        dtree_model,
        X_eval,
        y_eval,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
        ax = ax
    )
    disp.ax_.set_title(title)
    disp.ax_.grid(False)

    print(title)
    print(disp.confusion_matrix)
Confusion matrix, without normalization
[[40  0  0]
 [ 5 12  0]
 [12  1 39]]
Normalized confusion matrix
[[1.         0.         0.        ]
 [0.29411765 0.70588235 0.        ]
 [0.23076923 0.01923077 0.75      ]]

The confusion matrix shows that the model predicts classes A and C more often, which is not surprising since we had more samples of them.

It also shows the model predicting class A when the true class is B or C.

Classification Report

A classification report measures the quality of predictions from a classification algorithm.

It tells us how many predictions are right and wrong.

More specifically, it uses True Positives, False Positives, True Negatives, and False Negatives to compute the metrics of precision, recall, and F1-score.

For a detailed calculation of these metrics, check out Multi-Class Metrics Made Simple, Part II: the F1-score by Boaz Shmueli

Intuitively, precision is the ability of the classifier not to label as positive (correct) a sample that is negative (wrong), and recall is the ability of the classifier to find all the positive (correct) samples.

From the docs,

  • "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
  • "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

There is no single best metric — it depends on your application. The application, and the real-life costs associated with the different types of errors, will dictate which metric to use.
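
Here’s how to generate the report for our model with the classification_report function imported earlier (the exact numbers will depend on your split):

# per-class precision, recall, and f1-score, plus macro/weighted averages
print(classification_report(y_eval, dtree_pred, target_names=list(le_name_map.keys())))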

Feature Importance

Let’s also plot the feature importance to see which features matter more.

feature_imp = pd.DataFrame(
    sorted(zip(dtree_model.feature_importances_, X.columns)),
    columns=['Value', 'Feature']
)

plt.figure(figsize=(20, 15))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
# plt.savefig('dtree_fimp.png')

From the feature importance, it seems mass is the best predictor of species, followed by bill_length.

Other variables seem to have zero importance in the classifier.

We can see how these features are used by visualizing our decision tree classifier.
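
The original shows the tree as an image; here’s a minimal sketch of how to draw it with the tree module we imported earlier (figure size is an assumption):

# visualize the fitted tree: split features, thresholds, and leaf classes
fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(
    dtree_model,
    feature_names=list(X.columns),
    class_names=list(le_name_map.keys()),
    filled=True,
    ax=ax,
);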

At the root node, if mass is lower than around 4600, the tree then checks bill_length; otherwise, it checks bill_depth. The leaves are where the classes are predicted.

Predict on test data

First, we perform the same preprocessing and feature generation steps.

# reuse the imputers fitted on the training set,
# so test values are filled with training statistics
test[cat_cols] = cat_imp.transform(test[cat_cols])
test[num_cols] = num_imp.transform(test[num_cols])

# reuse the label encoders fitted on the training set,
# so categories map to the same integers as in train
for col in cat_cols:
    if test[col].dtype == "object":
        test[col] = encoders[col].transform(test[col])

# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    test[col] = pd.Categorical(test[col])

# save ID column
test_id = test["ID"]

all_cols.remove('species')
test = test[all_cols].copy()  # explicit copy before adding the ratio columns

test['b_depth_length_ratio'] = test['bill_depth'] / test['bill_length']
test['b_length_depth_ratio'] = test['bill_length'] / test['bill_depth']
test['w_length_mass_ratio'] = test['wing_length'] / test['mass']

Then we can use our model to make the prediction, and concatenate the ID column to form the solution file.

test_preds = dtree_model.predict(test)
submission_df = pd.concat([test_id, pd.DataFrame(test_preds, columns=['species'])], axis=1)
submission_df.head()

Notice the species values are numerical; we have to convert them back to the string values. With the label encoder we fit earlier, we can do so.

le_name_map

{'A': 0, 'B': 1, 'C': 2}
inv_map = {v: k for k, v in le_name_map.items()}
inv_map

{0: 'A', 1: 'B', 2: 'C'}
submission_df['species'] = submission_df['species'].map(inv_map)  
submission_df.head()
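
As an alternative to building inv_map by hand, the encoder we fit on species can invert the labels directly (this assumes le was not refit on another column, which is why the test preprocessing above reuses per-column encoders):

# equivalent shortcut: invert the numeric predictions with the fitted encoder
submission_df['species'] = le.inverse_transform(test_preds)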

Save the prediction file.

submission_df.to_csv('solution.csv', index=False)

Next steps

The base model won’t be enough to make a good prediction; here are some next steps to improve upon the given approach.

  1. More feature preprocessing and engineering
  2. Use cross-validation to get a better measure of performance (see the sketch after this list).
  3. Test out other algorithms like KNN, SVM, XGBoost, Catboost, etc.
  4. Join the bitgrit discord server to discuss the challenge with other data scientists
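
As an example of step 2, here’s a minimal cross-validation sketch for the same baseline model (the macro F1 scoring choice is an assumption, picked because it suits the imbalanced classes):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the baseline tree on the full training data
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2), X, y, cv=5, scoring="f1_macro"
)
print(f"macro F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")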

Awesome Kaggle notebooks

Here are 3 notebooks as references on how to up your game in this challenge:

  1. Data Science for tabular data: Advanced Techniques
  2. Credit Fraud — Dealing with Imbalanced Datasets
  3. House prices: Lasso, XGBoost, and a detailed EDA

Thanks for reading!


Data License

Data are available by CC-0 license in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.

Like my writing? Join Medium with my referral link, and you’ll be supporting me directly 🤗