Predict a bird's species from genetics and location using Machine Learning
Like birds? Like Machine Learning?
You’ll love this challenge!
Problem Statement
Scientists have determined that a known species of bird should be divided into 3 distinct and separate species. These species are endemic to a particular region of the country and their populations must be tracked and estimated with as much precision as possible.
As such, a non-profit conservation society has taken up the task. They need to be able to log which species they have encountered based on the characteristics that their field officers observe in the wild.
Using certain genetic traits and location data, can you predict the species of bird that has been observed?
This is a beginner-level practice competition and your goal is to predict the bird species based on attributes or location.
You now have a clear goal.
The goal 🥅
Predict the bird species (A, B, or C) based on attributes or location
Let’s now look at the data
The data 💾
Get the data by registering for this data science competition.
📂 train
├── training_target.csv
├── training_set.csv
└── solution_format.csv
📂 test
└── test_set.csv
The data has been conveniently split into train and test datasets.
In both the train and test sets, you’re given bird data for locations 1 to 3.
Here’s a look at the first five rows of training_set.csv
bill_depth | bill_length | wing_length | location | mass | sex | ID |
14.3 | 48.2 | 210 | loc_2 | 4600 | 0 | 284 |
14.4 | 48.4 | 203 | loc_2 | 4625 | 0 | 101 |
18.4 | NA | 200 | loc_3 | 3400 | 0 | 400 |
14.98211382 | 47.50487805 | NA | NA | 4800 | 0 | 98 |
18.98211382 | 38.25930705 | 217.1869919 | loc_3 | 5200 | 0 | 103 |
The training_set and training_target files can be joined on the ‘ID’ column.
Below is a data dictionary for the given columns
- species: animal species (A, B, C)
- bill_length: bill length (mm)
- bill_depth: bill depth (mm)
- wing_length: wing length (mm)
- mass: body mass (g)
- location: island type (Location 1, 2, 3)
- sex: animal sex (0: Male; 1: Female; NA: Unknown)
Then, looking at solution_format.csv
ID species
2 A
5 C
7 C
8 B
9 C
Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.
Code for this article → Deepnote
Load Libraries
Next, we load up some essential libraries for visualizations and machine learning.
import pandas as pd
# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.dpi'] = 100
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="whitegrid")
%matplotlib inline
# ml
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
Missing data helper function
def missing_vals(df):
    """prints out columns with perc of missing values"""
    missing = [
        (df.columns[idx], perc)
        for idx, perc in enumerate(df.isna().mean() * 100)
        if perc > 0
    ]
    if len(missing) == 0:
        return "no missing values"
    # sort desc by perc
    missing.sort(key=lambda x: x[1], reverse=True)
    print(f"There are a total of {len(missing)} variables with missing values\n")
    for tup in missing:
        # left-align the column name in a 20-character field
        print(f"{tup[0]:<20} => {round(tup[1], 3)}%")
Load the data
train = pd.read_csv("dataset/training_set/training_set.csv")
labels = pd.read_csv("dataset/training_set/training_target.csv")
# join target variable to training set
train = train.merge(labels, on="ID")
test = pd.read_csv("dataset/test_set/test_set.csv")
First, we load the train and test data using the read_csv function.
We also merge training_set.csv (containing the features) with training_target.csv (containing the target variable) to form the train data.
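A join like this can silently duplicate rows if an ID appears more than once in either file. Here is a minimal sketch of a guarded merge, assuming two tiny made-up frames in place of the real CSVs; pandas' `validate` argument raises an error when the join is not one-to-one.

```python
import pandas as pd

# Hypothetical miniature versions of training_set.csv and training_target.csv
features = pd.DataFrame({"ID": [284, 101, 400], "mass": [4600, 4625, 3400]})
labels = pd.DataFrame({"ID": [101, 284, 400], "species": ["C", "C", "B"]})

# validate="one_to_one" raises pandas.errors.MergeError if an ID is duplicated
# on either side, so a bad join fails loudly instead of silently adding rows.
merged = features.merge(labels, on="ID", how="inner", validate="one_to_one")

# An inner join should keep every row when both files cover the same IDs.
rows_preserved = len(merged) == len(features)
print(merged)
print("rows preserved:", rows_preserved)
```

If the row count drops after the merge, some IDs are missing from one of the files and are worth inspecting before training.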
target_cols = "species"
num_cols = ["bill_depth", "bill_length", "wing_length", "mass"]
cat_cols = ["location", "sex"]
all_cols = num_cols + cat_cols + [target_cols]
train = train[all_cols]
Here I manually saved the names of the numerical and categorical columns, along with the target column.
This lets me easily reference the columns I want later on.
Exploratory Data Analysis
It’s time for the fun part, visualizing the data.
train.head()
bill_depth | bill_length | wing_length | location | mass | sex | ID | species |
14.3 | 48.2 | 210 | loc_2 | 4600 | 0 | 284 | C |
14.4 | 48.4 | 203 | loc_2 | 4625 | 0 | 101 | C |
18.4 | NaN | 200 | loc_3 | 3400 | 0 | 400 | B |
14.982114 | 47.504878 | NaN | NaN | 4800 | 0 | 98 | C |
18.982114 | 38.259307 | 217.186992 | loc_3 | 5200 | 0 | 103 | C |
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 434
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bill_depth 434 non-null float64
1 bill_length 295 non-null float64
2 wing_length 298 non-null float64
3 mass 433 non-null float64
4 location 405 non-null object
5 sex 379 non-null float64
6 species 435 non-null object
dtypes: float64(5), object(2)
memory usage: 27.2+ KB
The info output shows that several columns have missing values. We can also see that location and sex should be categorical, so we have to do some data type conversion later on.
Numerical columns
train[num_cols].hist(figsize=(20, 14));
Plotting the histograms of the numerical variables, we see that
- bill_depth peaks around 15 and 19
- bill_length peaks around 39 and 47
- wing_length peaks around 190 and 216
- mass is right-skewed
Categorical columns
Let’s first visualize our target class.
Location and species appear to line up with each other (loc_2 with species C, loc_3 with species A).
We also see there are slightly more female (1) birds than male (0) birds.
Based on the species plot, we have an imbalanced target: species B has considerably fewer samples than species A and C.
train.species.value_counts()
C 182
A 160
B 93
Name: species, dtype: int64
Why is this a problem?
The model will be biased towards classes with a larger amount of samples.
This happens because the classifier has more information on classes with more samples, so it learns how to predict those classes better while it remains weak in the smaller classes.
In our case, species A and C will be predicted more often than species B.
Here’s a great article on how to deal with this issue.
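One common way to counter this bias is to weight each class inversely to its frequency. The sketch below works through that arithmetic for the class counts shown above; it is the same formula scikit-learn applies when an estimator is given `class_weight="balanced"`. The weights here are only an illustration and are not used elsewhere in this article.

```python
# Class counts taken from train.species.value_counts() above.
counts = {"C": 182, "A": 160, "B": 93}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so the rarest class gets the largest weight.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in sorted(class_weights.items()):
    print(f"{label}: {weight:.3f}")
```

Species B ends up with roughly twice the weight of species C, so each B sample counts for more during training.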
Missing values
Using the helper function, we see a substantial amount of missing data for bill_length and wing_length.
missing_vals(train)
There are a total of 6 variables with missing values
bill_length => 32.184%
wing_length => 31.494%
sex => 12.874%
location => 6.897%
mass => 0.46%
bill_depth => 0.23%
Let’s also use a heatmap to visualize where the missing data falls in each column.
plt.figure(figsize=(10, 6))
sns.heatmap(train.isnull(), yticklabels=False, cmap='viridis', cbar=False);
Impute categorical values
Let’s first see how many missing values there are in one of our categorical variables, sex.
train.sex.value_counts(dropna=False)
1.0 195
0.0 184
NaN 56
Name: sex, dtype: int64
Let’s use the simple imputer to deal with them by replacing them with the most frequent value.
cat_imp = SimpleImputer(strategy="most_frequent")
train[cat_cols] = cat_imp.fit_transform(train[cat_cols])
As you can see, with the most_frequent strategy, the missing values were imputed with 1.0, which was the most frequent value.
train.sex.value_counts(dropna=False)
1.0 251
0.0 184
Name: sex, dtype: int64
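Filling about 13% of a column with a single value hides the fact that those rows were ever missing. As a hedged alternative, the sketch below uses a made-up toy column (not the competition data) and SimpleImputer's `add_indicator=True` option, which appends a 0/1 column flagging the rows that were imputed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with the same coding as `sex` (0, 1, NaN).
toy = pd.DataFrame({"sex": [1.0, 0.0, 1.0, np.nan, 1.0, np.nan]})

# add_indicator=True appends one extra column per feature that had missing
# values at fit time: 1.0 where the value was imputed, 0.0 otherwise.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
filled = imputer.fit_transform(toy[["sex"]])

# statistics_ holds the fill value learned for each column during fit.
print("fill value:", imputer.statistics_[0])
print(filled)
```

Whether the extra indicator column helps the model is an empirical question, so treat it as something to try rather than a guaranteed improvement.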
Impute Numerical columns
num_imp = SimpleImputer(strategy="median")
train[num_cols] = num_imp.fit_transform(train[num_cols])
missing_vals(train)
'no missing values'
Feature Preprocessing & Engineering
We’ll have to convert the categorical features to a numerical format, including the target variable.
Let’s use scikit-learn’s Label Encoder to do that.
Here’s an example of using LabelEncoder() on the label column
le = LabelEncoder()
le.fit(train['species'])
le_name_map = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_map
{'A': 0, 'B': 1, 'C': 2}
By fitting it first, we can see what the mapping looks like.
Using fit_transform directly converts it for us
train['species'] = le.fit_transform(train['species'])
For other columns with string variables (non-numeric), we also do the same encoding
for col in cat_cols:
    if train[col].dtype == "object":
        train[col] = le.fit_transform(train[col])
We also convert categorical features into the pd.Categorical dtype
# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    train[col] = pd.Categorical(train[col])
Here’s the current data type of the variables.
train.dtypes
bill_depth float64
bill_length float64
wing_length float64
mass float64
location category
sex category
species int64
dtype: object
Now we create some additional features by dividing one variable by another to form ratios.
We don’t know if they would help increase the predictive power of the model, but it doesn’t hurt to try.
train['b_depth_length_ratio'] = train['bill_depth'] / train['bill_length']
train['b_length_depth_ratio'] = train['bill_length'] / train['bill_depth']
train['w_length_mass_ratio'] = train['wing_length'] / train['mass']
Here’s what the train set looks like so far
train.head()
Building the model
Train test split
Now it’s time to build the model. We first split the data into X (features) and y (target variable), and then split those into training and evaluation sets.
The training set is where we train the model; the evaluation set is where we test the model before predicting on the test set.
X, y = train.drop(["species"], axis=1), train[["species"]].values.flatten()
We use train_test_split to split our data into the training and evaluation sets.
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.25, random_state=0)
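Because the classes are imbalanced, a purely random split can leave the evaluation set with a different class mix than the training set. Here is a small sketch of a stratified split, assuming synthetic labels with the same class counts as our data and placeholder features; `stratify=y` is a standard train_test_split argument, though it is not what the code above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the article's training data
# (A=160, B=93, C=182, encoded as 0/1/2); the features are placeholders.
y = np.array([0] * 160 + [1] * 93 + [2] * 182)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# With stratify=y, each split keeps roughly the original class proportions.
full_share = np.bincount(y) / len(y)
eval_share = np.bincount(y_eval) / len(y_eval)
print("full:", full_share.round(3))
print("eval:", eval_share.round(3))
```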
Decision Tree Classifier
For this article, we choose a simple baseline model, the DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
Once we fit the training set, we can predict on the evaluation data.
dtree_pred = dtree_model.predict(X_eval)
Model Performance
Let’s see how our simple decision tree classifier did.
High accuracy alone can be misleading on an imbalanced dataset, because a model can score well just by favoring the larger classes. We need more suitable metrics like precision, recall, and a confusion matrix.
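To make that concrete, here is a tiny made-up example (not our bird data): a "model" that always predicts the majority class reaches 99% accuracy while never finding the single minority sample.

```python
# Hypothetical labels: 99 samples of the majority class (0), 1 of the minority (1).
y_true = [0] * 99 + [1]

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class: found positives / actual positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
minority_recall = true_pos / actual_pos

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```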
Confusion matrix
Let’s create a confusion matrix for our model predictions.
First, we need to get the class names and the labels that the label encoder gave so our plot can show the label names.
We then plot a non-normalized and normalized confusion matrix.
# save the target variable classes (as a list, so they can be used as tick labels)
class_names = list(le_name_map.keys())
titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    fig, ax = plt.subplots(figsize=(8, 8))
    disp = ConfusionMatrixDisplay.from_estimator(
        dtree_model,
        X_eval,
        y_eval,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
        ax=ax,
    )
    disp.ax_.set_title(title)
    disp.ax_.grid(False)
    print(title)
    print(disp.confusion_matrix)
Confusion matrix, without normalization
[[40 0 0]
[ 5 12 0]
[12 1 39]]
Normalized confusion matrix
[[1. 0. 0. ]
[0.29411765 0.70588235 0. ]
[0.23076923 0.01923077 0.75 ]]
The confusion matrix shows us that the model predicts classes A and C far more often than B, which is not surprising since we had more samples of them.
It also shows the model is predicting class A when it should be B or C: 5 B birds and 12 C birds were labeled as A.
Classification Report
A classification report measures the quality of predictions from a classification algorithm.
It tells us how many predictions are right or wrong.
More specifically, it uses True Positives, False Positives, True Negatives, and False Negatives to compute the metrics of precision, recall, and f1-score.
For a detailed calculation of these metrics, check out Multi-Class Metrics Made Simple, Part II: the F1-score by Boaz Shmueli
Intuitively, precision is the ability of the classifier not to label as positive (correct) a sample that is negative (wrong), and recall is the ability of the classifier to find all the positive (correct) samples.
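These metrics can be worked out by hand from the non-normalized confusion matrix printed earlier. The sketch below does that with NumPy, assuming scikit-learn's convention that rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Non-normalized confusion matrix from the evaluation set
# (rows = true class, columns = predicted class).
cm = np.array([[40, 0, 0],
               [5, 12, 0],
               [12, 1, 39]])
classes = ["A", "B", "C"]

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Class A has perfect recall but the lowest precision, which matches what the matrix shows: every A bird is found, but 17 other birds are also labeled A.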
From the docs,
- "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
- "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.
There is no single best metric — it depends on your application. The application, and the real-life costs associated with the different types of errors, will dictate which metric to use.
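As a quick illustration of the difference between the two averages, the sketch below computes macro and weighted recall from our evaluation confusion matrix. One detail worth knowing: weighted recall always equals plain accuracy, because each class's recall is weighted by its number of true samples.

```python
import numpy as np

# Evaluation confusion matrix (rows = true class, columns = predicted class).
cm = np.array([[40, 0, 0],
               [5, 12, 0],
               [12, 1, 39]])

support = cm.sum(axis=1)          # number of true samples per class
recall = np.diag(cm) / support    # per-class recall

macro_recall = recall.mean()                            # every class counts equally
weighted_recall = np.average(recall, weights=support)   # weighted by true-class size
accuracy = np.diag(cm).sum() / cm.sum()

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
print(f"accuracy:        {accuracy:.3f}")
```

The macro average is lower here because the weaker recall on B and C pulls it down as much as the perfect recall on A pulls it up.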
Feature Importance
Let’s also plot the feature importance to see which features matter more.
feature_imp = pd.DataFrame(sorted(zip(dtree_model.feature_importances_,X.columns)), columns=['Value','Feature'])
plt.figure(figsize=(20, 15))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
# plt.savefig('dtree_fimp.png')
From the feature importance, it seems mass is the best at predicting species, followed by bill_length.
Other variables seem to have zero importance in the classifier.
We can see how the feature importance is used in this visualization of our decision tree classifier.
In the root node, if mass is lower than around 4600, the tree then checks bill_length; otherwise it checks bill_depth. The leaves are where it predicts the classes.
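If you prefer a text view of the split rules over a plot, scikit-learn's `export_text` prints them as an indented outline. This is only a sketch on the bundled iris dataset as stand-in data, since the bird data is not reproduced here; the same call should work on a tree fitted to our features.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the bundled iris dataset, not the bird data from the article.
iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_text renders the fitted tree's split rules as indented text.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```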
Predict on test data
First, we perform the same preprocessing and feature generation.
le = LabelEncoder()
cat_imp = SimpleImputer(strategy="most_frequent")
num_imp = SimpleImputer(strategy="median")
test[cat_cols] = cat_imp.fit_transform(test[cat_cols])
test[num_cols] = num_imp.fit_transform(test[num_cols])
for col in cat_cols:
    if test[col].dtype == "object":
        test[col] = le.fit_transform(test[col])
# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    test[col] = pd.Categorical(test[col])
# save ID column
test_id = test["ID"]
all_cols.remove('species')
test = test[all_cols]
test['b_depth_length_ratio'] = test['bill_depth'] / test['bill_length']
test['b_length_depth_ratio'] = test['bill_length'] / test['bill_depth']
test['w_length_mass_ratio'] = test['wing_length'] / test['mass']
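Note that the code above fits fresh imputers on the test set, so the fill values can differ from the ones used in training. A commonly recommended alternative is to fit on the training data only and reuse that fitted imputer on the test data. Here is a minimal sketch with made-up toy frames (the values are assumptions, not the competition data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test sets.
train_toy = pd.DataFrame({"mass": [3400.0, 4600.0, 5200.0, np.nan]})
test_toy = pd.DataFrame({"mass": [np.nan, 3000.0, 3100.0]})

num_imp = SimpleImputer(strategy="median")
num_imp.fit(train_toy[["mass"]])  # learn the median from training data only

# transform (not fit_transform) reuses the training median on the test rows.
test_filled = num_imp.transform(test_toy[["mass"]]).ravel()

print("training median:", num_imp.statistics_[0])
print(test_filled)
```

The missing test value is filled with the training median (4600) instead of the test median, so both sets are treated the same way the model saw during training.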
Then we can use our model to make the prediction, and concatenate the ID column to form the solution file.
test_preds = dtree_model.predict(test)
submission_df = pd.concat([test_id, pd.DataFrame(test_preds, columns=['species'])], axis=1)
submission_df.head()
Notice that the species values are numerical, so we have to convert them back to the string values. With the mapping we saved from the label encoder earlier, we can do so.
le_name_map
{'A': 0, 'B': 1, 'C': 2}
inv_map = {v: k for k, v in le_name_map.items()}
inv_map
{0: 'A', 1: 'B', 2: 'C'}
submission_df['species'] = submission_df['species'].map(inv_map)
submission_df.head()
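We used a saved dictionary here because the `le` object was refit on other columns after encoding the target, so it no longer remembers the species classes. Another option is to keep a dedicated encoder for the target and call its `inverse_transform` method. A small sketch, where `target_le` and the encoded predictions are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A dedicated encoder for the target, kept separate from the one reused
# for the feature columns (hypothetical name: target_le).
target_le = LabelEncoder()
target_le.fit(["A", "B", "C"])

# Hypothetical encoded predictions, as a model's predict() would return them.
encoded_preds = np.array([2, 0, 1, 2])

# inverse_transform maps the integer codes back to the original labels.
decoded = target_le.inverse_transform(encoded_preds)
print(decoded)
```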
Save the prediction file.
submission_df.to_csv('solution.csv', index=False)
Next steps
The base model won’t be enough to make a good prediction; here are some next steps to improve upon the given approach.
- More feature preprocessing and engineering
- Use cross-validation to have a better measure of the performance.
- Test out other algorithms like KNN, SVM, XGBoost, Catboost, etc.
- Join the bitgrit discord server to discuss the challenge with other data scientists
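As a starting point for the cross-validation suggestion above, here is a hedged sketch using stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption, so it can run on its own); swap in the X and y from this article to score the real model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the bird features (an assumption;
# swap in the article's X and y to score the real model).
X, y = make_classification(
    n_samples=435, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0)

# f1_macro gives each class equal weight, which suits imbalanced targets.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(scores.round(3), "mean:", scores.mean().round(3))
```

Five scores give a sense of how much the result depends on the particular split, which a single train/evaluation split cannot show.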
Awesome Kaggle notebooks
Here are three notebooks for reference on how to up your game in this challenge
- Data Science for tabular data: Advanced Techniques
- Credit Fraud — Dealing with Imbalanced Datasets
- House prices: Lasso, XGBoost, and a detailed EDA
Thanks for reading!
Data License
Data are available by CC-0 license in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.