Predicting Rain with Machine Learning

Will it rain tomorrow? Let’s build an AI model to answer that.

Looking for a fun challenge to grow your data science skills?

Here’s a real-world problem for you.

Problem Statement

The weather has a significant impact on the agricultural industry, so being able to predict it helps farmers with day-to-day decisions such as how to plan efficiently, minimize costs, and maximize yields.

A major agricultural company needs you to help them maximize growth efficiency, save resources and optimize their production.

To achieve these things, the company needs to have an accurate weather prediction algorithm that will improve their decision-making on typical farming activities such as planting and irrigating.

Using historical weather information from their region, can you predict what the weather will be like in the next few days?

You now have a clear goal.

The goal 🥅

Predict the next day’s weather as one of three labels.

  • N — No rain
  • L — Light rain
  • H — Heavy rain

Let’s now look at the data.

The data 💾

📂 train
├── region_A_train.csv
├── region_B_train.csv
├── region_C_train.csv
├── region_D_train.csv
├── region_E_train.csv
├── solution_format.csv
└── solution_train.csv

📂 test
├── region_A_test.csv
├── region_B_test.csv
├── region_C_test.csv
├── region_D_test.csv
└── region_E_test.csv

The data has been conveniently split into train and test datasets.

In both train and test, you’re given weather data for five anonymized, neighboring locations, named region A through region E.

Here’s a look at the first five rows of region_A_train.csv

| date | avg.temp | max.temp | min.temp | precipitation | avg.wind.speed | max.wind.speed | max.wind.speed.dir | max.inst.wind.speed | max.inst.wind.speed.dir | min.atmos.pressure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 229b70a3 | 3.3 | 10.2 | -2.4 | 0 | 2.9 | 9.3 | W | 14.3 | W | 1015.1 |
| 3134f4ff | 5.7 | 13.7 | -2.9 | 0 | 3.6 | 10.7 | W | 15.8 | W | 1011.3 |
| dbfaf910 | 13.8 | 20 | 9 | 0 | 5.3 | 9.4 | SW | 15.2 | W | 1004.2 |
| 3aea0cf0 | 11.4 | 19.3 | 5.8 | 0 | 4.2 | 10.1 | SW | 20.6 | SW | 1001.7 |
| f0227f56 | 2.4 | 7.7 | 0.3 | 43.5 | 0.9 | 3.7 | SW | 5.7 | SW | 1003.5 |

The first thing you should notice is that the date column isn’t a date but was anonymized to be some random value.

There are 10 features in total, covering temperature, precipitation, wind speed, wind direction, and atmospheric pressure.

Then, looking at solution_format.csv

| date | label |
| --- | --- |
| a8c6911b | N |
| eebdce12 | N |
| 6fb420a6 | L |
| 3bf8b132 | N |

We can utilize the date column to join it with the training data and build a model.

Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.

Get the data by registering for this data science competition and follow along with this article!

Code for this article → Google Colab or Deepnote.

Installing libraries and models

First, install the weapon of our choice: lightgbm

%pip install lightgbm

Load Libraries

Next, we load up some essential libraries for visualizations and machine learning.

# essentials
import numpy as np 
import pandas as pd 

# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.dpi'] = 100
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="whitegrid")
%matplotlib inline

# ml
from sklearn.metrics import accuracy_score, recall_score, ConfusionMatrixDisplay, classification_report, auc, precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import joblib

Load the data

Let’s read in all the data we have.

train_a = pd.read_csv("data/comp_datasets_train/region_A_train.csv")
train_b = pd.read_csv("data/comp_datasets_train/region_B_train.csv")
train_c = pd.read_csv("data/comp_datasets_train/region_C_train.csv")
train_d = pd.read_csv("data/comp_datasets_train/region_D_train.csv")
train_e = pd.read_csv("data/comp_datasets_train/region_E_train.csv")

test_a = pd.read_csv("data/comp_datasets_test/region_A_test.csv")
test_b = pd.read_csv("data/comp_datasets_test/region_B_test.csv")
test_c = pd.read_csv("data/comp_datasets_test/region_C_test.csv")
test_d = pd.read_csv("data/comp_datasets_test/region_D_test.csv")
test_e = pd.read_csv("data/comp_datasets_test/region_E_test.csv")

labels_df = pd.read_csv("data/comp_datasets_train/solution_train.csv")
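
From here on, we’ll work with one combined dataframe per split, train_all_lvls and test_all_lvls, where a region column tags each row with its source file. The exact combining code is in the notebook; a minimal sketch might look like this:

# stack the five regional frames, tagging each row with its region
train_frames = {"A": train_a, "B": train_b, "C": train_c, "D": train_d, "E": train_e}
test_frames = {"A": test_a, "B": test_b, "C": test_c, "D": test_d, "E": test_e}

train_all_lvls = pd.concat(
    [df.assign(region=name) for name, df in train_frames.items()], ignore_index=True
)
test_all_lvls = pd.concat(
    [df.assign(region=name) for name, df in test_frames.items()], ignore_index=True
)

# put region first so the feature columns start at position 2
cols = ["region"] + [c for c in train_all_lvls.columns if c != "region"]
train_all_lvls = train_all_lvls[cols]
test_all_lvls = test_all_lvls[cols]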

Exploratory Data Analysis

It’s time for the fun part, visualizing the data.

train_a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   date                     566 non-null    object 
 1   avg.temp                 566 non-null    float64
 2   max.temp                 566 non-null    float64
 3   min.temp                 566 non-null    float64
 4   precipitation            566 non-null    float64
 5   avg.wind.speed           566 non-null    float64
 6   max.wind.speed           566 non-null    float64
 7   max.wind.speed.dir       566 non-null    object 
 8   max.inst.wind.speed      566 non-null    float64
 9   max.inst.wind.speed.dir  566 non-null    object 
 10  min.atmos.pressure       566 non-null    float64
dtypes: float64(8), object(3)
memory usage: 48.8+ KB

Distribution of target class

Let’s first visualize our target class.

sns.countplot(x = 'label', data = labels_df, palette="Set1");

It appears we have an imbalanced dataset on our hands, as the N label dominates the other classes.

Why is this a problem?

The model will be biased towards classes with a larger amount of samples.

This happens because the classifier has more information on classes with more samples, so it learns how to predict those classes better while it remains weak on the smaller classes.

In our case, the label N will be predicted more than other classes.
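
To put a number on the imbalance, we can check the class proportions directly:

# fraction of samples in each class
labels_df['label'].value_counts(normalize=True)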

Let’s keep this in mind and move forward to more visualizations.

Plotting the features

We have ten features in each region.

train_all_lvls.columns[2:]
Index(['avg.temp', 'max.temp', 'min.temp', 'precipitation', 'avg.wind.speed',
       'max.wind.speed', 'max.wind.speed.dir', 'max.inst.wind.speed',
       'max.inst.wind.speed.dir', 'min.atmos.pressure'],
      dtype='object')

It’s time to cook up some plots.

Since all the regions are used to predict the next day’s weather, let’s see whether all the regions share similar patterns in the features and whether any outliers or anomalies exist.

fig, axes = plt.subplots(5,2,figsize=(14, 30), dpi=100)

for i, col_name in enumerate(train_all_lvls.columns[2:]):
    if train_all_lvls[col_name].dtype == 'O':
        train_all_lvls.groupby('region')[col_name].hist(ax=axes[i%5][i//5], alpha=0.6);
        axes[i%5][i//5].legend(["A", "B", "C", "D", "E"]);
    else:
        train_all_lvls.groupby('region')[col_name].plot(ax=axes[i%5][i//5], alpha=0.7);
        axes[i%5][i//5].legend();
    axes[i%5][i//5].set_title(f'{col_name}', fontsize=13);
    plt.subplots_adjust(hspace=0.45)

From the plots, we see that the patterns in the data are very similar across regions, except that regions C, D, and E sit on the lower end of the scale for avg.wind.speed and max.wind.speed.

Now that we’ve explored the data a little, we check for missing values in our data.

Missing values

Using my custom helper function, we see that a large amount of data is missing for the min.atmos.pressure variable:

missing_cols(train_all_lvls)
min.atmos.pressure => 2264 [80.0%]
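
The helper itself isn’t shown in this article; a minimal version that produces output in this format could look like this (a sketch, not the exact notebook code):

def missing_cols(df):
    # print the count and percentage of missing values per column
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        if n_missing > 0:
            print(f"{col} => {n_missing} [{100 * n_missing / len(df):.1f}%]")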

Let’s also use a heatmap to visualize the missing data for that column.

plt.figure(figsize=(10, 6))
sns.heatmap(train_all_lvls.isnull(), yticklabels=False, cmap='viridis', cbar=False);

There are multiple ways we can deal with missing data.

Let’s first look at the distribution of the column with missing values.

train_all_lvls['min.atmos.pressure'].hist();

Since the distribution of the column is approximately normal, let’s impute the missing data with the mean.

mean_atmos = train_all_lvls['min.atmos.pressure'].mean()
train_all_lvls['min.atmos.pressure'] = train_all_lvls['min.atmos.pressure'].fillna(mean_atmos)

We then do the same for the test data (full code in the notebook).
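
One caveat worth noting: to avoid leaking test-set statistics into the pipeline, it’s common to reuse the training mean when filling the test column, along these lines (assuming the combined test frame is named test_all_lvls):

# impute the test column with the mean computed on the training data
test_all_lvls['min.atmos.pressure'] = test_all_lvls['min.atmos.pressure'].fillna(mean_atmos)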

Feature Preprocessing & Engineering

Converting data types

Let’s first add the labels to our data.

train_all_lvls = train_all_lvls.merge(labels_df, on="date")

Then we take a look at the categorical columns for our dataset.

train_all_lvls.select_dtypes('object').columns
Index(['region', 'date', 'max.wind.speed.dir', 'max.inst.wind.speed.dir',
       'label'],
      dtype='object')

We’ll have to convert the categorical features, including the target variable, to a numerical format.

Let’s use scikit-learn’s Label Encoder to do that.

Here’s an example of using LabelEncoder() on the label column

le = LabelEncoder()
le.fit(train_all_lvls['label'])
le_name_map = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_map

{'H': 0, 'L': 1, 'N': 2}

Creating new weather features

Since we’re given some weather features, there are interesting new features we can engineer.

One example is the Beaufort scale, which is “an empirical measure that relates wind speed to observed conditions at sea or on land.”

Below is a function to do our feature engineering.

BEAUFORT = [
    # (Beaufort number, lower bound, upper bound)
    (0, 0, 0.3),
    (1, 0.3, 1.6),
    (2, 1.6, 3.4),
    (3, 3.4, 5.5),
    (4, 5.5, 8),
    (5, 8, 10.8),
    (6, 10.8, 13.9),
    (7, 13.9, 17.2),
    (8, 17.2, 20.8),
    (9, 20.8, 24.5),
    (10, 24.5, 28.5),
    (11, 28.5, 33),
    (12, 33, 200),
]


def feature_eng(df):
    le = LabelEncoder()

    # skip the first two object columns (region and date) and encode the rest:
    # the wind direction columns and, for the train set, the label
    cat_cols = df.select_dtypes("object").columns[2:]

    for col in cat_cols:
        df[col] = le.fit_transform(df[col])

    # wind speed is in meters/second; convert to knots (1 m/s ~ 1.944 kn)
    # and bin the average and maximum wind speeds into Beaufort numbers
    for scale, lower, upper in BEAUFORT:
        df.loc[
            (df["avg.wind.speed"] * 1.944 >= lower) & (df["avg.wind.speed"] * 1.944 < upper),
            "avg_beaufort_scale",
        ] = scale
        df.loc[
            (df["max.wind.speed"] * 1.944 >= lower) & (df["max.wind.speed"] * 1.944 < upper),
            "max_beaufort_scale",
        ] = scale

    df['avg_beaufort_scale'] = df['avg_beaufort_scale'].astype(int)
    df['max_beaufort_scale'] = df['max_beaufort_scale'].astype(int)

    return df

We can easily apply the feature engineering steps on the train and test dataset.

train = feature_eng(train_all_lvls)
test = feature_eng(test_all_lvls)

Prepare train data

Let’s look at our train data so far.

train.head()
| region | date | avg.temp | max.temp | min.temp | precipitation | avg.wind.speed | max.wind.speed | max.wind.speed.dir | max.inst.wind.speed | max.inst.wind.speed.dir | min.atmos.pressure | label | avg_beaufort_scale | max_beaufort_scale |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 229b70a3 | 3.3 | 10.2 | -2.4 | 0 | 2.9 | 9.3 | 7 | 14.3 | 7 | 1015.1 | 2 | 4 | 8 |
| B | 229b70a3 | 2.6 | 9.3 | -3.3 | 0 | 2.8 | 6.3 | 3 | 11.4 | 3 | 1010.216254 | 2 | 3 | 6 |
| C | 229b70a3 | 1.4 | 9 | -4.3 | 0 | 1 | 2.9 | 6 | 8.2 | 1 | 1010.216254 | 2 | 2 | 4 |
| D | 229b70a3 | 3.7 | 11.7 | -2.6 | 0 | 1.5 | 5.4 | 6 | 9.4 | 4 | 1010.216254 | 2 | 2 | 5 |
| E | 229b70a3 | -0.6 | 5.2 | -5.9 | 0 | 1.5 | 3.8 | 3 | 9.8 | 7 | 1010.216254 | 2 | 2 | 4 |

It’s time to pivot our data into a wider format: instead of region being a single column, each feature gets one column per region, such as avg.temp_A, avg.temp_B, up to avg.temp_E, and likewise for the rest of the features.

This way the data will be in the right shape to build the model.

We can use the pivot_table function in pandas to achieve that.

# pivot so each (feature, region) pair becomes its own column;
# every (date, region) pair is unique, so pivot_table's default mean
# aggregation simply passes the values through
train = train.pivot_table(index=["date", "label"], columns="region")
train = pd.DataFrame(train.to_records())
train.head()
| date | label | ('avg.temp', 'A') | ('avg.temp', 'B') | ('avg.temp', 'C') | ('avg.temp', 'D') | ('avg.temp', 'E') | ('avg.wind.speed', 'A') | ('avg.wind.speed', 'B') | ('avg.wind.speed', 'C') | … | ('min.temp', 'A') | ('min.temp', 'B') | ('min.temp', 'C') | ('min.temp', 'D') | ('min.temp', 'E') | ('precipitation', 'A') | ('precipitation', 'B') | ('precipitation', 'C') | ('precipitation', 'D') | ('precipitation', 'E') |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0173aec2 | 2 | 18.7 | 17.6 | 16.9 | 19.5 | 14.3 | 1.6 | 1.8 | 0.9 | … | 14.9 | 12.5 | 12.9 | 16.4 | 9 | 1 | 1.5 | 1.5 | 0 | 4.5 |
| 0083f291 | 1 | 13.1 | 12.6 | 12 | 13 | 10.7 | 1.4 | 1 | 0.7 | … | 11.3 | 11.1 | 10.6 | 12.1 | 8.5 | 50 | 46.5 | 49 | 45.5 | 63 |
| 014cfe7b | 2 | 19.9 | 19 | 17.5 | 19.9 | 16.2 | 3.7 | 3.6 | 0.7 | … | 16.1 | 13.5 | 15.8 | 16.3 | 12.4 | 18 | 17 | 63.5 | 19 | 32 |
| 01947c8e | 2 | 21.6 | 20.2 | 20.5 | 21.3 | 17.6 | 1.6 | 1.2 | 1.1 | … | 15.8 | 14.2 | 15.5 | 15.9 | 11.5 | 0 | 1 | 0 | 0 | 0 |
| 0258884d | 2 | 15.2 | 13.9 | 13.9 | 15.8 | 11.1 | 2.5 | 2.3 | 1.2 | … | 10.3 | 5.6 | 8.9 | 10.4 | 5.3 | 0 | 0 | 0 | 0 | 0 |

(The … column stands for the features omitted in the truncated display.)

The column names aren’t ideal, so I wrote a function to fix that.

def replace_all(text):
    # strip the tuple syntax: ('avg.temp', 'A') -> avg.temp_A
    d = {"('": "", "', '": "_", "')": ""}
    for i, j in d.items():
        text = text.replace(i, j)
    return text

test_str = "('avg.temp', 'A')"
replace_all(test_str)

'avg.temp_A'

train.columns = list(map(replace_all, train.columns))
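
An equivalent route, if you’d rather avoid the string surgery, is to flatten the MultiIndex columns before converting to records; here pivoted stands for the direct output of pivot_table above:

# join the two column levels directly: ('avg.temp', 'A') -> avg.temp_A
pivoted.columns = ['_'.join(col) for col in pivoted.columns]
pivoted = pivoted.reset_index()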

Here’s what the train data looks like so far!

train
| date | label | avg.temp_A | avg.temp_B | avg.temp_C | avg.temp_D | avg.temp_E | avg.wind.speed_A | avg.wind.speed_B | avg.wind.speed_C | … | min.temp_A | min.temp_B | min.temp_C | min.temp_D | min.temp_E | precipitation_A | precipitation_B | precipitation_C | precipitation_D | precipitation_E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0173aec2 | 2 | 18.7 | 17.6 | 16.9 | 19.5 | 14.3 | 1.6 | 1.8 | 0.9 | … | 14.9 | 12.5 | 12.9 | 16.4 | 9 | 1 | 1.5 | 1.5 | 0 | 4.5 |
| 0083f291 | 1 | 13.1 | 12.6 | 12 | 13 | 10.7 | 1.4 | 1 | 0.7 | … | 11.3 | 11.1 | 10.6 | 12.1 | 8.5 | 50 | 46.5 | 49 | 45.5 | 63 |
| 014cfe7b | 2 | 19.9 | 19 | 17.5 | 19.9 | 16.2 | 3.7 | 3.6 | 0.7 | … | 16.1 | 13.5 | 15.8 | 16.3 | 12.4 | 18 | 17 | 63.5 | 19 | 32 |
| 01947c8e | 2 | 21.6 | 20.2 | 20.5 | 21.3 | 17.6 | 1.6 | 1.2 | 1.1 | … | 15.8 | 14.2 | 15.5 | 15.9 | 11.5 | 0 | 1 | 0 | 0 | 0 |
| 0258884d | 2 | 15.2 | 13.9 | 13.9 | 15.8 | 11.1 | 2.5 | 2.3 | 1.2 | … | 10.3 | 5.6 | 8.9 | 10.4 | 5.3 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| fe2a1385 | 1 | 2.9 | 1.6 | 1.4 | 3.4 | 0.4 | 1.2 | 0.9 | 0.7 | … | -4 | -5.9 | -2.8 | -3.4 | -5.9 | 0 | 0 | 0 | 0 | 0 |
| fe6dd99c | 1 | 2.9 | 2.9 | 3.9 | 2.9 | 0.2 | 1.6 | 3.9 | 1 | … | 1.7 | 1.4 | 1.9 | 1 | -1.2 | 2 | 1 | 2 | 4 | 8 |
| ff88c3dd | 1 | 9.8 | 8.9 | 9 | 10.2 | 6.3 | 1.9 | 1.5 | 1.4 | … | 1.5 | -0.1 | 2 | 2.5 | -1.9 | 0 | 0 | 0 | 0 | 0 |
| ff929090 | 2 | 10.4 | 8.1 | 7.1 | 11 | 5.7 | 4.3 | 4.6 | 0.9 | … | 3.4 | 1.5 | -0.6 | 3.2 | -1.6 | 0 | 3.5 | 5.5 | 0 | 1.5 |
| ffe3bd74 | 2 | 22.7 | 22.4 | 21.7 | 21.7 | 19.4 | 3.7 | 5.4 | 0.8 | … | 18.8 | 18.8 | 18.4 | 18.9 | 15.2 | 28 | 13.5 | 25.5 | 44 | 49 |

After we do the same to the test data, it’s time to split the train data.

X, y = train.drop(["label", "date"], axis=1), train[["label"]].values.flatten()

LightGBM handles categorical features natively when they’re given the categorical data type, so let’s convert them.

# Extract categoricals and their indices
cat_feats = X.select_dtypes(include=['int64']).columns.to_list()
cat_idx = [X.columns.get_loc(col) for col in cat_feats]

# Convert cat_features to pd.Categorical dtype
for col in cat_feats:
    X[col] = pd.Categorical(X[col])

We can use train_test_split to split our data into the training and evaluation sets.

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)
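
Given the class imbalance we saw earlier, it’s worth knowing that train_test_split can also do a stratified split, which preserves the label proportions in both sets:

# same split, but keeping the label distribution identical in train and eval
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)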

It’s time to build the LightGBM model!

LightGBM

Let’s create a base lgb.LGBMClassifier for a simple first prediction.

We can then fit the training data.

clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)

LGBMClassifier()

Then we call the predict function on the evaluation data

y_pred = clf.predict(X_eval)

Model Performance

Let’s see how a base LightGBM classifier did.

A 99% accuracy can be meaningless for an imbalanced dataset: a model that always predicts the majority class scores high without learning anything. We need more suitable metrics like precision, recall, and a confusion matrix.
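
For reference, the headline numbers are one call away with the metrics we imported earlier; just don’t rely on them alone:

print(f"Accuracy: {accuracy_score(y_eval, y_pred):.3f}")
print(f"Macro recall: {recall_score(y_eval, y_pred, average='macro'):.3f}")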

Confusion matrix

Let’s create a confusion matrix for our model predictions.

First, we need to get the class names and the labels that the label encoder gave so our plot can show the label names.

We then plot a non-normalized and normalized confusion matrix.

class_names = le_name_map.keys()

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    fig, ax = plt.subplots(figsize=(10, 10))

    disp = ConfusionMatrixDisplay.from_estimator(
        clf,
        X_eval,
        y_eval,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
        ax = ax
    )
    disp.ax_.set_title(title)
    disp.ax_.grid(False)

    print(title)
    print(disp.confusion_matrix)
Confusion matrix, without normalization
[[ 3  4  2]
 [ 1 12 26]
 [ 0 19 75]]
Normalized confusion matrix
[[0.33333333 0.44444444 0.22222222]
 [0.02564103 0.30769231 0.66666667]
 [0.         0.20212766 0.79787234]]

As you can see from the shading of the plot, our model predicts the label N much more often than the others.

With the values you see in the plot above, we can compute some metrics that tell us how well our model is doing.

Classification Report

A classification report measures the quality of predictions from a classification algorithm.

It answers the question of how many predictions were correct and how many were not, broken down per class.

More specifically, it uses True Positives, False Positives, True Negatives, and False Negatives to compute the metrics of precision, recall, and f1-score

print(classification_report(y_eval, y_pred))

Let’s break down the output.

  • Precision: what percent of your predictions were correct?
  • Recall: what percent of the positive cases did you catch?
  • F1-score: the harmonic mean of precision and recall
  • Support: the number of occurrences of each given class

If you still have a hard time understanding precision and recall, read this great explanation.

Calculating the metrics

Taking class 0, which is the label H, let’s calculate the precision, recall, and f1-score from the values in the confusion matrix.

TP - True Positive | FP - False Positive
FN - False Negative | TN - True Negative

In the non-normalized matrix, the first row holds the true H samples (3 predicted H, 4 predicted L, 2 predicted N), and the first column holds everything predicted as H (3 true H, 1 true L, 0 true N). So:

Precision = TP/(TP+FP) = 3/(3+1+0) = 3/4 = 0.75
Recall = TP/(TP+FN) = 3/(3+4+2) = 3/9 = 0.33
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * 0.75 * 0.33 / (0.75 + 0.33)
         = 0.495 / 1.08
         = 0.458

Macro or weighted?

You might notice three overall values in the f1-score column of the report: 0.64, 0.53, and 0.65, corresponding to the accuracy, macro average, and weighted average rows. Which value should you focus on?

In an imbalanced dataset where all classes are equally important, macro average is a good choice as it treats all classes equally.

If however, you want to assign greater weight to classes with more samples in the data, then the weighted average is preferred.

One more metric you can use is Precision-Recall, which is a useful measure of the success of prediction when the classes are very imbalanced.
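
Both averaging flavors are a keyword argument away, for example with the precision_score we imported earlier:

print(precision_score(y_eval, y_pred, average='macro'))     # treats all classes equally
print(precision_score(y_eval, y_pred, average='weighted'))  # weights classes by support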

Feature Importance

Let’s also plot the feature importance to see which features matter more.

# pair each importance value with its feature name, sorted by importance
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_, X.columns)), columns=['Value', 'Feature'])

plt.figure(figsize=(20, 15))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features')
plt.tight_layout()
# plt.savefig('lightgbm_fimp.png')

From the plot above, the wind speed features and precipitation stand out as the strongest predictors.

The Beaufort scale features we created seem to have very low importance, so it might be better to remove them.

Interestingly, min.atmos.pressure in region A is among the most important features, whereas min.atmos.pressure in the other regions is among the lowest in importance. This makes sense: only region A actually has pressure readings, while the values for the other regions were entirely imputed with a constant mean, leaving them with no signal.

Saving the model

When you’re satisfied with your model, you can save it as a pickle file with joblib

joblib.dump(clf, 'lgb1.pkl')
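
Loading it back later is symmetric:

clf = joblib.load('lgb1.pkl')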

Predict on test data

We have a simple model built; it’s time to make predictions on the test data.

X = test.drop('date', axis=1)

# Convert cat_features to pd.Categorical dtype
for col in cat_feats:
    X[col] = pd.Categorical(X[col])

Let’s have a peek at the final submission file.

test_preds = clf.predict(X)
submission_df = pd.concat([test['date'], pd.DataFrame(test_preds, columns=['label'])], axis=1)
submission_df.head()

The labels are still encoded as numeric values; let’s bring the actual label names back.

Since we already have the dictionary mapping label names to numeric values (e.g. 'H': 0), we can invert it to map the numbers back to the names.

inv_map = {v: k for k, v in le_name_map.items()}
inv_map

{0: 'H', 1: 'L', 2: 'N'}
submission_df['label'] = submission_df['label'].map(inv_map)  
submission_df.head()

Save the prediction file.

submission_df.to_csv('solution.csv', index=False)

Next steps

The base model won’t be enough to make a good prediction; here are some next steps to improve upon the given approach.

  1. More feature preprocessing and engineering
  2. Use cross-validation to have a better measure of the performance
  3. Use Hyperopt or Optuna to tune the hyperparameters of the LightGBM model
  4. Tune the class_weight parameter of LightGBM directly to handle the imbalanced classes (see the sketch after this list)
  5. Balance the dataset by undersampling (sampling fewer rows from the majority class to match the minority class) or oversampling (using algorithms like SMOTE to augment the dataset with synthetic samples)
  6. Test out other algorithms like KNN, SVM, XGBoost, CatBoost, etc.
  7. Join the bitgrit discord server to discuss the challenge with other data scientists
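
As a taste of steps 2 and 4, here’s a sketch (assuming X and y are the full training matrices from earlier) that combines class weighting with stratified cross-validation:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# weight classes inversely to their frequency and score with macro F1,
# which treats the three labels equally
clf_balanced = lgb.LGBMClassifier(class_weight='balanced', random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf_balanced, X, y, scoring='f1_macro', cv=cv)
print(f"macro F1: {scores.mean():.3f} ± {scores.std():.3f}")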

Have fun in this competition!