Predicting Rain with Machine Learning

Will it rain tomorrow? Let’s build an AI model to answer that.

Looking for a fun challenge to grow your data science skills?

Here’s a real-world problem for you.

Problem Statement

The weather has a significant impact on the agricultural industry, so being able to predict it helps farmers with day-to-day decisions such as how to plan efficiently, minimize costs, and maximize yields.

A major agricultural company needs you to help them maximize growth efficiency, save resources and optimize their production.

To achieve these things, the company needs to have an accurate weather prediction algorithm that will improve their decision-making on typical farming activities such as planting and irrigating.

Using historical weather information from their region, can you predict what the weather will be like in the next few days?

You now have a clear goal.

The goal 🥅

Predict the next day’s weather as one of three labels.

  • N — No rain
  • L — Light rain
  • H — Heavy rain

Let’s now look at the data.

The data 💾

📂 train
├── region_A_train.csv
├── region_B_train.csv
├── region_C_train.csv
├── region_D_train.csv
├── region_E_train.csv
├── solution_format.csv
└── solution_train.csv

📂 test
├── region_A_test.csv
├── region_B_test.csv
├── region_C_test.csv
├── region_D_test.csv
└── region_E_test.csv

The data has been conveniently split into train and test datasets.

In both train and test, you’re given weather data for five anonymized, neighboring locations, named region A through region E.

Here’s a look at the first five rows of region_A_train.csv

| date | avg.temp | max.temp | min.temp | precipitation | avg.wind.speed | max.wind.speed | max.wind.speed.dir | max.inst.wind.speed | max.inst.wind.speed.dir | min.atmos.pressure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 229b70a3 | 3.3 | 10.2 | -2.4 | 0 | 2.9 | 9.3 | W | 14.3 | W | 1015.1 |
| 3134f4ff | 5.7 | 13.7 | -2.9 | 0 | 3.6 | 10.7 | W | 15.8 | W | 1011.3 |
| dbfaf910 | 13.8 | 20 | 9 | 0 | 5.3 | 9.4 | SW | 15.2 | W | 1004.2 |
| 3aea0cf0 | 11.4 | 19.3 | 5.8 | 0 | 4.2 | 10.1 | SW | 20.6 | SW | 1001.7 |
| f0227f56 | 2.4 | 7.7 | 0.3 | 43.5 | 0.9 | 3.7 | SW | 5.7 | SW | 1003.5 |

The first thing you should notice is that the date column isn’t a date but was anonymized to be some random value.

There are 10 features in total, covering temperature, precipitation, wind speed, wind direction, and atmospheric pressure.

Then, looking at solution_format.csv

| date | label |
| --- | --- |
| a8c6911b | N |
| eebdce12 | N |
| 6fb420a6 | L |
| 3bf8b132 | N |

We can utilize the date column to join it with the training data and build a model.

Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.

Get the data by registering for this data science competition and follow along with this article!

Code for this article → Google Colab or Deepnote.

Installing libraries and models

First, install the weapon of our choice: lightgbm

%pip install lightgbm

Load Libraries

Next, we load up some essential libraries for visualizations and machine learning.

# essentials
import numpy as np 
import pandas as pd 

# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.dpi'] = 100
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="whitegrid")
%matplotlib inline

# ml
from sklearn.metrics import accuracy_score, recall_score, ConfusionMatrixDisplay, classification_report, auc, precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import joblib

Load the data

Let’s read in all the data we have.

train_a = pd.read_csv("data/comp_datasets_train/region_A_train.csv")
train_b = pd.read_csv("data/comp_datasets_train/region_B_train.csv")
train_c = pd.read_csv("data/comp_datasets_train/region_C_train.csv")
train_d = pd.read_csv("data/comp_datasets_train/region_D_train.csv")
train_e = pd.read_csv("data/comp_datasets_train/region_E_train.csv")

test_a = pd.read_csv("data/comp_datasets_test/region_A_test.csv")
test_b = pd.read_csv("data/comp_datasets_test/region_B_test.csv")
test_c = pd.read_csv("data/comp_datasets_test/region_C_test.csv")
test_d = pd.read_csv("data/comp_datasets_test/region_D_test.csv")
test_e = pd.read_csv("data/comp_datasets_test/region_E_test.csv")

labels_df = pd.read_csv("data/comp_datasets_train/solution_train.csv")
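
From here on, we’ll work with one combined dataframe per split, train_all_lvls and test_all_lvls, where a region column tags each row with its source file. The exact combining code is in the notebook; a minimal sketch might look like this:

# stack the five regional frames, tagging each row with its region
train_frames = {"A": train_a, "B": train_b, "C": train_c, "D": train_d, "E": train_e}
test_frames = {"A": test_a, "B": test_b, "C": test_c, "D": test_d, "E": test_e}

train_all_lvls = pd.concat(
    [df.assign(region=name) for name, df in train_frames.items()], ignore_index=True
)
test_all_lvls = pd.concat(
    [df.assign(region=name) for name, df in test_frames.items()], ignore_index=True
)

# put region first so the feature columns start at position 2
cols = ["region"] + [c for c in train_all_lvls.columns if c != "region"]
train_all_lvls = train_all_lvls[cols]
test_all_lvls = test_all_lvls[cols]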

Exploratory Data Analysis

It’s time for the fun part, visualizing the data.

train_a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   date                     566 non-null    object 
 1   avg.temp                 566 non-null    float64
 2   max.temp                 566 non-null    float64
 3   min.temp                 566 non-null    float64
 4   precipitation            566 non-null    float64
 5   avg.wind.speed           566 non-null    float64
 6   max.wind.speed           566 non-null    float64
 7   max.wind.speed.dir       566 non-null    object 
 8   max.inst.wind.speed      566 non-null    float64
 9   max.inst.wind.speed.dir  566 non-null    object 
 10  min.atmos.pressure       566 non-null    float64
dtypes: float64(8), object(3)
memory usage: 48.8+ KB

Distribution of target class

Let’s first visualize our target class.

sns.countplot(x = 'label', data = labels_df, palette="Set1");

It appears we have an imbalanced dataset on our hands, as the N label dominates the other classes.

Why is this a problem?

The model will be biased towards classes with a larger amount of samples.

This happens because the classifier has more information on classes with more samples, so it learns how to predict those classes better while it remains weak on the smaller classes.

In our case, the label N will be predicted more than other classes.
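
To put a number on the imbalance, we can check the class proportions directly:

# fraction of samples in each class
labels_df['label'].value_counts(normalize=True)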

Let’s keep this in mind and move forward to more visualizations.

Plotting the features

We have ten features in each region.

train_all_lvls.columns[2:]
Index(['avg.temp', 'max.temp', 'min.temp', 'precipitation', 'avg.wind.speed',
       'max.wind.speed', 'max.wind.speed.dir', 'max.inst.wind.speed',
       'max.inst.wind.speed.dir', 'min.atmos.pressure'],
      dtype='object')

It’s time to cook up some plots.

Since all the regions are used to predict the next day’s weather, let’s see whether all the regions share similar patterns in the features and whether any outliers or anomalies exist.

fig, axes = plt.subplots(5,2,figsize=(14, 30), dpi=100)

for i, col_name in enumerate(train_all_lvls.columns[2:]):
    if train_all_lvls[col_name].dtype == 'O':
        train_all_lvls.groupby('region')[col_name].hist(ax=axes[i%5][i//5], alpha=0.6);
        axes[i%5][i//5].legend(["A", "B", "C", "D", "E"]);
    else:
        train_all_lvls.groupby('region')[col_name].plot(ax=axes[i%5][i//5], alpha=0.7);
        axes[i%5][i//5].legend();
    axes[i%5][i//5].set_title(f'{col_name}', fontsize=13);
    plt.subplots_adjust(hspace=0.45)

From the plots, we see that the patterns in the data are very similar across regions, except that regions C, D, and E sit on the lower end of the scale for avg.wind.speed and max.wind.speed.

Now that we’ve explored the data a little, we check for missing values in our data.

Missing values

Using my custom helper function, we see that a large amount of data is missing for the min.atmos.pressure variable:

missing_cols(train_all_lvls)
min.atmos.pressure => 2264 [80.0%]
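
The helper itself isn’t shown in this article; a minimal version that produces output in this format could look like this (a sketch, not the exact notebook code):

def missing_cols(df):
    # print the count and percentage of missing values per column
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        if n_missing > 0:
            print(f"{col} => {n_missing} [{100 * n_missing / len(df):.1f}%]")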

Let’s also use a heatmap to visualize the missing data for that column.

plt.figure(figsize=(10, 6))
sns.heatmap(train_all_lvls.isnull(), yticklabels=False, cmap='viridis', cbar=False);

There are multiple ways we can deal with missing data.

Let’s first look at the distribution of the column with missing values.

train_all_lvls['min.atmos.pressure'].hist();

Since the distribution of the column is approximately normal, let’s impute the missing data with the mean.

mean_atmos = train_all_lvls['min.atmos.pressure'].mean()
train_all_lvls['min.atmos.pressure'] = train_all_lvls['min.atmos.pressure'].fillna(mean_atmos)

We then do the same for the test data (full code in the notebook).
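
One caveat worth noting: to avoid leaking test-set statistics into the pipeline, it’s common to reuse the training mean when filling the test column, along these lines (assuming the combined test frame is named test_all_lvls):

# impute the test column with the mean computed on the training data
test_all_lvls['min.atmos.pressure'] = test_all_lvls['min.atmos.pressure'].fillna(mean_atmos)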

Feature Preprocessing & Engineering

Converting data types

Let’s first add the labels to our data.

train_all_lvls = train_all_lvls.merge(labels_df, on="date")

Then we take a look at the categorical columns for our dataset.

train_all_lvls.select_dtypes('object').columns
Index(['region', 'date', 'max.wind.speed.dir', 'max.inst.wind.speed.dir',
       'label'],
      dtype='object')

We’ll have to convert the categorical features, including the target variable, to a numerical format.

Let’s use scikit-learn’s Label Encoder to do that.

Here’s an example of using LabelEncoder() on the label column

le = LabelEncoder()
le.fit(train_all_lvls['label'])
le_name_map = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_map

{'H': 0, 'L': 1, 'N': 2}

Creating new weather features

Since we’re given some weather features, there are interesting new features we can engineer.

One example is the Beaufort scale, which is “an empirical measure that relates wind speed to observed conditions at sea or on land.”

Below is a function to do our feature engineering.

BEAUFORT = [
    # (Beaufort number, lower bound, upper bound)
    (0, 0, 0.3),
    (1, 0.3, 1.6),
    (2, 1.6, 3.4),
    (3, 3.4, 5.5),
    (4, 5.5, 8),
    (5, 8, 10.8),
    (6, 10.8, 13.9),
    (7, 13.9, 17.2),
    (8, 17.2, 20.8),
    (9, 20.8, 24.5),
    (10, 24.5, 28.5),
    (11, 28.5, 33),
    (12, 33, 200),
]


def feature_eng(df):
    le = LabelEncoder()

    # skip the first two object columns (region and date) and encode the rest:
    # the wind direction columns and, for the train set, the label
    cat_cols = df.select_dtypes("object").columns[2:]

    for col in cat_cols:
        df[col] = le.fit_transform(df[col])

    # wind speed is in meters/second; convert to knots (1 m/s ~ 1.944 kn)
    # and bin the average and maximum wind speeds into Beaufort numbers
    for scale, lower, upper in BEAUFORT:
        df.loc[
            (df["avg.wind.speed"] * 1.944 >= lower) & (df["avg.wind.speed"] * 1.944 < upper),
            "avg_beaufort_scale",
        ] = scale
        df.loc[
            (df["max.wind.speed"] * 1.944 >= lower) & (df["max.wind.speed"] * 1.944 < upper),
            "max_beaufort_scale",
        ] = scale

    df['avg_beaufort_scale'] = df['avg_beaufort_scale'].astype(int)
    df['max_beaufort_scale'] = df['max_beaufort_scale'].astype(int)

    return df

We can easily apply the feature engineering steps on the train and test dataset.

train = feature_eng(train_all_lvls)
test = feature_eng(test_all_lvls)

Prepare train data

Let’s look at our train data so far.

train.head()
| region | date | avg.temp | max.temp | min.temp | precipitation | avg.wind.speed | max.wind.speed | max.wind.speed.dir | max.inst.wind.speed | max.inst.wind.speed.dir | min.atmos.pressure | label | avg_beaufort_scale | max_beaufort_scale |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 229b70a3 | 3.3 | 10.2 | -2.4 | 0 | 2.9 | 9.3 | 7 | 14.3 | 7 | 1015.1 | 2 | 4 | 8 |
| B | 229b70a3 | 2.6 | 9.3 | -3.3 | 0 | 2.8 | 6.3 | 3 | 11.4 | 3 | 1010.216254 | 2 | 3 | 6 |
| C | 229b70a3 | 1.4 | 9 | -4.3 | 0 | 1 | 2.9 | 6 | 8.2 | 1 | 1010.216254 | 2 | 2 | 4 |
| D | 229b70a3 | 3.7 | 11.7 | -2.6 | 0 | 1.5 | 5.4 | 6 | 9.4 | 4 | 1010.216254 | 2 | 2 | 5 |
| E | 229b70a3 | -0.6 | 5.2 | -5.9 | 0 | 1.5 | 3.8 | 3 | 9.8 | 7 | 1010.216254 | 2 | 2 | 4 |

It’s time to pivot our data into a wider format: instead of region being a single column, each feature gets one column per region, such as avg.temp_A, avg.temp_B, up to avg.temp_E, and likewise for the rest of the features.

This way the data will be in the right shape to build the model.

We can use the pivot_table function in pandas to achieve that.

# pivot so each (feature, region) pair becomes its own column;
# every (date, region) pair is unique, so pivot_table's default mean
# aggregation simply passes the values through
train = train.pivot_table(index=["date", "label"], columns="region")
train = pd.DataFrame(train.to_records())
train.head()
| date | label | ('avg.temp', 'A') | ('avg.temp', 'B') | ('avg.temp', 'C') | ('avg.temp', 'D') | ('avg.temp', 'E') | ('avg.wind.speed', 'A') | ('avg.wind.speed', 'B') | ('avg.wind.speed', 'C') | … | ('min.temp', 'A') | ('min.temp', 'B') | ('min.temp', 'C') | ('min.temp', 'D') | ('min.temp', 'E') | ('precipitation', 'A') | ('precipitation', 'B') | ('precipitation', 'C') | ('precipitation', 'D') | ('precipitation', 'E') |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0173aec2 | 2 | 18.7 | 17.6 | 16.9 | 19.5 | 14.3 | 1.6 | 1.8 | 0.9 | … | 14.9 | 12.5 | 12.9 | 16.4 | 9 | 1 | 1.5 | 1.5 | 0 | 4.5 |
| 0083f291 | 1 | 13.1 | 12.6 | 12 | 13 | 10.7 | 1.4 | 1 | 0.7 | … | 11.3 | 11.1 | 10.6 | 12.1 | 8.5 | 50 | 46.5 | 49 | 45.5 | 63 |
| 014cfe7b | 2 | 19.9 | 19 | 17.5 | 19.9 | 16.2 | 3.7 | 3.6 | 0.7 | … | 16.1 | 13.5 | 15.8 | 16.3 | 12.4 | 18 | 17 | 63.5 | 19 | 32 |
| 01947c8e | 2 | 21.6 | 20.2 | 20.5 | 21.3 | 17.6 | 1.6 | 1.2 | 1.1 | … | 15.8 | 14.2 | 15.5 | 15.9 | 11.5 | 0 | 1 | 0 | 0 | 0 |
| 0258884d | 2 | 15.2 | 13.9 | 13.9 | 15.8 | 11.1 | 2.5 | 2.3 | 1.2 | … | 10.3 | 5.6 | 8.9 | 10.4 | 5.3 | 0 | 0 | 0 | 0 | 0 |

(The … column stands for the features omitted in the truncated display.)

The column names aren’t ideal, so I wrote a function to fix that.

def replace_all(text):
    # strip the tuple syntax: ('avg.temp', 'A') -> avg.temp_A
    d = {"('": "", "', '": "_", "')": ""}
    for i, j in d.items():
        text = text.replace(i, j)
    return text

test_str = "('avg.temp', 'A')"
replace_all(test_str)

'avg.temp_A'

train.columns = list(map(replace_all, train.columns))
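
An equivalent route, if you’d rather avoid the string surgery, is to flatten the MultiIndex columns before converting to records; here pivoted stands for the direct output of pivot_table above:

# join the two column levels directly: ('avg.temp', 'A') -> avg.temp_A
pivoted.columns = ['_'.join(col) for col in pivoted.columns]
pivoted = pivoted.reset_index()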

Here’s what the train data looks like so far!

train
| date | label | avg.temp_A | avg.temp_B | avg.temp_C | avg.temp_D | avg.temp_E | avg.wind.speed_A | avg.wind.speed_B | avg.wind.speed_C | … | min.temp_A | min.temp_B | min.temp_C | min.temp_D | min.temp_E | precipitation_A | precipitation_B | precipitation_C | precipitation_D | precipitation_E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0173aec2 | 2 | 18.7 | 17.6 | 16.9 | 19.5 | 14.3 | 1.6 | 1.8 | 0.9 | … | 14.9 | 12.5 | 12.9 | 16.4 | 9 | 1 | 1.5 | 1.5 | 0 | 4.5 |
| 0083f291 | 1 | 13.1 | 12.6 | 12 | 13 | 10.7 | 1.4 | 1 | 0.7 | … | 11.3 | 11.1 | 10.6 | 12.1 | 8.5 | 50 | 46.5 | 49 | 45.5 | 63 |
| 014cfe7b | 2 | 19.9 | 19 | 17.5 | 19.9 | 16.2 | 3.7 | 3.6 | 0.7 | … | 16.1 | 13.5 | 15.8 | 16.3 | 12.4 | 18 | 17 | 63.5 | 19 | 32 |
| 01947c8e | 2 | 21.6 | 20.2 | 20.5 | 21.3 | 17.6 | 1.6 | 1.2 | 1.1 | … | 15.8 | 14.2 | 15.5 | 15.9 | 11.5 | 0 | 1 | 0 | 0 | 0 |
| 0258884d | 2 | 15.2 | 13.9 | 13.9 | 15.8 | 11.1 | 2.5 | 2.3 | 1.2 | … | 10.3 | 5.6 | 8.9 | 10.4 | 5.3 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| fe2a1385 | 1 | 2.9 | 1.6 | 1.4 | 3.4 | 0.4 | 1.2 | 0.9 | 0.7 | … | -4 | -5.9 | -2.8 | -3.4 | -5.9 | 0 | 0 | 0 | 0 | 0 |
| fe6dd99c | 1 | 2.9 | 2.9 | 3.9 | 2.9 | 0.2 | 1.6 | 3.9 | 1 | … | 1.7 | 1.4 | 1.9 | 1 | -1.2 | 2 | 1 | 2 | 4 | 8 |
| ff88c3dd | 1 | 9.8 | 8.9 | 9 | 10.2 | 6.3 | 1.9 | 1.5 | 1.4 | … | 1.5 | -0.1 | 2 | 2.5 | -1.9 | 0 | 0 | 0 | 0 | 0 |
| ff929090 | 2 | 10.4 | 8.1 | 7.1 | 11 | 5.7 | 4.3 | 4.6 | 0.9 | … | 3.4 | 1.5 | -0.6 | 3.2 | -1.6 | 0 | 3.5 | 5.5 | 0 | 1.5 |
| ffe3bd74 | 2 | 22.7 | 22.4 | 21.7 | 21.7 | 19.4 | 3.7 | 5.4 | 0.8 | … | 18.8 | 18.8 | 18.4 | 18.9 | 15.2 | 28 | 13.5 | 25.5 | 44 | 49 |

After we do the same to the test data, it’s time to split the train data.

X, y = train.drop(["label", "date"], axis=1), train[["label"]].values.flatten()

LightGBM handles categorical features natively when they’re given the categorical data type, so let’s convert them.

# Extract categoricals and their indices
cat_feats = X.select_dtypes(include=['int64']).columns.to_list()
cat_idx = [X.columns.get_loc(col) for col in cat_feats]

# Convert cat_features to pd.Categorical dtype
for col in cat_feats:
    X[col] = pd.Categorical(X[col])

We can use train_test_split to split our data into the training and evaluation sets.

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)
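
Given the class imbalance we saw earlier, it’s worth knowing that train_test_split can also do a stratified split, which preserves the label proportions in both sets:

# same split, but keeping the label distribution identical in train and eval
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)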

It’s time to build the LightGBM model!

LightGBM

Let’s create a base lgb.LGBMClassifier for a simple first prediction.

We can then fit the training data.

clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)

LGBMClassifier()

Then we call the predict function on the evaluation data

y_pred = clf.predict(X_eval)

Model Performance

Let’s see how a base LightGBM classifier did.

A 99% accuracy can be meaningless for an imbalanced dataset: a model that always predicts the majority class scores high without learning anything. We need more suitable metrics like precision, recall, and a confusion matrix.
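
For reference, the headline numbers are one call away with the metrics we imported earlier; just don’t rely on them alone:

print(f"Accuracy: {accuracy_score(y_eval, y_pred):.3f}")
print(f"Macro recall: {recall_score(y_eval, y_pred, average='macro'):.3f}")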

Confusion matrix

Let’s create a confusion matrix for our model predictions.

First, we need to get the class names and the labels that the label encoder gave so our plot can show the label names.

We then plot a non-normalized and normalized confusion matrix.

class_names = le_name_map.keys()

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    fig, ax = plt.subplots(figsize=(10, 10))

    disp = ConfusionMatrixDisplay.from_estimator(
        clf,
        X_eval,
        y_eval,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
        ax = ax
    )
    disp.ax_.set_title(title)
    disp.ax_.grid(False)

    print(title)
    print(disp.confusion_matrix)
Confusion matrix, without normalization
[[ 3  4  2]
 [ 1 12 26]
 [ 0 19 75]]
Normalized confusion matrix
[[0.33333333 0.44444444 0.22222222]
 [0.02564103 0.30769231 0.66666667]
 [0.         0.20212766 0.79787234]]

As you can see from the shading of the plot, our model predicts the label N much more often than the others.

With the values you see in the plot above, we can compute some metrics that tell us how well our model is doing.

Classification Report

A classification report measures the quality of predictions from a classification algorithm.

It answers the question of how many predictions were correct and how many were not, broken down per class.

More specifically, it uses True Positives, False Positives, True Negatives, and False Negatives to compute the metrics of precision, recall, and f1-score

print(classification_report(y_eval, y_pred))

Let’s break down the output.

  • Precision: what percent of your predictions were correct?
  • Recall: what percent of the positive cases did you catch?
  • F1-score: the harmonic mean of precision and recall
  • Support: the number of occurrences of each given class

If you still have a hard time understanding precision and recall, read this great explanation.

Calculating the metrics

Taking class 0, which is the label H, let’s calculate the precision, recall, and f1-score from the values in the confusion matrix.

TP - True Positive | FP - False Positive
FN - False Negative | TN - True Negative

In the non-normalized matrix, the first row holds the true H samples (3 predicted H, 4 predicted L, 2 predicted N), and the first column holds everything predicted as H (3 true H, 1 true L, 0 true N). So:

Precision = TP/(TP+FP) = 3/(3+1+0) = 3/4 = 0.75
Recall = TP/(TP+FN) = 3/(3+4+2) = 3/9 = 0.33
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * 0.75 * 0.33 / (0.75 + 0.33)
         = 0.495 / 1.08
         = 0.458

Macro or weighted?

You might notice three overall values in the f1-score column of the report: 0.64, 0.53, and 0.65, corresponding to the accuracy, macro average, and weighted average rows. Which value should you focus on?

In an imbalanced dataset where all classes are equally important, macro average is a good choice as it treats all classes equally.

If however, you want to assign greater weight to classes with more samples in the data, then the weighted average is preferred.

One more metric you can use is Precision-Recall, which is a useful measure of the success of prediction when the classes are very imbalanced.
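
Both averaging flavors are a keyword argument away, for example with the precision_score we imported earlier:

print(precision_score(y_eval, y_pred, average='macro'))     # treats all classes equally
print(precision_score(y_eval, y_pred, average='weighted'))  # weights classes by support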

Feature Importance

Let’s also plot the feature importance to see which features matter more.

# pair each importance value with its feature name, sorted by importance
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_, X.columns)), columns=['Value', 'Feature'])

plt.figure(figsize=(20, 15))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features')
plt.tight_layout()
# plt.savefig('lightgbm_fimp.png')

From the plot above, the wind speed features and precipitation stand out as the strongest predictors.

The Beaufort scale features we created seem to have very low importance, so it might be better to remove them.

Interestingly, min.atmos.pressure in region A is among the most important features, whereas min.atmos.pressure in the other regions is among the lowest in importance. This makes sense: only region A actually has pressure readings, while the values for the other regions were entirely imputed with a constant mean, leaving them with no signal.

Saving the model

When you’re satisfied with your model, you can save it as a pickle file with joblib

joblib.dump(clf, 'lgb1.pkl')
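
Loading it back later is symmetric:

clf = joblib.load('lgb1.pkl')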

Predict on test data

We have a simple model built; it’s time to make predictions on the test data.

X = test.drop('date', axis=1)

# Convert cat_features to pd.Categorical dtype
for col in cat_feats:
    X[col] = pd.Categorical(X[col])

Let’s have a peek at the final submission file.

test_preds = clf.predict(X)
submission_df = pd.concat([test['date'], pd.DataFrame(test_preds, columns=['label'])], axis=1)
submission_df.head()

The labels are still encoded as numeric values; let’s bring the actual label names back.

Since we already have the dictionary mapping label names to numeric values (e.g. 'H': 0), we can invert it to map the numbers back to the names.

inv_map = {v: k for k, v in le_name_map.items()}
inv_map

{0: 'H', 1: 'L', 2: 'N'}
submission_df['label'] = submission_df['label'].map(inv_map)  
submission_df.head()

Save the prediction file.

submission_df.to_csv('solution.csv', index=False)

Next steps

The base model won’t be enough to make a good prediction; here are some next steps to improve upon the given approach.

  1. More feature preprocessing and engineering
  2. Use cross-validation to have a better measure of the performance
  3. Use Hyperopt or Optuna to tune the hyperparameters of the LightGBM model
  4. Tune the class_weight parameter of LightGBM directly to handle the imbalanced classes (see the sketch after this list)
  5. Balance the dataset by undersampling (sampling fewer rows from the majority class to match the minority class) or oversampling (using algorithms like SMOTE to augment the dataset with synthetic samples)
  6. Test out other algorithms like KNN, SVM, XGBoost, CatBoost, etc.
  7. Join the bitgrit discord server to discuss the challenge with other data scientists
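
As a taste of steps 2 and 4, here’s a sketch (assuming X and y are the full training matrices from earlier) that combines class weighting with stratified cross-validation:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# weight classes inversely to their frequency and score with macro F1,
# which treats the three labels equally
clf_balanced = lgb.LGBMClassifier(class_weight='balanced', random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf_balanced, X, y, scoring='f1_macro', cv=cv)
print(f"macro F1: {scores.mean():.3f} ± {scores.std():.3f}")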

Have fun in this competition!