Building an XGBoost Model to Predict Video Popularity
A thorough guide to building a simple XGBoost machine learning model for a data science competition.
The factors that determine whether a video goes viral are hard to narrow down, especially since popularity can be so subjective. What if we could use objective factors such as video metadata and thumbnails to predict how many views a video will get? This is the goal of the Video Popularity Prediction Challenge hosted on the data science competition platform Bitgrit.
In any data science competition, it’s important to start by posing some questions to better understand the problem, the goal, and the data you’re given. So, let’s start by asking these questions:
What is the goal?
The goal of this challenge is to develop a machine learning model that predicts the number of views that videos are likely to receive based on attributes such as duration, language, the number of likes, and the day of the week.
Why predict video popularity?
Besides this algorithm simply being a cool application of data science, video creators want to know how to give their videos the best chance of going viral. If they can improve their chances by publishing their video at a certain time of day or with a certain title, they can improve their chances of being seen by a wider audience.
Also, on the company side, video hosting platforms could use such a prediction algorithm to determine which videos have the potential for high levels of popularity. This can help the video platform effectively price ads on those videos, because they wouldn’t want to place pricey ads on videos that no one would watch.
What does the data look like?
Meta Data
- Views
- Duration
- Language id
- Aspect Ratio
- Day of the week
- ...
Title Data <- Vectorized text (50 dims)
Description Data <- Vectorized text (50 dims)
Thumbnail Data <- reduced pixel data (4000 dims)
The metadata contains the duration, language, and various other information about each video. Of particular importance is the Views column: the target variable that we are predicting.
The title and description data have already been vectorized — in other words, converted into a numerical representation — so that computers can understand the relationship between the text and the target variable. Thanks to these numerical representations, we can use them as features for our machine learning models.
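To make that idea concrete, here is a minimal sketch of one common way raw text can be turned into fixed-length numeric vectors (TF-IDF weights followed by dimensionality reduction). This is purely illustrative: the competition's actual vectorization method isn't documented here, and the toy corpus below is made up.

# Illustrative only: one common recipe for turning raw text into fixed-length
# numeric vectors (TF-IDF weights, then truncated SVD). The competition data
# is already vectorized, and its exact method may differ from this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

titles = [
    "how to train your first machine learning model",
    "top 10 cutest cat moments of the year",
    "machine learning explained in five minutes",
]

tfidf_matrix = TfidfVectorizer().fit_transform(titles)  # sparse bag-of-words matrix
svd = TruncatedSVD(n_components=2)                      # the real data uses 50 dims; 2 fits this tiny corpus
title_vectors = svd.fit_transform(tfidf_matrix)         # one fixed-length vector per title
print(title_vectors.shape)                              # (3, 2)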
Now that we’ve sorted out the why and the what, it is time to figure out the how.
How to approach this challenge?
It can be difficult to get started on a data science competition, especially for those just starting out. Luckily, Bitgrit hosted a webinar just last week — Getting Started: How to Build a Machine Learning Model — which was presented by Jorge Quinteros, a Data Scientist at Bitgrit.
Below, I will summarize lessons from the webinar along with the step-by-step approach to build an XGBoost model for this competition.
Why XGBoost?
Before we dive into the code, let's answer an important question: why was XGBoost chosen for this particular problem, and why is it so popular in data science competitions in general?
To really understand XGBoost, you need some grounding in decision trees and gradient boosting. The underlying idea, though, is simple: form a strong model by combining a large number of simple models, each with poor accuracy on its own.
What really gives XGBoost the upper hand is that it improves on baseline gradient boosting with algorithmic enhancements and systems optimizations built into the library.
A few examples of these enhancements are:
- Parallel tree construction
- Out-of-core computing
- Regularization (e.g. LASSO and Ridge) to avoid overfitting
- Sparsity-aware handling of missing values
- Built-in cross-validation at each iteration
The downside is that it's more of a black-box model, which makes it harder to see which features carry the most predictive power. For example, say that duration has a stronger correlation with view counts and should therefore be a stronger predictor than title or other features. XGBoost won't readily surface that kind of relationship for us.
If you want to know more about gradient boosting, there are plenty of helpful articles that explain the concept well, and the short sketch below shows the core idea in a few lines of code.
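To make the "many weak learners" idea concrete, here is a minimal hand-rolled sketch of gradient boosting on synthetic data: each shallow tree is fit to the residuals of the ensemble built so far. This is an illustration only, not what XGBoost does internally.

# A hand-rolled sketch of gradient boosting with squared error:
# each weak learner (a shallow tree) fits the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, size=500)   # noisy synthetic target

learning_rate = 0.2
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(50):
    residuals = y - prediction                     # where the ensemble is still wrong
    tree = DecisionTreeRegressor(max_depth=2)      # a deliberately weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge the ensemble toward the target
    trees.append(tree)

print("RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))

XGBoost layers regularization, second-order gradient information, and the systems optimizations listed above on top of this basic loop.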
Now, it’s time to dive into the implementation!
Load libraries
# Load libraries
import os
import pandas as pd
import numpy as np
import math
import random
import collections
import timeit
import xgboost as xgb
import sklearn.metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
We will import the following libraries:
- pandas, numpy, and sklearn are the must-haves for data science
- xgboost is the Python library that supports the XGBoost model
- The os library is for getting our current working directory
- math can be used for data transformations
- collections for counting
- timeit to time our model training
Load Data
To get the dataset, go ahead and register for Bitgrit’s competition by March 31! Then paste this code into your code cell to load the data:
# Load training datasets
desc_train = pd.read_csv('/path/train_desc_df.csv')
meta_train = pd.read_csv('/path/train_meta_df.csv')
image_train = pd.read_csv('/path/train_image_df.csv')
title_train = pd.read_csv('/path/train_title_df.csv')
# Load public datasets (datasets used for the rankings)
desc_test = pd.read_csv('/path/public_desc_df.csv')
meta_test = pd.read_csv('/path/public_meta_df.csv')
image_test = pd.read_csv('/path/public_image_df.csv')
title_test = pd.read_csv('/path/public_title_df.csv')
print('Dimension of train description data is', desc_train.shape)
print('Dimension of train meta data is', meta_train.shape)
print('Dimension of train image data is', image_train.shape)
print('Dimension of train title data is', title_train.shape)
print('Dimension of test description data is', desc_test.shape)
print('Dimension of test meta data is', meta_test.shape)
print('Dimension of test image data is', image_test.shape)
print('Dimension of test title data is', title_test.shape)
We import our .csv files into train and test datasets, which are conveniently named "train" and "public." Here's what our train data looks like:
We see that all our train data has about 3,000 rows, which is not too huge, but that our image data has 4,001 columns. This could be computationally intensive, so we might have to do some dimensionality reduction later on.
As for our test data, here is the output:
We see that there are 986 rows, meaning we are given metadata, image, description, and title data for 986 videos, and our goal is to predict the number of views for each of them.
Exploratory Data Analysis
Exploratory Data Analysis (or EDA for short) is important to discover trends in your data and figure out what transformations are needed to prepare them for modeling.
Metadata
By calling meta_train.head(), we can get a peek into our metadata dataset.
meta_train.head()
Here, we can see all the features in our metadata for five videos, where views is the column we want to predict (i.e. views are not provided for our meta_test data).
Now let’s do the same for the rest of the dataset.
Image Data
image_train.head()
Here, we see that our image data is normalized within a range from -1 to 1. There’s not much we can do with this data, but it’s good to know what it looks like.
Description Data
desc_train.head()
Each column represents one coordinate in a 50-dimensional space. We can't visualize this data directly, but it's important to know that vectors with similar values mean the original texts were also similar.
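As a rough sanity check of that idea, you could compare two description vectors with cosine similarity; a higher score suggests the underlying texts were more alike. A small sketch, assuming desc_train is the data frame loaded earlier with a comp_id column plus the 50 embedding columns:

# Compare two description vectors; higher cosine similarity suggests the
# original texts were more alike. Assumes desc_train has 'comp_id' plus the
# 50 embedding columns, as loaded above.
from sklearn.metrics.pairwise import cosine_similarity

desc_vectors = desc_train.drop(columns=['comp_id']).to_numpy()
similarity = cosine_similarity(desc_vectors[[0]], desc_vectors[[1]])[0, 0]
print('Cosine similarity between video 0 and video 1:', round(similarity, 3))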
Title Data
title_train.head()
In the title data, we can see a similar pattern.
Missing Values
An important step is to check for missing values, which we can do by typing this into the code cell:
meta_train.isnull().sum()
Phew, no missing values! If we did have them, there are many ways to handle it, but the usual approach is imputation: replacing missing values with the mean, the median, or values from the nearest neighbors.
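Just to show what that would look like, here is a minimal imputation sketch on a made-up data frame with a few gaps (our actual data doesn't need this, and the column names are only for illustration):

# Minimal imputation sketch on a made-up data frame (our actual data has no
# missing values). SimpleImputer fills with a column statistic; KNNImputer
# borrows values from the most similar rows.
from sklearn.impute import SimpleImputer, KNNImputer

toy = pd.DataFrame({'duration': [120, np.nan, 95, 300],
                    'n_likes':  [10, 25, np.nan, 40]})

mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)
print(knn_filled)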
Data Types
meta_train.dtypes
This code shows that most of our metadata are integers. Right now, some of our columns are categorical data, which means we have to convert them into dummy variables with one-hot encoding later on when we get into data preprocessing.
Plotting
It’s time for some plotting! Let’s look at the distribution of our hour
column, which represents when the video was uploaded to the hosting platform.
meta_train['hour'].hist(bins=24)
We see that there's a spike in the hours data at around 5 a.m. Since creators upload from many different time zones, this may not reflect their local time, but it's still useful information.
Next, let's look at the dayofweek distribution, which tells us which day of the week each video was uploaded.
meta_train['dayofweek'].hist(bins=24)
It looks like we have no data for Friday, which is odd. We also see that most uploads happen on Monday and Saturday.
Next up, we’ll look at the distribution of our views
column.
meta_train['views'].hist(bins=20)
Because we have outliers at the upper end of this view data, this plot isn’t very useful. The solution here is to apply a log transformation to our data, like so:
np.log10(meta_train['views']).hist(bins=20)
Aha! This distribution is much more useful. We now can tell that the distribution is right-skewed, with the most frequent views being around 10² (or 100), and the least frequent views around 10⁴ (or 10,000).
Now we’ll look at the distribution of our duration
column.
meta_train['duration'].hist(bins=20)
We see a similar problem with outliers, but this time we can't apply the log transformation directly. Because our duration column has some zero values, taking the log would produce -inf values, so we replace those -inf values with zero before plotting the log version.
duration_log = np.log10(meta_train['duration']).replace(-np.inf, 0)
duration_log.hist(bins=20)
Now we can see that most durations are around 10² seconds (about 100 seconds), which is a little under two minutes.
Value Counts
Let’s speed run through the other values in our metadata.
meta_train['ad_blocked'].value_counts()
Most of the videos are not ad blocked.
meta_train['embed'].value_counts()
And most of the videos are embedded.
Cross-tabulation
pd.crosstab(meta_train.partner, meta_train.partner_active)
By performing cross-tabulation on the active and inactive partners, we can see that some of the videos have inactive partners.
Correlations
# correlation btwn numerical variables
cor_tbl_df = meta_train[['views', 'ratio', 'duration', 'language', 'n_likes', 'n_tags','n_formats', 'dayofweek', 'hour']]
sort_n = cor_tbl_df.corr().sort_values('views', ascending=False).index
cor_tbl_df.corr()[sort_n].iloc[0]
Interesting! With the corr() function in pandas, we can see that n_likes, language, and ratio have high correlations with our views variable.
Now that we’ve done a fair bit of EDA and we better understand our data, let’s do some data preprocessing.
Data Preprocessing
Applying one-hot encoding
One-hot encoding is the process of converting categorical data into a binary vector representation for use in machine learning algorithms. If you want to read more about it, there are plenty of good articles on the topic.
To perform one-hot-encoding, pandas has a nifty function called get_dummies()
for us to convert our categorical variables into dummy variables.
The first parameter is our variable, and the second is prefix, the string prepended to the new data frame column names. For example, if we pass our language column to get_dummies with the prefix "language", we get language_1, language_2, language_3, … language_10; if we set the prefix to "lang", we get lang_1, lang_2, and so on.
embed = pd.get_dummies(meta_train.embed, prefix ='embed')
partner = pd.get_dummies(meta_train.partner, prefix ='partner')
partner_active = pd.get_dummies(meta_train.partner_active, prefix ='partner_a')
language = pd.get_dummies(meta_train['language'], prefix='language')
weekday = pd.get_dummies(meta_train['dayofweek'], prefix='day')
weekday['day_6'] = 0
Note that weekday['day_6'] was set to 0 because the data is missing day 6 (Friday). Before applying one-hot encoding, remember to watch out for missing categories like this!
To get an idea of what each data frame looks like, let’s take a look at the language data frame.
Instead of having one column of integer language codes, we now have 10 columns where each row is a binary indicator of which language it is (e.g. in row 1, the language is language 3).
Cyclical features encoding
Since our data includes an hour column, simple one-hot encoding isn't a great fit: hours are cyclical, so hour 23 and hour 0 are actually neighbors. Instead, we will perform cyclical feature encoding as follows, mapping each hour onto a circle with sine and cosine. This is where a bit of trigonometry comes in handy!
sin_hour = np.sin(2*np.pi*meta_train['hour']/24.0)
sin_hour.name = 'sin_hour'
cos_hour = np.cos(2*np.pi*meta_train['hour']/24.0)
cos_hour.name = 'cos_hour'
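As a quick sanity check of why this encoding is worth the trouble, the sketch below shows that 23:00 and 00:00 land next to each other on the sine/cosine circle, while 12:00 ends up far away, which is exactly the neighborhood structure plain one-hot encoding would throw away:

# Sanity check: on the sine/cosine circle, hour 23 and hour 0 are neighbors,
# while hour 12 sits on the opposite side of the circle.
hours = np.array([23, 0, 12])
points = np.column_stack([np.sin(2 * np.pi * hours / 24.0),
                          np.cos(2 * np.pi * hours / 24.0)])
print('distance 23h to 0h :', np.linalg.norm(points[0] - points[1]).round(3))  # small
print('distance 23h to 12h:', np.linalg.norm(points[0] - points[2]).round(3))  # large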
After the transformations, we can now join all of the transformed columns into one data frame using the concat method.
# Join all dataframes.
meta_final_df = pd.concat([meta_train[['comp_id', 'views', 'ratio', 'language', 'n_likes', 'duration']].reset_index(drop=True),
embed, partner, partner_active, language, weekday, sin_hour, cos_hour], axis=1)
meta_final_df.head()
meta_final_df.shape
Checking the shape attribute, our final meta_final_df now has a whopping 31 columns!
Lasso regression
As mentioned earlier, our image data has 4,000 columns, which can slow down our model significantly. To avoid this, we will be using lasso regression.
The main idea of lasso is to find the set of predictors that minimizes prediction error for a quantitative target variable by imposing a constraint on the model parameters that causes some coefficients to shrink to exactly zero. That last part is important because it lets us drop those features and reduce the dimensionality of our image data.
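To see that shrink-to-zero behavior in isolation, here is a small illustration using sklearn's Lasso on synthetic data where only 5 of 100 features actually matter. This is a conceptual sketch only; the competition code below wraps an L1-penalized LogisticRegression in SelectFromModel instead.

# Conceptual illustration: lasso drives most coefficients of uninformative
# features to exactly zero. Synthetic data, not the competition data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X_toy, y_toy = make_regression(n_samples=200, n_features=100,
                               n_informative=5, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0), 'out of', len(lasso.coef_))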
First, let’s set our views
column as our target variable and consider all columns except for comp_id
(because it isn’t an image pixel) as features.
# Set the target as well as dependent variables from image data.
y = meta_train['views']
x = image_train.loc[:, image_train.columns != 'comp_id'] #ignore comp_id variable
# Run Lasso regression for feature selection.
sel_model = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
# time the model fitting
start = timeit.default_timer()
# Fit the trained model on our data
sel_model.fit(x, y)
stop = timeit.default_timer()
print('Time: ', stop - start)
We then build our feature selector with sklearn's SelectFromModel, wrapping a LogisticRegression estimator with the following arguments:
- C=1 → the inverse of regularization strength, where smaller values specify stronger regularization
- penalty='l1' → choose L1 regularization, which is LASSO
- solver='liblinear' → the algorithm to use for the optimization problem
Read more about sklearn's logistic regression in the official documentation.
Next, we fit the model to our x and y values.
# get index of good features
sel_index = sel_model.get_support()
# count the no of columns selected
counter = collections.Counter(sel_model.get_support())
counter
After it trains, we can use the get_support() function, which returns a boolean mask indicating which features were selected.
Output: Counter({False: 2742, True: 1258})
With the collections library, we have a total of 1,258 columns left — down from 4,000!
# Reconstruct the image dataframe using the index information above.
image_index_df = pd.DataFrame(x[x.columns[(sel_index)]])
image_final_df = pd.concat([image_train[['comp_id']], image_index_df], axis=1)
image_final_df.head()
With the indexes of the important features, we can subset our original image data and then concatenate the comp_id column back in using the axis=1 argument, which means joining by column. Here's what our data frame looks like now:
Merge everything into one data frame
Now that we’ve performed the necessary transformation on our data, it’s time to merge all of the separate datasets into one so we can use it for our machine learning model!
# Merge all tables based on the column 'comp_id'
final_df = pd.merge(pd.merge(meta_final_df, image_final_df, on = 'comp_id'),
pd.merge(desc_train, title_train, on = 'comp_id'), on = 'comp_id')
final_df.shape # (3000, 1389)
When merging data frames, it helps to have a shared key column so they can be joined with the on argument. In our case, the comp_id column is present in all of our datasets.
Our final data frame has 3,000 rows and 1,389 columns, most of which are from our image data.
Preprocessing on Public/Test Data
Whenever we apply transformation to our training data, we have to do the same with our public data, so let’s do that now using this code:
# Test set
p_embed = pd.get_dummies(meta_test.embed, prefix ='embed')
p_partner = pd.get_dummies(meta_test.partner, prefix ='partner')
p_partner_active = pd.get_dummies(meta_test.partner_active, prefix ='partner_a')
p_language = pd.get_dummies(meta_test['language'], prefix='language')
p_language['language_6'] = 0
p_weekday = pd.get_dummies(meta_test['dayofweek'], prefix='day')
p_weekday['day_3'] = 0
p_weekday['day_4'] = 0
p_weekday['day_5'] = 0
## Cyclical encoding
p_sin_hour = np.sin(2*np.pi*meta_test['hour']/24.0)
p_sin_hour.name = 'sin_hour'
p_cos_hour = np.cos(2*np.pi*meta_test['hour']/24.0)
p_cos_hour.name = 'cos_hour'
# Join all dataframes.
p_meta_final_df = pd.concat([meta_test[['comp_id', 'ratio', 'language', 'n_likes', 'duration']].reset_index(drop=True),
p_embed, p_partner, p_partner_active, p_language, p_weekday, p_sin_hour, p_cos_hour], axis=1)
p_meta_final_df.head()
# Subset our test image dataframe with the columns selected on the training set
p_image_index_df = image_test[image_index_df.columns]
p_image_final_df = pd.concat([image_test[['comp_id']], p_image_index_df], axis=1)
# Merge all test set tables.
p_final_df = pd.merge(pd.merge(p_meta_final_df, p_image_final_df, on = 'comp_id'),
pd.merge(desc_test, title_test, on = 'comp_id'), on = 'comp_id')
p_final_df.shape
This transformation is the same as what we used on our training data, so it should be self-explanatory.
After merging all our public set data frames, we get a dimension of (986, 1388).
Building the XGBoost Model
Now for the fun part — it’s time to start building our machine learning model!
The standard way of training models is to split your train set into a new training set and a validation set, which lets us check how well the model generalizes before touching the final test data.
Overfitting happens when a model picks up the noise in the training data along with the signal, which causes it to generalize poorly when applied to new data.
We train our model on the training portion and then evaluate it on the validation portion to get a rough idea of how well it is performing, fine-tuning parameters or doing more data cleaning as needed.
When we are satisfied with our model accuracy, we apply it on our final testing data and submit it to the data science competition.
Train_test_split
Let’s go ahead and do that with our final_df
train set:
# Convert dataframe to numpy array.
X = final_df.drop(['comp_id', 'views'], axis=1).to_numpy()
y = final_df.loc[:, 'views'].to_numpy()
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 121)
print('Training set shape ', X_train.shape)
print('Test set shape ', X_test.shape)
X in the code above holds our features, and y is our target variable (views). We convert the data frames into numpy arrays so they can later be packed into the data matrix used for XGBoost model training.
Next, we split the numpy arrays using the train_test_split() function, where test_size is the fraction of data held out for testing (25% in this case) and random_state is a seed (any number) that makes the split reproducible.
Training set shape (2250, 1386)
Test set shape (750, 1386)
Here, we can see that we have a 0.75 train and 0.25 test split.
Data matrix
Now we build a data matrix, which is an internal data structure used in XGBoost models for efficiency.
trlabel = y_train
telabel = y_test
dtrain = xgb.DMatrix(X_train, label=trlabel)
dtest = xgb.DMatrix(X_test, label=telabel)
To create a DMatrix, we pass in our feature arrays (X_train and X_test) and tell the function that our labels are the target values in y_train and y_test (the views column).
Setting Parameters
Next, we define the parameters for our XGBoost model as a dictionary.
# Set parameters.
param = {'max_depth': 7,
'eta': 0.2,
'objective': 'reg:squarederror',
'nthread': 5,
'eval_metric': 'rmse'
}
evallist = [(dtest, 'eval'), (dtrain, 'train')]
- max_depth = maximum depth of each decision tree
- eta = the learning rate for our model
- objective = the learning objective; reg:squarederror is the standard choice for regression
- nthread = number of parallel threads used to run XGBoost
- eval_metric = the evaluation metric for the validation data
The evallist variable specifies which datasets to evaluate at each boosting round and the names ('eval' and 'train') that appear in the training log.
To find more parameters for XGBoost models, head over to the official docs.
Train the model
# Train the model.
num_round = 70
bst = xgb.train(param, dtrain, num_round, evallist)
We set the number of rounds/trees for the model to 70, which means we will sequentially build 70 decision trees, each one correcting the errors of the ensemble so far, and track progress with our evaluation metric, root mean squared error (RMSE).
To train our model, we pass the parameters, the training set, the number of rounds, and our evaluation list to the xgb.train function.
You should be seeing something like this when the model is training. Notice how the RMSE decreases each round? That's gradient boosting in action!
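If you'd rather not hand-pick the number of rounds, xgb.train also supports early stopping against the evaluation list. A small optional variation (not part of the original walkthrough) that watches the held-out set and stops once its RMSE stops improving:

# Optional variation: stop adding trees once the held-out RMSE hasn't improved
# for 20 rounds. Early stopping watches the last dataset in evals, so the
# held-out set is listed last here.
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
bst_early = xgb.train(param, dtrain, num_boost_round=500,
                      evals=watchlist, early_stopping_rounds=20)
print('best iteration:', bst_early.best_iteration)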
Out-of-sample Error
Next, we make predictions on the test data based on our trained XGBoost model and compute the RMSE.
# Make prediction.
ypred = bst.predict(dtest).round()
# Compute RMSE on test set.
mse_xgboost = mean_squared_error(y_test, ypred)
rmse_xgboost = math.sqrt(mse_xgboost)
print('RMSE with XGBoost', rmse_xgboost)
Output: RMSE with XGBoost 1133.909102177066
Fit model to public data (test data)
d_public = xgb.DMatrix(p_final_df.loc[:, p_final_df.columns != 'comp_id'][bst.feature_names])
solution = bst.predict(d_public).round()
solution_df = pd.concat([p_final_df[['comp_id']], pd.DataFrame(solution, columns = ['views'])], axis=1)
solution_df.to_csv('solution.csv', index=False)
We create a similar data matrix for the public data, dropping comp_id and subsetting to the features used to train our model.
Next, we make predictions on the public dataset. Our solution variable is an array of the view counts predicted by our XGBoost model.
The final step is to build our data frame by concatenating our comp_id and the views we predicted. The end product is something like this:
And with this, we’re ready to publish the submission file and upload it to the Bitgrit competition!
Tips for improving model
After following this guide, you now have a working model that is making predictions. But say that you want to improve its accuracy. For that, I recommend the following next steps:
1. Hyper-parameter Tuning
When training models, tuning the parameters is essential to improve your model performance. In competitions, even the slightest improvements in accuracy (0.001) can be what you need to climb up the leaderboard.
Here are two awesome articles about tuning XGBoost models:
- Fine-tuning XGBoost in Python like a boss by Félix Revert
- How to Tune the Number and Size of Decision Trees with XGBoost in Python by Jason Brownlee
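As a rough starting point before reading those, the sketch below uses XGBoost's built-in cross-validation to compare a few max_depth and eta combinations. It assumes the param dictionary and dtrain DMatrix from earlier, and the grid values are arbitrary placeholders rather than recommendations.

# Rough tuning sketch: compare a few max_depth / eta combinations with
# XGBoost's built-in cross-validation and keep the best held-out RMSE.
results = []
for max_depth in [4, 7, 10]:
    for eta in [0.05, 0.1, 0.2]:
        trial = {**param, 'max_depth': max_depth, 'eta': eta}
        cv = xgb.cv(trial, dtrain, num_boost_round=200, nfold=5,
                    metrics='rmse', early_stopping_rounds=20, seed=121)
        results.append((max_depth, eta, cv['test-rmse-mean'].min()))

for max_depth, eta, rmse in sorted(results, key=lambda r: r[2]):
    print(f'max_depth={max_depth}, eta={eta}, cv RMSE={rmse:.1f}')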
2. Try other algorithms
There are tons of other algorithms that you can try out besides XGBoost. A couple of popular alternatives are CatBoost and LightGBM.
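To give a taste of what swapping libraries looks like, here is a hedged sketch that feeds the same train/validation split to LightGBM's scikit-learn interface. It assumes the lightgbm package is installed, and the parameter values are placeholders rather than tuned choices.

# Hedged sketch: LightGBM on the same split. Assumes `pip install lightgbm`;
# the parameters below are placeholders, not tuned for this competition.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(n_estimators=200, learning_rate=0.1, max_depth=7)
lgbm.fit(X_train, y_train)
lgbm_rmse = math.sqrt(mean_squared_error(y_test, lgbm.predict(X_test)))
print('RMSE with LightGBM', lgbm_rmse)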
You can also try out deep learning techniques using the popular libraries TensorFlow and PyTorch.
If you want to implement this XGBoost model yourself, check out bitgrit’s online AI competitions. These competitions are a great way to get hands-on with data and experiment with different machine learning algorithms, so don’t miss out on this challenge!