Building an XGBoost Model to Predict Video Popularity
A thorough guide to building a simple XGBoost machine learning model for a data science competition.
The factors that determine whether a video goes viral are hard to narrow down, especially since popularity can be so subjective. What if we could use objective factors such as video metadata and thumbnails to predict how many views a video will get? This is the goal of the Video Popularity Prediction Challenge hosted on the data science competition platform Bitgrit.
In any data science competition, it’s important to start by posing some questions to better understand the problem, the goal, and the data you’re given. So, let’s start by asking these questions:
What is the goal?
The goal of this challenge is to develop a machine learning model that predicts the number of views that videos are likely to receive based on attributes such as duration, language, the number of likes, and the day of the week.
Why predict video popularity?
Besides this algorithm simply being a cool application of data science, video creators want to know how to give their videos the best chance of going viral. If they can improve their chances by publishing their video at a certain time of day or with a certain title, they can improve their chances of being seen by a wider audience.
Also, on the company side, video hosting platforms could use such a prediction algorithm to determine which videos have the potential for high levels of popularity. This can help the video platform effectively price ads on those videos, because they wouldn’t want to place pricey ads on videos that no one would watch.
What does the data look like?
Meta Data
- Views
- Duration
- Language id
- Aspect Ratio
- Day of the week
- ...
Title Data <- Vectorized text (50 dims)
Description Data <- Vectorized text (50 dims)
Thumbnail Data <- reduced pixel data (4000 dims)
The metadata contains the duration, language, and various other information about each video. Of particular importance is the Views column: the target variable that we are predicting.
The title and description data have already been vectorized — in other words, converted into a numerical representation — so that computers can understand the relationship between the text and the target variable. Thanks to these numerical representations, we can use them as features for our machine learning models.
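To make that idea concrete, here is a minimal sketch of one common way raw text can be turned into fixed-length numeric vectors (TF-IDF weights followed by dimensionality reduction). This is purely illustrative: the competition's actual vectorization method isn't documented here, and the toy corpus below is made up.

# Illustrative only: one common recipe for turning raw text into fixed-length
# numeric vectors (TF-IDF weights, then truncated SVD). The competition data
# is already vectorized, and its exact method may differ from this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

titles = [
    "how to train your first machine learning model",
    "top 10 cutest cat moments of the year",
    "machine learning explained in five minutes",
]

tfidf_matrix = TfidfVectorizer().fit_transform(titles)  # sparse bag-of-words matrix
svd = TruncatedSVD(n_components=2)                      # the real data uses 50 dims; 2 fits this tiny corpus
title_vectors = svd.fit_transform(tfidf_matrix)         # one fixed-length vector per title
print(title_vectors.shape)                              # (3, 2)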
Now that we’ve sorted out the why and the what, it is time to figure out the how.
How to approach this challenge?
It can be difficult to get started on a data science competition, especially for those just starting out. Luckily, Bitgrit hosted a webinar just last week — Getting Started: How to Build a Machine Learning Model — which was presented by Jorge Quinteros, a Data Scientist at Bitgrit.
Below, I will summarize lessons from the webinar along with the step-by-step approach to build an XGBoost model for this competition.
Why XGBoost?
Before we dive into the code, let's answer an important question: why was XGBoost chosen for this particular problem, and why is it so popular in data science competitions in general?
To really understand XGBoost, you need some grounding in decision trees and gradient boosting. The underlying idea, though, is simple: form a strong model by combining a large number of simple models, each with poor accuracy on its own.
What really gives XGBoost the upper hand is that it improves on baseline gradient boosting with algorithmic enhancements and systems optimizations built into the library.
A few examples of these enhancements are:
- Parallel tree construction
- Out-of-core computing
- Regularization (e.g. LASSO and Ridge) to avoid overfitting
- Sparsity-aware handling of missing values
- Built-in cross-validation at each iteration
The downside is that it's more of a black-box model, which makes it harder to see which features carry the most predictive power. For example, say that duration has a stronger correlation with view counts and should therefore be a stronger predictor than title or other features. XGBoost won't readily surface that kind of relationship for us.
If you want to know more about gradient boosting, there are plenty of helpful articles that explain the concept well, and the short sketch below shows the core idea in a few lines of code.
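To make the "many weak learners" idea concrete, here is a minimal hand-rolled sketch of gradient boosting on synthetic data: each shallow tree is fit to the residuals of the ensemble built so far. This is an illustration only, not what XGBoost does internally.

# A hand-rolled sketch of gradient boosting with squared error:
# each weak learner (a shallow tree) fits the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, size=500)   # noisy synthetic target

learning_rate = 0.2
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(50):
    residuals = y - prediction                     # where the ensemble is still wrong
    tree = DecisionTreeRegressor(max_depth=2)      # a deliberately weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge the ensemble toward the target
    trees.append(tree)

print("RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))

XGBoost layers regularization, second-order gradient information, and the systems optimizations listed above on top of this basic loop.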
Now, it’s time to dive into the implementation!
Load libraries
# Load libraries
import os
import pandas as pd
import numpy as np
import math
import random
import collections
import timeit
import xgboost as xgb
import sklearn.metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
We will import the following libraries:
- pandas, numpy, and sklearn are the must-haves for data science
- xgboost is the Python library that supports the XGBoost model
- The os library is for getting our current working directory
- math can be used for data transformations
- collections for counting
- timeit to time our model training
Load Data
To get the dataset, go ahead and register for Bitgrit’s competition by March 31! Then paste this code into your code cell to load the data:
# Load training datasets
desc_train = pd.read_csv('/path/train_desc_df.csv')
meta_train = pd.read_csv('/path/train_meta_df.csv')
image_train = pd.read_csv('/path/train_image_df.csv')
title_train = pd.read_csv('/path/train_title_df.csv')
# Load public datasets (datasets used for the rankings)
desc_test = pd.read_csv('/path/public_desc_df.csv')
meta_test = pd.read_csv('/path/public_meta_df.csv')
image_test = pd.read_csv('/path/public_image_df.csv')
title_test = pd.read_csv('/path/public_title_df.csv')
print('Dimension of train description data is', desc_train.shape)
print('Dimension of train meta data is', meta_train.shape)
print('Dimension of train image data is', image_train.shape)
print('Dimension of train title data is', title_train.shape)
print('Dimension of test description data is', desc_test.shape)
print('Dimension of test meta data is', meta_test.shape)
print('Dimension of test image data is', image_test.shape)
print('Dimension of test title data is', title_test.shape)
We import our .csv files into train and test datasets, which are conveniently named "train" and "public." Here's what our train data looks like:
We see that all our train data has about 3,000 rows, which is not too huge, but that our image data has 4,001 columns. This could be computationally intensive, so we might have to do some dimensionality reduction later on.
As for our test data, here is the output:
We see that there are 986 rows, meaning we are given metadata, image, description, and title data for 986 videos, and our goal is to predict the number of views for each of them.
Exploratory Data Analysis
Exploratory Data Analysis (or EDA for short) is important to discover trends in your data and figure out what transformations are needed to prepare them for modeling.
Metadata
By calling meta_train.head(), we can get a peek into our metadata dataset.
meta_train.head()
Here, we can see all the features in our metadata for five videos, where views is the column we want to predict (i.e. views are not provided for our meta_test data).
Now let’s do the same for the rest of the dataset.
Image Data
image_train.head()
Here, we see that our image data is normalized within a range from -1 to 1. There’s not much we can do with this data, but it’s good to know what it looks like.
Description Data
desc_train.head()
Each column represents one coordinate in a 50-dimensional space. We can't visualize this data directly, but it's important to know that vectors with similar values mean the original texts were also similar.
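As a rough sanity check of that idea, you could compare two description vectors with cosine similarity; a higher score suggests the underlying texts were more alike. A small sketch, assuming desc_train is the data frame loaded earlier with a comp_id column plus the 50 embedding columns:

# Compare two description vectors; higher cosine similarity suggests the
# original texts were more alike. Assumes desc_train has 'comp_id' plus the
# 50 embedding columns, as loaded above.
from sklearn.metrics.pairwise import cosine_similarity

desc_vectors = desc_train.drop(columns=['comp_id']).to_numpy()
similarity = cosine_similarity(desc_vectors[[0]], desc_vectors[[1]])[0, 0]
print('Cosine similarity between video 0 and video 1:', round(similarity, 3))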
Title Data
title_train.head()
In the title data, we can see a similar pattern.
Missing Values
An important step is to check for missing values, which we can do by typing this into the code cell:
meta_train.isnull().sum()
Phew, no missing values! If we did have them, there are many ways to handle it, but the usual approach is imputation: replacing missing values with the mean, the median, or values from the nearest neighbors.
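Just to show what that would look like, here is a minimal imputation sketch on a made-up data frame with a few gaps (our actual data doesn't need this, and the column names are only for illustration):

# Minimal imputation sketch on a made-up data frame (our actual data has no
# missing values). SimpleImputer fills with a column statistic; KNNImputer
# borrows values from the most similar rows.
from sklearn.impute import SimpleImputer, KNNImputer

toy = pd.DataFrame({'duration': [120, np.nan, 95, 300],
                    'n_likes':  [10, 25, np.nan, 40]})

mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)
print(knn_filled)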
Data Types
meta_train.dtypes
This code shows that most of our metadata are integers. Right now, some of our columns are categorical data, which means we have to convert them into dummy variables with one-hot encoding later on when we get into data preprocessing.
Plotting
It’s time for some plotting! Let’s look at the distribution of our hour
column, which represents when the video was uploaded to the hosting platform.
meta_train['hour'].hist(bins=24)
We see that there's a spike in the hours data at around 5 a.m. Since creators upload from many different time zones, this may not reflect their local time, but it's still useful information.
Next, let's look at the dayofweek distribution, which tells us which day of the week each video was uploaded.
meta_train['dayofweek'].hist(bins=24)
It looks like we have no data for Friday, which is odd. We also see that most uploads happen on Monday and Saturday.
Next up, we’ll look at the distribution of our views
column.
meta_train['views'].hist(bins=20)
Because we have outliers at the upper end of this view data, this plot isn’t very useful. The solution here is to apply a log transformation to our data, like so:
np.log10(meta_train['views']).hist(bins=20)
Aha! This distribution is much more useful. We now can tell that the distribution is right-skewed, with the most frequent views being around 10² (or 100), and the least frequent views around 10⁴ (or 10,000).
Now we’ll look at the distribution of our duration
column.
meta_train['duration'].hist(bins=20)
We see a similar problem with outliers, but this time we can't apply the log transformation directly. Because our duration column has some zero values, taking the log would produce -inf values, so we replace those -inf values with zero before plotting the log version.
duration_log = np.log10(meta_train['duration']).replace(-np.inf, 0)
duration_log.hist(bins=20)
Now we can see that most durations are around 10² seconds (about 100 seconds), which is a little under two minutes.
Value Counts
Let’s speed run through the other values in our metadata.
meta_train['ad_blocked'].value_counts()
Most of the videos are not ad blocked.
meta_train['embed'].value_counts()
And most of the videos are embedded.
Cross-tabulation
pd.crosstab(meta_train.partner, meta_train.partner_active)
By performing cross-tabulation on the active and inactive partners, we can see that some of the videos have inactive partners.
Correlations
# correlation btwn numerical variables
cor_tbl_df = meta_train[['views', 'ratio', 'duration', 'language', 'n_likes', 'n_tags','n_formats', 'dayofweek', 'hour']]
sort_n = cor_tbl_df.corr().sort_values('views', ascending=False).index
cor_tbl_df.corr()[sort_n].iloc[0]
Interesting! With the corr() function in pandas, we can see that n_likes, language, and ratio have high correlations with our views variable.
Now that we’ve done a fair bit of EDA and we better understand our data, let’s do some data preprocessing.
Data Preprocessing
Applying one-hot encoding
One-hot encoding is the process of converting categorical data into a binary vector representation for use in machine learning algorithms. If you want to read more about it, there are plenty of good articles on the topic.
To perform one-hot-encoding, pandas has a nifty function called get_dummies()
for us to convert our categorical variables into dummy variables.
The first parameter is our variable, and the second is prefix, the string prepended to the new data frame column names. For example, if we pass our language column to get_dummies with the prefix "language", we get language_1, language_2, language_3, … language_10; if we set the prefix to "lang", we get lang_1, lang_2, and so on.
embed = pd.get_dummies(meta_train.embed, prefix ='embed')
partner = pd.get_dummies(meta_train.partner, prefix ='partner')
partner_active = pd.get_dummies(meta_train.partner_active, prefix ='partner_a')
language = pd.get_dummies(meta_train['language'], prefix='language')
weekday = pd.get_dummies(meta_train['dayofweek'], prefix='day')
weekday['day_6'] = 0
Note that weekday['day_6'] was set to 0 because the data is missing day 6 (Friday). Before applying one-hot encoding, remember to watch out for missing categories like this!
To get an idea of what each data frame looks like, let’s take a look at the language data frame.
Instead of having one column of integer language codes, we now have 10 columns where each row is a binary indicator of which language it is (e.g. in row 1, the language is language 3).
Cyclical features encoding
Since our data includes an hour column, simple one-hot encoding isn't a great fit: hours are cyclical, so hour 23 and hour 0 are actually neighbors. Instead, we will perform cyclical feature encoding as follows, mapping each hour onto a circle with sine and cosine. This is where a bit of trigonometry comes in handy!
sin_hour = np.sin(2*np.pi*meta_train['hour']/24.0)
sin_hour.name = 'sin_hour'
cos_hour = np.cos(2*np.pi*meta_train['hour']/24.0)
cos_hour.name = 'cos_hour'
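As a quick sanity check of why this encoding is worth the trouble, the sketch below shows that 23:00 and 00:00 land next to each other on the sine/cosine circle, while 12:00 ends up far away, which is exactly the neighborhood structure plain one-hot encoding would throw away:

# Sanity check: on the sine/cosine circle, hour 23 and hour 0 are neighbors,
# while hour 12 sits on the opposite side of the circle.
hours = np.array([23, 0, 12])
points = np.column_stack([np.sin(2 * np.pi * hours / 24.0),
                          np.cos(2 * np.pi * hours / 24.0)])
print('distance 23h to 0h :', np.linalg.norm(points[0] - points[1]).round(3))  # small
print('distance 23h to 12h:', np.linalg.norm(points[0] - points[2]).round(3))  # large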
After the transformations, we can now join all of the transformed columns into one data frame using the concat method.
# Join all dataframes.
meta_final_df = pd.concat([meta_train[['comp_id', 'views', 'ratio', 'language', 'n_likes', 'duration']].reset_index(drop=True),
embed, partner, partner_active, language, weekday, sin_hour, cos_hour], axis=1)
meta_final_df.head()
meta_final_df.shape
Checking the shape attribute, our final meta_final_df now has a whopping 31 columns!
Lasso regression
As mentioned earlier, our image data has 4,000 columns, which can slow down our model significantly. To avoid this, we will be using lasso regression.
The main idea of lasso is to find the set of predictors that minimizes prediction error for a quantitative target variable by imposing a constraint on the model parameters that causes some coefficients to shrink to exactly zero. That last part is important because it lets us drop those features and reduce the dimensionality of our image data.
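To see that shrink-to-zero behavior in isolation, here is a small illustration using sklearn's Lasso on synthetic data where only 5 of 100 features actually matter. This is a conceptual sketch only; the competition code below wraps an L1-penalized LogisticRegression in SelectFromModel instead.

# Conceptual illustration: lasso drives most coefficients of uninformative
# features to exactly zero. Synthetic data, not the competition data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X_toy, y_toy = make_regression(n_samples=200, n_features=100,
                               n_informative=5, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0), 'out of', len(lasso.coef_))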
First, let’s set our views
column as our target variable and consider all columns except for comp_id
(because it isn’t an image pixel) as features.
# Set the target as well as dependent variables from image data.
y = meta_train['views']
x = image_train.loc[:, image_train.columns != 'comp_id'] #ignore comp_id variable
# Run Lasso regression for feature selection.
sel_model = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
# time the model fitting
start = timeit.default_timer()
# Fit the trained model on our data
sel_model.fit(x, y)
stop = timeit.default_timer()
print('Time: ', stop - start)
We then build our feature selector with sklearn's SelectFromModel, wrapping a LogisticRegression estimator with the following arguments:
- C=1 → the inverse of regularization strength, where smaller values specify stronger regularization
- penalty='l1' → choose L1 regularization, which is LASSO
- solver='liblinear' → the algorithm to use for the optimization problem
Read more about sklearn's logistic regression in the official documentation.
Next, we fit the model to our x and y values.
# get index of good features
sel_index = sel_model.get_support()
# count the no of columns selected
counter = collections.Counter(sel_model.get_support())
counter
After it trains, we can use the get_support() function, which returns a boolean mask indicating which features were selected.
Output: Counter({False: 2742, True: 1258})
With the collections library, we have a total of 1,258 columns left — down from 4,000!
# Reconstruct the image dataframe using the index information above.
image_index_df = pd.DataFrame(x[x.columns[(sel_index)]])
image_final_df = pd.concat([image_train[['comp_id']], image_index_df], axis=1)
image_final_df.head()
With the indexes of the important features, we can subset our original image data and then concatenate the comp_id column back in using the axis=1 argument, which means joining by column. Here's what our data frame looks like now:
Merge everything into one data frame
Now that we’ve performed the necessary transformation on our data, it’s time to merge all of the separate datasets into one so we can use it for our machine learning model!
# Merge all tables based on the column 'comp_id'
final_df = pd.merge(pd.merge(meta_final_df, image_final_df, on = 'comp_id'),
pd.merge(desc_train, title_train, on = 'comp_id'), on = 'comp_id')
final_df.shape # (3000, 1389)
When merging data frames, it helps to have a shared key column so they can be joined with the on argument. In our case, the comp_id column is present in all of our datasets.
Our final data frame has 3,000 rows and 1,389 columns, most of which are from our image data.
Preprocessing on Public/Test Data
Whenever we apply transformation to our training data, we have to do the same with our public data, so let’s do that now using this code:
# Test set
p_embed = pd.get_dummies(meta_test.embed, prefix ='embed')
p_partner = pd.get_dummies(meta_test.partner, prefix ='partner')
p_partner_active = pd.get_dummies(meta_test.partner_active, prefix ='partner_a')
p_language = pd.get_dummies(meta_test['language'], prefix='language')
p_language['language_6'] = 0
p_weekday = pd.get_dummies(meta_test['dayofweek'], prefix='day')
p_weekday['day_3'] = 0
p_weekday['day_4'] = 0
p_weekday['day_5'] = 0
## Cyclical encoding
p_sin_hour = np.sin(2*np.pi*meta_test['hour']/24.0)
p_sin_hour.name = 'sin_hour'
p_cos_hour = np.cos(2*np.pi*meta_test['hour']/24.0)
p_cos_hour.name = 'cos_hour'
# Join all dataframes.
p_meta_final_df = pd.concat([meta_test[['comp_id', 'ratio', 'language', 'n_likes', 'duration']].reset_index(drop=True),
p_embed, p_partner, p_partner_active, p_language, p_weekday, p_sin_hour, p_cos_hour], axis=1)
p_meta_final_df.head()
# Subset our test image dataframe with the columns selected on the training set
p_image_index_df = image_test[image_index_df.columns]
p_image_final_df = pd.concat([image_test[['comp_id']], p_image_index_df], axis=1)
# Merge all test set tables.
p_final_df = pd.merge(pd.merge(p_meta_final_df, p_image_final_df, on = 'comp_id'),
pd.merge(desc_test, title_test, on = 'comp_id'), on = 'comp_id')
p_final_df.shape
This transformation is the same as what we used on our training data, so it should be self-explanatory.
After merging all our public set data frames, we get a dimension of (986, 1388).
Building the XGBoost Model
Now for the fun part — it’s time to start building our machine learning model!
The standard way of training models is to split your train set into a new training set and a validation set, which lets us check how well the model generalizes before touching the final test data.
Overfitting happens when a model picks up the noise in the training data along with the signal, which causes it to generalize poorly when applied to new data.
We train our model on the training portion and then evaluate it on the validation portion to get a rough idea of how well it is performing, fine-tuning parameters or doing more data cleaning as needed.
When we are satisfied with our model accuracy, we apply it on our final testing data and submit it to the data science competition.
Train_test_split
Let’s go ahead and do that with our final_df
train set:
# Convert dataframe to numpy array.
X = final_df.drop(['comp_id', 'views'], axis=1).to_numpy()
y = final_df.loc[:, 'views'].to_numpy()
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 121)
print('Training set shape ', X_train.shape)
print('Test set shape ', X_test.shape)
X in the code above holds our features, and y is our target variable (views). We convert the data frames into numpy arrays so they can later be packed into the data matrix used for XGBoost model training.
Next, we split the numpy arrays using the train_test_split() function, where test_size is the fraction of data held out for testing (25% in this case) and random_state is a seed (any number) that makes the split reproducible.
Training set shape (2250, 1386)
Test set shape (750, 1386)
Here, we can see that we have a 0.75 train and 0.25 test split.
Data matrix
Now we build a data matrix, which is an internal data structure used in XGBoost models for efficiency.
trlabel = y_train
telabel = y_test
dtrain = xgb.DMatrix(X_train, label=trlabel)
dtest = xgb.DMatrix(X_test, label=telabel)
To create a DMatrix, we pass in our feature arrays (X_train and X_test) and tell the function that our labels are the target values in y_train and y_test (the views column).
Setting Parameters
Next, we define the parameters for our XGBoost model as a dictionary.
# Set parameters.
param = {'max_depth': 7,
'eta': 0.2,
'objective': 'reg:squarederror',
'nthread': 5,
'eval_metric': 'rmse'
}
evallist = [(dtest, 'eval'), (dtrain, 'train')]
- max_depth = maximum depth of each decision tree
- eta = the learning rate for our model
- objective = the learning objective; reg:squarederror is the standard choice for regression
- nthread = number of parallel threads used to run XGBoost
- eval_metric = the evaluation metric for the validation data
The evallist variable specifies which datasets to evaluate at each boosting round and the names ('eval' and 'train') that appear in the training log.
To find more parameters for XGBoost models, head over to the official docs.
Train the model
# Train the model.
num_round = 70
bst = xgb.train(param, dtrain, num_round, evallist)
We set the number of rounds/trees for the model to 70, which means we will sequentially build 70 decision trees, each one correcting the errors of the ensemble so far, and track progress with our evaluation metric, root mean squared error (RMSE).
To train our model, we pass the parameters, the training set, the number of rounds, and our evaluation list to the xgb.train function.
You should be seeing something like this when the model is training. Notice how the RMSE decreases each round? That's gradient boosting in action!
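If you'd rather not hand-pick the number of rounds, xgb.train also supports early stopping against the evaluation list. A small optional variation (not part of the original walkthrough) that watches the held-out set and stops once its RMSE stops improving:

# Optional variation: stop adding trees once the held-out RMSE hasn't improved
# for 20 rounds. Early stopping watches the last dataset in evals, so the
# held-out set is listed last here.
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
bst_early = xgb.train(param, dtrain, num_boost_round=500,
                      evals=watchlist, early_stopping_rounds=20)
print('best iteration:', bst_early.best_iteration)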
Out-of-sample Error
Next, we make predictions on the test data based on our trained XGBoost model and compute the RMSE.
# Make prediction.
ypred = bst.predict(dtest).round()
# Compute RMSE on test set.
mse_xgboost = mean_squared_error(y_test, ypred)
rmse_xgboost = math.sqrt(mse_xgboost)
print('RMSE with XGBoost', rmse_xgboost)
Output: RMSE with XGBoost 1133.909102177066
Fit model to public data (test data)
d_public = xgb.DMatrix(p_final_df.loc[:, p_final_df.columns != 'comp_id'][bst.feature_names])
solution = bst.predict(d_public).round()
solution_df = pd.concat([p_final_df[['comp_id']], pd.DataFrame(solution, columns = ['views'])], axis=1)
solution_df.to_csv('solution.csv', index=False)
We create a similar data matrix for the public data, dropping comp_id and subsetting to the features used to train our model.
Next, we make predictions on the public dataset. Our solution variable is an array of the view counts predicted by our XGBoost model.
The final step is to build our data frame by concatenating our comp_id and the views we predicted. The end product is something like this:
And with this, we’re ready to publish the submission file and upload it to the Bitgrit competition!
Tips for improving model
After following this guide, you now have a working model that is making predictions. But say that you want to improve its accuracy. For that, I recommend the following next steps:
1. Hyper-parameter Tuning
When training models, tuning the parameters is essential to improve your model performance. In competitions, even the slightest improvements in accuracy (0.001) can be what you need to climb up the leaderboard.
Here are two awesome articles about tuning XGBoost models:
- Fine-tuning XGBoost in Python like a boss by Félix Revert
- How to Tune the Number and Size of Decision Trees with XGBoost in Python by Jason Brownlee
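As a rough starting point before reading those, the sketch below uses XGBoost's built-in cross-validation to compare a few max_depth and eta combinations. It assumes the param dictionary and dtrain DMatrix from earlier, and the grid values are arbitrary placeholders rather than recommendations.

# Rough tuning sketch: compare a few max_depth / eta combinations with
# XGBoost's built-in cross-validation and keep the best held-out RMSE.
results = []
for max_depth in [4, 7, 10]:
    for eta in [0.05, 0.1, 0.2]:
        trial = {**param, 'max_depth': max_depth, 'eta': eta}
        cv = xgb.cv(trial, dtrain, num_boost_round=200, nfold=5,
                    metrics='rmse', early_stopping_rounds=20, seed=121)
        results.append((max_depth, eta, cv['test-rmse-mean'].min()))

for max_depth, eta, rmse in sorted(results, key=lambda r: r[2]):
    print(f'max_depth={max_depth}, eta={eta}, cv RMSE={rmse:.1f}')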
2. Try other algorithms
There are tons of other algorithms that you can try out besides XGBoost. A couple of popular alternatives are CatBoost and LightGBM.
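To give a taste of what swapping libraries looks like, here is a hedged sketch that feeds the same train/validation split to LightGBM's scikit-learn interface. It assumes the lightgbm package is installed, and the parameter values are placeholders rather than tuned choices.

# Hedged sketch: LightGBM on the same split. Assumes `pip install lightgbm`;
# the parameters below are placeholders, not tuned for this competition.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(n_estimators=200, learning_rate=0.1, max_depth=7)
lgbm.fit(X_train, y_train)
lgbm_rmse = math.sqrt(mean_squared_error(y_test, lgbm.predict(X_test)))
print('RMSE with LightGBM', lgbm_rmse)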
You can also try out deep learning techniques using the popular libraries TensorFlow and PyTorch.
If you want to implement this XGBoost model yourself, check out bitgrit’s online AI competitions. These competitions are a great way to get hands-on with data and experiment with different machine learning algorithms, so don’t miss out on this challenge!