
Feature Engineering 101

DATA SCIENCE

Common Feature Preprocessing and Generation Techniques in Data Science

“Cosmic Cliffs” in the Carina Nebula (NIRCam Image) IMAGE: NASA, ESA, CSA, STScI

You can't just take the features in a dataset, fit your favorite model, and expect great results.

Real-world data is noisy, redundant, and often unreliable.

Data Preprocessing

This makes data preprocessing, which involves feature preprocessing and generation, a crucial step in the machine learning pipeline.

Each feature in a dataset calls for its own preprocessing, depending on its data type and the model being used.

Different Features and Models

For example, a feature for user reviews would be processed differently from one for the number of sales. One is text data, while the other is numerical.

A target with a nonlinear dependence on a feature is a poor fit for linear models but works fine for a tree-based model, e.g., a random forest.

Proper preprocessing can improve the quality of your dataset, which can immensely affect model performance.

In this article, I will share the techniques for feature preprocessing and generation for different types of features.

Namely:

  1. Numerical
  2. Categorical
  3. DateTime & coordinates
  4. Text
  5. Images

The article will also highlight the differences between tree-based and non-tree-based models.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Let’s dive in.


Numeric Features

Preprocessing

1. Scaling

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Why: Once scaled, the impact of each feature on non-tree-based models will be roughly similar.

Tree-based models do not depend on scaling, since it doesn't change how they split on the variables.

Non-tree-based models depend hugely on scaling:

  • KNN — distances are dominated by the features with the largest scales
  • Linear models — the impact of regularization is proportional to feature scale
  • Neural Networks — gradient descent can behave erratically without proper feature scaling

What: Converts all features to the same scale

Methods: MinMaxScaler or StandardScaler

2. Winsorization


Why: Outliers will cause linear models to predict abnormal values.

What: Clip features between a chosen lower bound and upper bound, e.g., the 1st and 99th percentiles

Methods: np.percentile & np.clip

3. Rank Transformation

Why: It's a better option than MinMaxScaler in the presence of outliers, and it saves you from handling outliers manually

What: Sets the spacing between sorted values to be equal, i.e., values are replaced by their ranks

💡 To apply the same transformation to test data, store the mapping from feature values to ranks, OR concatenate train and test before applying the rank transformation (see the code below)

Methods: scipy.stats.rankdata

4. Log transform and raise to power < 1

Why: Drives big values closer to the feature's average, and makes values near zero more distinguishable

What: Log transformation and raising to a power less than one (e.g., square root)

Methods: np.log() & np.sqrt()

💡 (1) Train a model on concatenated data frames produced by different preprocessings. (2) Mix models trained on differently preprocessed data.

Feature Generation

> Creating new features using knowledge about the feature and the task. We can engineer these features using prior knowledge and logic.

Examples:

  • area (in m²) & price → price per m² (unit price)
  • horizontal distance & vertical distance → Hypotenuse distance (Pythagoras Theorem)
  • price → fractional_part, ex: 2.49, 0.99, 7.07 → 0.49, 0.99, 0.07

💡 Operations +, -, *, / don’t only benefit linear models. GBMs sometimes have issues approximating divisions and multiplications, and adding them explicitly can help reduce the number of trees

So, the key to feature generation:

Creativity + data understanding

Code

import pandas as pd
import numpy as np

# scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X_standard = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

# Winsorization for outliers
LB, UB = np.percentile(X, [1, 99]) # 1st and 99th percentiles
X_clipped = np.clip(X, LB, UB)

# rank transformation
from scipy.stats import rankdata
rankdata([-100, 0, 1e5]) # [1., 2., 3.]
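# applying train-time ranks to test data (sketch of the tip above);
# train_values and test_values are hypothetical arrays
train_sorted = np.sort(train_values)
test_ranks = np.searchsorted(train_sorted, test_values)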
 
# log transform
np.log(1 + X)

# raising to power
np.sqrt(X + 2/3)
np.power(X + 2/3, 1/2) # same as above

# feature generation
df['unit_price'] = df['price'] / df['area'] # price per unit area
df['decimal'] = (np.modf(df['dollars'])[0] * 100).astype(int) # fractional part, e.g., 2.49 -> 49
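# hypotenuse distance from horizontal/vertical distances (Pythagoras);
# 'dx' and 'dy' are hypothetical columns
df['distance'] = np.sqrt(df['dx']**2 + df['dy']**2)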

Categorical & Ordinal Features

Quick Refresher:

Categorical: no intrinsic ordering, ex: yes or no, colors, type of car
Ordinal: has a clear & meaningful ordering, ex: low, medium, high

Tree-based models can split on categorical features and extract most of the useful information from the categories on their own.

Non-tree-based models, on the other hand, cannot, and you have to do some extra work for them to use the features effectively.

Below are the options.

1. Label Encoding

What: map categories to numbers

Methods:

  1. Alphabetical: sklearn.preprocessing.LabelEncoder
  2. Order of appearance: pandas.factorize

2. Frequency encoding

What: map categories to frequencies

Pros:

  • Preserves information about the value distribution
  • Tree-based models: fewer splits needed
  • Non-tree-based models: if the frequency is correlated with the target, a linear model will utilize this dependency

Cons: If multiple categories share the same frequency, they won't be distinguishable. (Solution: Use a rank transformation to deal with such ties.)

Methods: groupby, size, len, map

3. One-hot encoding

What: Form new columns for each unique value in categorical columns

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Pros: Features are already scaled

Cons: For categorical features with many unique values, it can slow down tree-based methods (Solution: Use sparse matrices)

Methods:

  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder

Feature Generation

One of the most useful examples is creating interactions between categorical features, which mostly helps non-tree-based models.

Here’s an example below.

The features pclass and sex are concatenated into one column — pclass_sex

Week 1 of How to Win a Data Science Competition: Learn from Top Kagglers

This way, a linear model can find the optimal coefficient for every interaction of the features.

Takeaways

  • Label & Frequency encodings are often used for tree-based models
  • One-hot encoding is often used for non-tree-based models, and for these models, the interaction of categorical features is useful.

Code

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

## label encoding
lab = LabelEncoder()
df['column'] = lab.fit_transform(df['column'])


## frequency encoding
encoding = df.groupby('column').size() / len(df)
df['encoding'] = df['column'].map(encoding)


## one-hot encoding
df = pd.get_dummies(df, columns = ['col1', 'col2'])

## feature generation: concatenate two categorical columns (assumes string dtype)
df['col1_col2'] = df['col1'] + df['col2']

Datetime and coordinates

Datetime features have many levels of granularity and can be divided into two main categories: time moments in a period and time since a particular event.

1. Periodicity

Time moments in a period include the day number within the week, month, season, or year, as well as the second, minute, and hour.

Extracting these features is useful for capturing repetitive patterns in the data.

2. Time since

  1. Row-independent moment, e.g., time since 1 January 1970
  2. Row-dependent important moment, e.g., the number of days left until the next holiday

Below is a real-world example of time moments and time since.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

3. Difference between dates

Below is an example of a date difference for a churn prediction case.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

From these methods, you'll end up with numerical features (e.g., time passed since 2020) and categorical features like the day of the week. You'll have to apply the preprocessing steps mentioned previously.

Coordinates

Let’s say you’re predicting house rental prices and have access to geographical data. Here’s what you can do.

1. Distances

Source: How to Win a Data Science Competition: Learn from Top Kagglers

If you have access to such data, you can calculate the distance to important points on the map — hospitals, attractions, best schools in the neighborhood, etc.

If not, you could also organize the data points into clusters and then use the cluster centers as the important points.

2. Aggregated statistics

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Take an area around a particular point and calculate statistics within it (see the sketch at the end of the code below).

For example, the number of flats or mean rental price around the area.

Code

# pip install holidays
import holidays
import datetime
us_holidays = holidays.US()

# ex: 2016-01-02 01:00:00
df['weekday'] = df.Date.dt.weekday
df['year'] = df.Date.dt.year
df['quarter'] = df.Date.dt.quarter
df['weekofyear'] = df.Date.dt.isocalendar().week # replaces the deprecated .weekofyear
df['dayofweek'] = df.Date.dt.dayofweek
df['dayofweek_name'] = df.Date.dt.day_name() # replaces the deprecated .weekday_name
df['month'] = df.Date.dt.month
df['day'] = df.Date.dt.day
df['hour'] = df.Date.dt.hour
df['second'] = df.Date.dt.second
df['minute'] = df.Date.dt.minute

df["is_holiday"] = df.Date.dt.floor('d').isin(us_holidays)
df['is_weekend'] = np.where(df['dayofweek_name'].isin(['Sunday','Saturday']), 1, 0)

# time since today
df['time_since'] = datetime.datetime.today() - df.Date

# difference between two dates
df['diff_time'] = (df['date_1'] - df['date_2'])
df['diff_days'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'D')
df['diff_weeks'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'W')
df['diff_months'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'M')
df['diff_years'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'Y')
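For the coordinate ideas above, here is a rough sketch; the column names, reference point, and cluster count are all hypothetical:

from sklearn.cluster import KMeans

# rough Euclidean distance to an important point (ignores Earth curvature)
center_lat, center_lon = 40.75, -73.99
df['dist_center'] = np.sqrt((df['lat'] - center_lat)**2 + (df['lon'] - center_lon)**2)

# when no landmarks are available, use cluster centers as the important points
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(df[['lat', 'lon']])
df['cluster'] = kmeans.labels_

# aggregated statistics around a point (per cluster here)
df['cluster_mean_price'] = df.groupby('cluster')['price'].transform('mean')
df['cluster_n_flats'] = df.groupby('cluster')['price'].transform('size')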

Missing Values

Missing values can come in many forms: empty strings, NaNs, or placeholder values such as -999.

Sometimes they carry useful information in themselves, such as the reason the value is missing at that point in the data.

There are four types of missing data, and figuring out which one you're dealing with should motivate your imputation strategy.

  1. Structurally Missing Data
  2. MCAR (Missing Completely at Random)
  3. MAR (Missing at Random)
  4. NMAR (Not Missing at Random)

Hidden NaNs

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Sometimes missing values can be hidden, and we must plot the data to find them.

For example, in the left plot above, it’s clear that -1 is a missing value as the data distribution is between 0 and 1.

In the right plot, the peak tells us that it has been filled with the mean value.
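A quick way to surface such hidden NaNs is to plot the feature's distribution; a minimal sketch (the column name is hypothetical):

import matplotlib.pyplot as plt

# a spike at an out-of-range value (e.g., -1) or exactly at the mean
# hints at missing values that were filled in
df['feature'].hist(bins=100)
plt.show()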

Imputation approaches

Here are some approaches to filling missing values.

1. replace with -999, -1, etc

Pros: Gives tree models the possibility of putting missing values into a separate category
Cons: Performance of non-tree-based models can suffer

2. mean, median

Pros: Beneficial for non-tree-based models
Cons: Bad for tree-based models, as it becomes hard to single out the objects that had missing values in the first place

3. Isnull feature

Pros: Solves the problem for both tree-based and non-tree-based models
Cons: Doubles the number of columns

4. reconstruct the value

When there is a logical order, the value is easy to reconstruct, e.g., time series data can be imputed from nearby values.

💡 Avoid filling NaNs before feature generation, as it can make the generated features less useful.

More on working with missing data.

Code

import numpy as np

# replace with nan
df.replace({"-":np.nan, "?":np.nan}, inplace=True)

# replace with -999
df["column"].fillna(-999, inplace = True)

# replace with mean/median
df["age"].fillna(df["age"].mean(), inplace=True)
df["age"].fillna(df["age"].median(), inplace=True)

# other methods
df.ffill() # forward fill (fillna(method='pad') is deprecated)
df.bfill() # backward fill
df.interpolate(method='linear') # linear interpolation
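The isnull-indicator idea from approach 3 can be sketched like this; 'age' is just an example column:

# binary flag marking which rows were originally missing
df['age_isnull'] = df['age'].isnull().astype(int)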

Text -> Vectors

Text data often appears alongside numerical and categorical features; here's how to make use of it.

First, the text has to be cleaned before it is turned into numbers.

Text preprocessing

1. lowercase

ex: “HI THERE” → “hi there”

2. lemmatization and stemming:

Used to reduce inflectional forms and sometimes derivationally related word forms to a common base form.

ex: democracy, democratic, democratization →
Lemmatization: democracy
Stemming: democr

3. stopwords removal

Stopwords are words that do not carry important information for our model.

They are typically very common words, such as articles and prepositions.

ex: I, me, my, myself, we, our, ours
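As a rough sketch of these cleaning steps with NLTK (assuming the punkt, stopwords, and wordnet resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

text = "HI THERE, we are discussing democratization"

# lowercase + tokenize
tokens = nltk.word_tokenize(text.lower())

# stopword removal
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# stemming and lemmatization
stemmed = [PorterStemmer().stem(t) for t in tokens]
lemmatized = [WordNetLemmatizer().lemmatize(t) for t in tokens]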

Once the text is preprocessed, it's time to turn it into vectors.

Bag of words

What: Counts the number of occurrences of unique terms

Source: How to Win a Data Science Competition: Learn from Top Kagglers

As seen in the example above, each document/text is represented by its counts over the unique words.

Methods: sklearn.feature_extraction.text.CountVectorizer

Non-tree-based models depend on the scaling of features, so post-processing the count matrix can make samples more comparable and boost important features while shrinking the scale of useless ones.

A popular method is TFiDF.

TFiDF

Term Frequency — normalizes the sum of values in a row, counting frequencies and not occurrences of words. This makes the features more comparable.

As seen above, the occurrences are replaced by frequencies, and the values in each row sum to one.

Inverse document frequency — normalizes column-wise by the inverse fraction of documents containing each word. This means frequent words are scaled down compared to rarer words, boosting more informative features.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

As you can see, the word “excited” appears in all documents, so the tfidf transformation scaled down the importance of that feature.

Methods: sklearn.feature_extraction.text.TfidfVectorizer

N-grams

An n-gram is a contiguous sequence of n words.

The idea is to use sequences of n consecutive words as features instead of single, unique terms.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

Building n-gram features helps the model exploit the local context around each word.

N-gram features are also typically sparse, as they deal with counts of word occurrences, and not every word can be found in a document.

For example, if we count occurrences of words from an English dictionary in our everyday speech, many words won't appear at all; that is sparsity.

Word2vec

Word2Vec, a popular method of creating word embeddings, has many applications such as text similarity, classification, sentiment analysis, etc.

Word embeddings provide an efficient and dense representation of words in which similar words have a similar encoding.

Below is a popular example.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

We have the words “King”, “Man”, “Queen”, and “Woman”.

Intuitively, we can infer that one thing these words have in common is gender. With embeddings, Word2Vec can capture this hidden relationship between the words, as shown below.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

BOW vs. Word2vec

BOW

  • very large vectors
  • meaning of each value in the vector is known
  • no. of features usually equal to the number of unique words

Word2Vec

  • Relatively small vectors
  • values in the vector are interpretable only in some cases
  • words with similar meanings often have similar embeddings (closer to each other)
  • number of features restricted to a constant

Code

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# bag of words
# Transforms text into a sparse matrix of n-gram counts.
vect = CountVectorizer()
word_counts = vect.fit_transform(corpus)

# tfidf
# Transform a count matrix to a normalized tf or tf-idf representation.
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(word_counts)
tfidf

# BOW + TFIDF
# Convert a collection of raw documents to a matrix of TF-IDF features
vect = TfidfVectorizer()
X = vect.fit_transform(corpus)

# ngrams
from nltk import ngrams
sentence = "this is a sentence"
bigrams = ngrams(sentence.split(), 2)
trigrams = ngrams(sentence.split(), 3)
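# n-grams can also be generated directly by CountVectorizer;
# ngram_range=(1, 2) keeps both unigrams and bigrams
vect_bigram = CountVectorizer(ngram_range=(1, 2))
bigram_counts = vect_bigram.fit_transform(corpus)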

# word2vec
import gensim
from gensim.models import Word2Vec

# CBOW approach by default
model = gensim.models.Word2Vec(corpus, min_count = 1,
                              vector_size = 100, window = 5, sg=0)

# skipgram
model = gensim.models.Word2Vec(corpus, min_count = 1,
                              vector_size = 100, window = 5, sg=1)
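# using the trained embeddings (a sketch; use any word present in the corpus)
vector = model.wv['word'] # 100-dimensional vector, per vector_size above
similar = model.wv.most_similar('word', topn=5) # nearest words by cosine similarity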

Images -> vectors

Convolutional neural networks are the go-to architecture for image, speech, and audio inputs.

Like word2vec for text, they can give a compressed representation of images.

Descriptors

In neural networks, we have inputs and outputs. The output for the last layer gives us the predictions that we want. However, we also have outputs from the neural network’s inner (hidden) layers. We call these outputs descriptors.

Descriptors from later layers in the neural network are useful for solving tasks similar to the one the network was trained on.

On the other hand, descriptors from early layers have more task-independent information.

For example, if your network was trained on image datasets such as ImageNet, you can use its last layer representation in some object classification tasks.

But if you want to use your network to classify medical notes, it is better to use an earlier layer or train the network from scratch.

The image below shows the VGG-16 architecture, which was trained on the 1,000 ImageNet classes. You can tell by the output size at the very end: 1x1x1000.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

This network's descriptors have learned information about a specific task. If we want to apply the network to a new, smaller dataset, we can do so by slightly adjusting the network, a process called fine-tuning.

Fine-tuning

By using a pre-trained model, we can use the knowledge already encoded in the network parameters, which can lead to better results and a faster retraining procedure.

Source: How to Win a Data Science Competition: Learn from Top Kagglers

For the VGG-16 network, fine-tuning was done by removing the last layer and replacing it with one of size four. The learning rate is also decreased to roughly 1/1000 of the initial learning rate.
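As an illustration (not the course's exact setup), here is a minimal PyTorch/torchvision sketch of that idea; the 4-class head and the small learning rate mirror the description above:

import torch
import torch.nn as nn
from torchvision import models

# load VGG-16 pre-trained on ImageNet (weights API requires torchvision >= 0.13)
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# replace the final 1000-way classification layer with a 4-class head
model.classifier[6] = nn.Linear(in_features=4096, out_features=4)

# fine-tune with a much smaller learning rate than training from scratch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()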

Image augmentation

A lot of data is needed to build a good model, and often the data you have isn't enough.

One way to get more is to perform image augmentation.

Image augmentation is a process of creating new training examples from existing ones.

For example, you could augment an image by making it a little brighter, cropping it, mirroring it, changing the contrast, and so on.

Below is an example of image augmentation. These six transformations alone can increase the size of a dataset by six times.

Source: https://albumentations.ai/docs/introduction/image_augmentation/

Another benefit of having more data is that it helps avoid overfitting, allowing you to train more robust models with better results.
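Here is a small sketch using the albumentations library referenced above; the specific transforms, probabilities, and image path are just examples:

import albumentations as A
import cv2

# a pipeline of random flips, brightness/contrast changes, and crops
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.RandomCrop(width=224, height=224),
])

image = cv2.imread("image.jpg") # hypothetical input image
augmented = transform(image=image)["image"]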


Thanks for reading!


Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!

