DATA SCIENCE
Common Feature Preprocessing and Generation Techniques in Data Science
You can’t just take the features in a dataset, fit your favorite model, and expect great results.
Real-world data is noisy, redundant, and often unreliable.
Data Preprocessing
That’s why data preprocessing is a crucial step in the machine learning pipeline — and a large part of it involves feature preprocessing and generation.
Each feature in a dataset is preprocessed differently depending on its data type and the model being used.
Different Features and Models
For example, a feature holding user reviews would be processed differently than one holding the number of sales. One is text data, while the other is numerical.
A target with a nonlinear dependence on a feature is a poor fit for linear models but works fine with a tree-based model, e.g., a random forest.
Proper preprocessing can improve the quality of your dataset, which can immensely affect model performance.
In this article, I will share the techniques for feature preprocessing and generation for different types of features.
Namely:
- Numerical
- Categorical
- DateTime & coordinates
- Text
- Images
The article will also highlight the differences between tree-based and non-tree-based models.
Let’s dive in.
Code for this article 👇
Numeric Features
Preprocessing
1. Scaling
Why: Once scaled, all features have a roughly comparable impact on non-tree-based models.
Tree-based models do not depend on scaling, as it doesn’t affect how they split on the variables.
Non-tree-based models, however, depend heavily on scaling:
- KNN — distances are dominated by the features with the largest scales.
- Linear models — regularization impact is proportional to feature scale.
- Neural networks — rely on gradient descent and can behave erratically without proper feature scaling.
What: Converts all features to the same scale
Methods: MinMaxScaler or StandardScaler
2. Winsorization
Why: Outliers will cause linear models to predict abnormal values.
What: Clip feature values to a chosen lower and upper bound, e.g., the 1st and 99th percentiles
Methods: np.percentile & np.clip
3. Rank Transformation
Why: It’s a better option than MinMaxScaler in the presence of outliers, and it saves you from handling outliers manually
What: Sets the spacing between sorted values to be equal, i.e., replaces each value with its rank
💡 To apply it to test data, store the mapping from feature values to their ranks, OR concatenate train and test before applying the rank transformation
Methods: scipy.stats.rankdata
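Here’s a minimal sketch of the tip above — learning the value-to-rank mapping on train and reusing it on test (the train/test frames and the feature column name are hypothetical):
import pandas as pd
from scipy.stats import rankdata
# store the value -> rank mapping learned on the train set
rank_map = pd.Series(rankdata(train['feature']), index=train['feature'].values)
rank_map = rank_map.groupby(level=0).mean()  # average rank for duplicate values
train['feature_rank'] = train['feature'].map(rank_map)
test['feature_rank'] = test['feature'].map(rank_map)  # unseen values become NaN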
4. Log transform and raise to power < 1
Why: Pulls too-big values closer to the feature’s average, and makes values near zero more distinguishable
What: Log transformation and raising to a power below one
Methods: np.log() & np.sqrt()
💡 (1) Train a model on concatenated data frames produced by different preprocessings, or (2) mix models trained on differently preprocessed data
Feature Generation
> Creating new features using knowledge about the feature and the task. We can engineer these features using prior knowledge and logic.
Examples:
- squared area & price → price/1m² (unit price)
- horizontal distance & vertical distance → Hypotenuse distance (Pythagoras Theorem)
- price → fractional part, e.g., 2.49, 0.99, 7.07 → 0.49, 0.99, 0.07
💡 Operations such as +, -, *, / don’t only benefit linear models. GBMs sometimes struggle to approximate divisions and multiplications, and adding them explicitly can help reduce the number of trees needed
So, the key to feature generation:
Creativity + data understanding
Code
import pandas as pd
import numpy as np
# scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X_standard = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)
# Winsorization for outliers
LB, UB = np.percentile(X, [1, 99]) # 1st and 99th percentiles
X_clipped = np.clip(X, LB, UB)
# rank transformation
from scipy.stats import rankdata
rankdata([-100, 0, 1e5]) # [1., 2., 3.]
# log transform
np.log(1 + X)
# raising to power
np.sqrt(X + 2/3)
np.power(X + 2/3, 1/2) # same as above
# feature generation
df['unit_price'] = df['price'] / df['area']
df['decimal'] = (np.modf(df['dollars'])[0] * 100).astype(int)
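One more line for the hypotenuse example above, assuming hypothetical horizontal/vertical distance columns:
df['distance'] = np.hypot(df['horizontal_dist'], df['vertical_dist'])  # sqrt(h**2 + v**2)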
Categorical & Ordinal Features
Quick Refresher:
Categorical: no intrinsic ordering, ex: yes or no, colors, type of car
Ordinal: has a clear & meaningful ordering, ex: low, medium, high
Tree-based models can split on categorical features and extract most of the useful information from categories on their own.
Non-tree-based models, on the other hand, cannot, and you have to do some extra work for them to use the features effectively.
Below are the options.
1. Label Encoding
What: map categories to numbers
Methods:
- Alphabetical: sklearn.preprocessing.LabelEncoder
- Order of appearance: pandas.factorize
2. Frequency encoding
What: map categories to frequencies
Pros:
- Preserves information about the value distribution
- Tree-based models: fewer splits needed
- Non-tree-based models: If the frequency is correlated with the target value, the linear model will utilize this dependency
Cons: If you have multiple categories with the same frequency, they won’t be distinguishable. (Solution: Use rank transformation to deal with such ties.)
Methods: groupby, size, len, map
3. One-hot encoding
What: Form new columns for each unique value in categorical columns
Pros: Features are already scaled
Cons: For categorical features with many unique values, it can slow down tree-based methods (Solution: Use sparse matrices — see the sketch below)
Methods: pandas.get_dummies, sklearn.preprocessing.OneHotEncoder
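For high-cardinality features, a sparse representation keeps memory and training time under control. A minimal sketch (using the same hypothetical col1/col2 columns as the code below):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')       # returns a scipy sparse matrix by default
X_sparse = ohe.fit_transform(df[['col1', 'col2']])
df_sparse = pd.get_dummies(df, columns=['col1', 'col2'], sparse=True)  # sparse dummy columns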
Feature Generation
One of the most useful examples is feature interactions between categorical features. And it’s mostly useful for non-tree-based models.
Here’s an example below.
The features pclass and sex are concatenated into one column — pclass_sex.
This way, a linear model can find the optimal coefficient for every interaction of the features.
Takeaways
- Label & Frequency encodings are often used for tree-based models
- One-hot encoding is often used for non-tree-based models, and for these models, the interaction of categorical features is useful.
Code
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
## label encoding
lab = LabelEncoder()
df['column'] = lab.fit_transform(df['column'])
## frequency encoding
encoding = df.groupby('column').size() / len(df)
df['encoding'] = df['column'].map(encoding)
## one-hot encoding
df = pd.get_dummies(df, columns = ['col1', 'col2'])
## feature generation
df['col1_col2'] = df['col1'].astype(str) + '_' + df['col2'].astype(str)
Datetime and coordinates
Datetime features can be divided into two main categories: time moments in a period and time since an event.
1. Periodicity
Time moments in a period can include day numbers in a week, month, season, year, second, minute, hour, etc.
Extracting these features is useful for capturing repetitive patterns in the data.
2. Time since
- Row-independent moment, e.g., time since 1 January 1970
- Row-dependent important moment, e.g., the number of days left until the next holiday
Below is a real-world example combining time moments and time since.
3. Difference between dates
Below is an example of a date difference for a churn prediction case.
From these methods, you’ll end up with numerical features (e.g., time passed since 2020) and categorical features (e.g., day of the week), so you’ll have to apply the preprocessing steps mentioned previously.
Coordinates
Let’s say you’re predicting house rental prices and have access to geographical data. Here’s what you can do.
1. Distances
If you have access to such data, you can calculate the distance to important points on the map — hospitals, attractions, best schools in the neighborhood, etc.
If not, you can organize the data points into clusters and use the cluster centers as the important points.
2. Aggregated statistics
Take an area and calculate statistics around a particular point.
For example, the number of flats or the mean rental price in that area (a sketch for both ideas follows after the datetime code below).
Code
# pip install holidays
import holidays
import datetime
us_holidays = holidays.US()
# ex: 2016-01-02 01:00:00
df['weekday'] = df.Date.dt.weekday
df['year'] = df.Date.dt.year
df['quarter'] = df.Date.dt.quarter
df['weekofyear'] = df.Date.dt.isocalendar().week
df['dayofweek'] = df.Date.dt.dayofweek
df['dayofweek_name'] = df.Date.dt.day_name()
df['month'] = df.Date.dt.month
df['day'] = df.Date.dt.day
df['hour'] = df.Date.dt.hour
df['second'] = df.Date.dt.second
df['minute'] = df.Date.dt.minute
df["is_holiday"] = df.Date.dt.floor('d').isin(us_holidays)
df['is_weekend'] = np.where(df['dayofweek_name'].isin(['Sunday','Saturday']), 1, 0)
# time since today
df['time_since'] = datetime.datetime.today() - df.Date
# difference between two dates
df['diff_time'] = (df['date_1'] - df['date_2'])
df['diff_days'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'D')
df['diff_weeks'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'W')
df['diff_months'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'M')
df['diff_years'] = (df['date_1'] - df['date_2']) / np.timedelta64(1, 'Y')
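The coordinate ideas above aren’t covered by the snippet, so here’s a hedged sketch — the lat/lon/price columns and the hospital location are hypothetical:
import numpy as np
from sklearn.cluster import KMeans
# 1. distance to an important point (rough Euclidean distance on coordinates)
hospital = (40.75, -73.98)
df['dist_to_hospital'] = np.sqrt((df['lat'] - hospital[0])**2 + (df['lon'] - hospital[1])**2)
# ...or cluster the data points and use the cluster centers as important points
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(df[['lat', 'lon']])
centers = kmeans.cluster_centers_
dists = np.linalg.norm(df[['lat', 'lon']].values[:, None, :] - centers[None, :, :], axis=2)
df['dist_to_center'] = dists.min(axis=1)
# 2. aggregated statistics around an area (here: per cluster)
df['cluster'] = kmeans.labels_
df['mean_price_in_cluster'] = df.groupby('cluster')['price'].transform('mean')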
Missing Values
Missing values can come in many forms — empty strings, NaNs, or placeholder values such as -999.
Sometimes they can contain useful information, such as the reason for missing values occurring at this point in the data.
There are four types of missing data, and figuring out which type you are dealing with can motivate your imputation strategy.
- Structurally Missing Data
- MCAR (Missing Completely at Random)
- MAR (Missing at Random)
- NMAR (Not Missing at Random)
Hidden NaNs
Sometimes missing values can be hidden, and we must plot the data to find them.
For example, in the left plot above, it’s clear that -1 is a missing value as the data distribution is between 0 and 1.
In the right plot, the peak tells us that it has been filled with the mean value.
Imputation approaches
Here are some approaches to filling missing values.
1. replace with -999, -1, etc
Pros: Gives tree models the possibility to treat missing values as a separate category
Cons: The performance of non-tree-based models can suffer
2. mean, median
Pros: Beneficial for non-tree-based models
Cons: Bad for tree-based models, since it becomes harder to separate objects that originally had missing values
3. Isnull feature
Pros: works for both tree-based and non-tree-based models (see the sketch after the code below)
Cons: doubles the number of columns
4. reconstruct the value
When there is a logical order, values are easy to reconstruct, e.g., time series data can be imputed from nearby values.
💡Avoid filling NaNs before feature generation as it can cause features to be less useful.
More on working with missing data.
Code
import numpy as np
# replace with nan
df.replace({"-":np.nan, "?":np.nan}, inplace=True)
# replace with -999
df["column"].fillna(-999, inplace = True)
# replace with mean/median
df["age"].fillna(df["age"].mean(), inplace=True)
df["age"].fillna(df["age"].median(), inplace=True)
# other methods
df.ffill() # forward fill
df.bfill() # backward fill
df.interpolate(method ='linear') # linear interpolation
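The indicator-column idea from option 3 above could look like this — a minimal sketch using the same hypothetical age column (create the flag before imputing):
df['age_isnull'] = df['age'].isnull().astype(int)  # 1 where the value was missing
df['age'] = df['age'].fillna(df['age'].median())   # then impute as usual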
Text -> Vectors
Text data often appears alongside numerical and categorical features. Here’s how to make use of it.
First, the text has to be cleaned before it is turned into numbers.
Text preprocessing
1. lowercase
ex: “HI THERE” → “hi there”
2. lemmatization and stemming:
Used to reduce inflectional forms and sometimes derivationally related word forms to a common base form.
ex: democracy, democratic, democratization →
Lemmatization: democracy
Stemming: democr
3. stopwords removal
Stopwords are words that carry little useful information for the model — typically articles, prepositions, and other very common words.
ex: I, me, my, myself, we, our, ours
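Here’s a minimal sketch of these three preprocessing steps using NLTK — one possible implementation, not the only one:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('stopwords')
nltk.download('wordnet')
text = "I am EXCITED about democratization"
tokens = text.lower().split()                                        # 1. lowercase (naive tokenization)
tokens = [t for t in tokens if t not in stopwords.words('english')]  # 3. stopword removal
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]          # 2. lemmatization
stems = [PorterStemmer().stem(t) for t in tokens]                    # 2. stemming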
Once the text is preprocessed, it’s time to turn it into vectors.
Bag of words
What: Counts the number of occurrences of unique terms
As seen in the example above, every row corresponds to a document/text and every column to a unique word.
Method: sklearn.feature_extraction.text.CountVectorizer
Non-tree-based models depend on the scaling of features, so post-processing the text features can make samples more comparable and boost important features while shrinking useless ones.
A popular method is TF-IDF.
TF-IDF
Term frequency — normalizes each row so that it sums to one, turning raw counts into frequencies and making documents more comparable.
As seen above, the occurrences are converted to frequencies, and the values in each row sum up to one.
Inverse document frequency — scales each column by the inverse of the fraction of documents containing that word. Frequent words are scaled down relative to rarer ones, boosting the more informative features.
As you can see, the word “excited” appears in all documents, so the TF-IDF transformation scaled down the importance of that feature.
Method: sklearn.feature_extraction.text.TfidfVectorizer
N-grams
An n-gram is a contiguous sequence of n words.
The idea is to use n consecutive words as features instead of single, unique terms.
Building n-gram features helps the model use the local context around each word.
N-gram features are also typically sparse: they are counts of word occurrences, and most words or n-grams do not appear in a given document.
For example, if we count occurrences of every word in an English dictionary in our everyday speech, most of the counts will be zero — that is sparsity.
Word2vec
Word2Vec, a popular method of creating word embeddings, has many applications such as text similarity, classification, sentiment analysis, etc.
Word embeddings provide an efficient and dense representation of words in which similar words have a similar encoding.
Below is a popular example.
We have the words “King”, “Man”, “Queen”, and “Woman”.
Intuitively, something these words have in common is gender. With embeddings, Word2Vec can capture this hidden relationship between the words, as shown below.
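With a trained gensim model (like the one in the code at the end of this section), that relationship can be queried roughly like this — a sketch, and the result depends entirely on the training corpus:
# king - man + woman ≈ queen
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)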
BOW vs. Word2vec
BOW
- very large vectors
- meaning of each value in the vector is known
- no. of features usually equal to the number of unique words
Word2Vec
- Relatively small vectors
- values in the vector are interpretable only in some cases
- words with similar meanings often have similar embeddings (closer to each other)
- number of features restricted to a constant
Code
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
# bag of words
# Transforms text into a sparse matrix of n-gram counts.
vect = CountVectorizer()
word_counts = vect.fit_transform(corpus)
# tfidf
# Transform a count matrix to a normalized tf or tf-idf representation.
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(word_counts)
tfidf
# BOW + TFIDF
# Convert a collection of raw documents to a matrix of TF-IDF features
vect = TfidfVectorizer()
X_tfidf = vect.fit_transform(corpus)
# ngrams
from nltk import ngrams
sentence = "this is a sentence"
bigrams = ngrams(sentence.split(), 2)
trigrams = ngrams(sentence.split(), 3)
# word2vec
from gensim.models import Word2Vec
# note: here corpus must be tokenized, i.e., a list of lists of tokens
# CBOW approach by default (sg=0)
model = Word2Vec(corpus, min_count=1, vector_size=100, window=5, sg=0)
# skip-gram
model = Word2Vec(corpus, min_count=1, vector_size=100, window=5, sg=1)
Images -> vectors
Convolutional neural networks are particularly well suited to image, speech, and audio inputs.
Like word2vec for text, they can produce a compressed representation of an image.
Descriptors
In neural networks, we have inputs and outputs. The output for the last layer gives us the predictions that we want. However, we also have outputs from the neural network’s inner (hidden) layers. We call these outputs descriptors.
Descriptors from later layers in the neural network are useful for solving tasks similar to the one the network was trained on.
On the other hand, descriptors from early layers have more task-independent information.
For example, if your network was trained on image datasets such as ImageNet, you can use its last layer representation in some object classification tasks.
But if you want to use your network to classify, say, medical images, it is better to use an earlier layer or to train the network from scratch.
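As an illustration, descriptors could be pulled from a pretrained network like this — a PyTorch sketch, since the article doesn’t fix a framework or layer:
import torch
from torchvision import models
vgg = models.vgg16(weights="IMAGENET1K_V1")  # pretrained on ImageNet
vgg.eval()
# drop the final 1000-class layer; what remains outputs a 4096-d descriptor per image
descriptor_net = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                                     *list(vgg.classifier.children())[:-1])
with torch.no_grad():
    descriptor = descriptor_net(torch.randn(1, 3, 224, 224))  # shape: (1, 4096)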
The image below shows the VGG-16 architecture, which was trained to predict the 1,000 ImageNet classes. You can tell by looking at the output size at the very end — 1x1x1000.
This network’s descriptors have learned information about that specific task. If we want to apply the network to a new, smaller dataset, we can do so by slightly tuning the network — a process called fine-tuning.
Fine-tuning
By using a pre-trained model, we can use the knowledge already encoded in the network parameters, which can lead to better results and a faster retraining procedure.
For the VGG-16 network, fine-tuning is done by removing the last 1,000-way layer and replacing it with one of size four (for a four-class task). The learning rate is also reduced to about 1/1000 of the initial learning rate.
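A minimal PyTorch sketch of that idea (the four-class head and the exact learning rate are illustrative):
import torch
from torchvision import models
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = torch.nn.Linear(4096, 4)  # replace the 1000-way layer with a 4-class head
# fine-tune with a much smaller learning rate than training from scratch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)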
Image augmentation
A lot of data is needed to build a good model, and often there isn’t enough of it.
One way to increase it is by performing image augmentation.
Image augmentation is a process of creating new training examples from existing ones.
For example, you could augment an image by making it a little brighter, cropping it, mirroring it, changing the contrast, etc.
Below is an example of image augmentation. These six transformations alone can increase the size of a dataset by six times.
More data also helps avoid overfitting, allowing you to train more robust models with better results.
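With torchvision, the transformations mentioned above could be wired up like this — one possible setup, not the article’s exact pipeline:
from torchvision import transforms
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # mirroring
    transforms.RandomResizedCrop(224),                      # cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness / contrast
    transforms.ToTensor(),
])
# applied on the fly during training, so each epoch sees slightly different images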
Thanks for reading!
Liked this article? Here are three articles you may like:
- Predicting Rain with Machine Learning
- Using Data Science to Predict Viral Tweets
- 40 Useful Pandas Snippets. Pandas snippets that come in handy in data analysis work
Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!
Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!