<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[KantCodes.com]]></title><description><![CDATA[Data Scientist by the day . . . Batman by Knight!]]></description><link>https://kantcodes.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 11:04:59 GMT</lastBuildDate><atom:link href="https://kantcodes.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The Complete Guide to Encoding Categorical Features]]></title><description><![CDATA[Introduction
In the world of data analysis and machine learning, data comes in all shapes and sizes.
Categorical data is one of the most common forms of data that you will encounter in your data science journey. It represents discrete, distinct categ...]]></description><link>https://kantcodes.com/complete-guide-to-encoding-categorical-features</link><guid isPermaLink="true">https://kantcodes.com/complete-guide-to-encoding-categorical-features</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[exploratory data analysis]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 30 Jan 2024 06:00:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GnvurwJsKaY/upload/e3a471123a78381ccc51458ec9ab8820.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the world of data analysis and machine learning, data comes in all shapes and sizes.</p>
<p><strong>Categorical data</strong> is one of the most common forms of data that you will encounter in your data science journey. It represents discrete, distinct categories or labels, and it's an essential part of many real-world datasets.</p>
<p>In this article, we will discuss the best techniques to encode categorical features in great detail along with their code implementations. We will also discuss the best practices and how to select the right encoding technique.</p>
<p>The objective of this article is to serve as a ready reference for whenever you wish to encode categorical features in your dataset.</p>
<h2 id="heading-why-do-we-need-to-encode-categorical-features">Why do we need to Encode Categorical Features?</h2>
<p>Many machine learning algorithms require numerical input.</p>
<p>Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.</p>
<h2 id="heading-types-of-categorical-features">Types of Categorical Features</h2>
<p>Categorical features are encoded based on their type and function. They can be broadly divided into two categories: <strong>Nominal</strong> and <strong>Ordinal</strong>.</p>
<p><a target="_blank" href="https://kantschants.com/data-complete-guide"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125110224/b193acad-2dfd-4819-b2f7-8f8829943336.png?auto=compress,format&amp;format=webp" alt /></a></p>
<ol>
<li><h3 id="heading-nominal-categorical-features">Nominal Categorical Features</h3>
</li>
</ol>
<p>Nominal features are those where the categories have no inherent order or ranking.</p>
<p>For example, the colors of cars (red, blue, green) are nominal because there's no natural order to them.</p>
<ol start="2">
<li><h3 id="heading-ordinal-categorical-features">Ordinal Categorical Features</h3>
</li>
</ol>
<p>Ordinal features are those where the categories have a meaningful order or rank.</p>
<p>Think of education levels (high school, bachelor's, master's, Ph.D.), for which there is a clear ranking.</p>
<blockquote>
<p>Learn more about categorical data &amp; other types of data from the below resource.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://kantschants.com/data-complete-guide#heading-21-categorical-data">https://kantschants.com/data-complete-guide#heading-21-categorical-data</a></div>
<p> </p>
</blockquote>
<h2 id="heading-challenges-with-categorical-features">Challenges with Categorical Features</h2>
<p>Categorical data brings its own set of challenges when it comes to data analysis and machine learning. Here are some key challenges:</p>
<ul>
<li><p><strong>Numerical Requirement</strong>: Many machine learning algorithms require numerical input. Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.</p>
</li>
<li><p><strong>Curse of Dimensionality</strong>: One-hot encoding, a common technique, can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.</p>
</li>
<li><p><strong>Multicollinearity</strong>: In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.</p>
</li>
<li><p><strong>Data Sparsity</strong>: When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.</p>
</li>
</ul>
<h2 id="heading-what-we-will-cover-today">What will we cover today?</h2>
<p>The encoding techniques we will discuss today are listed below:</p>
<ol>
<li><p>Label Encoding</p>
</li>
<li><p>One-hot Encoding</p>
</li>
<li><p>Binary Encoding</p>
</li>
<li><p>Ordinal Encoding</p>
</li>
<li><p>Frequency Encoding or Count Encoding</p>
</li>
<li><p>Target Encoding or Mean Encoding</p>
</li>
<li><p>Feature Hashing or Hashing Trick</p>
</li>
</ol>
<p>Let us discuss each in detail.</p>
<ol>
<li><h2 id="heading-label-encoding">Label Encoding</h2>
</li>
</ol>
<p>Label encoding is one of the fundamental techniques for converting categorical data into a numerical format.</p>
<p>It is a simple yet effective method that assigns a unique integer to each category in a feature.</p>
<h3 id="heading-how-it-works">How it works</h3>
<p>Imagine a feature 'Size' that has the following labels: 'Small', 'Medium', and 'Large'. This is an ordinal categorical feature as there is an inherent order in the labels.</p>
<p>We can encode these labels as follows:</p>
<ul>
<li><p>Small → 0</p>
</li>
<li><p>Medium → 1</p>
</li>
<li><p>Large → 2</p>
</li>
</ul>
<h3 id="heading-code-implementation">Code Implementation</h3>
<p>Let us look at the code implementation for Label Encoding.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> LabelEncoder

<span class="hljs-comment"># Sample data</span>
data = [<span class="hljs-string">"Small"</span>, <span class="hljs-string">"Medium"</span>, <span class="hljs-string">"Large"</span>, <span class="hljs-string">"Medium"</span>, <span class="hljs-string">"Small"</span>]
print(data)     <span class="hljs-comment"># Output: ['Small', 'Medium', 'Large', 'Medium', 'Small']</span>

<span class="hljs-comment"># Initialize the label encoder</span>
label_encoder = LabelEncoder()

<span class="hljs-comment"># Fit and transform the data</span>
<span class="hljs-comment"># NOTE: LabelEncoder assigns integers in alphabetical order of the labels</span>
<span class="hljs-comment"># ('Large' -&gt; 0, 'Medium' -&gt; 1, 'Small' -&gt; 2), not by ordinal rank</span>
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  <span class="hljs-comment"># Output: [2 1 0 1 2]</span>
</code></pre>
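<p>Note that scikit-learn's <code>LabelEncoder</code> assigns integers in alphabetical order of the labels, so it will not reproduce the intended Small → 0, Medium → 1, Large → 2 ranking. If you need the integers to follow a specific order, a plain dictionary map is a simple alternative (a minimal sketch):</p>

```python
import pandas as pd

# Sample data
data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Explicit mapping that follows the intended rank
size_mapping = {'Small': 0, 'Medium': 1, 'Large': 2}
data['Encoded_Size'] = data['Size'].map(size_mapping)
print(data['Encoded_Size'].tolist())  # Output: [0, 1, 2, 1, 0]
```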
<h3 id="heading-when-to-use-label-encoding">When to Use Label Encoding?</h3>
<p>Label encoding is a suitable choice for:</p>
<ul>
<li><p>Ordinal data or features with a clear and meaningful order.</p>
</li>
<li><p>Cases where you want to avoid increasing the dimensionality of the dataset.</p>
</li>
</ul>
<ol start="2">
<li><h2 id="heading-one-hot-encoding-or-dummy-encoding">One-Hot Encoding or Dummy Encoding</h2>
</li>
</ol>
<p>One-hot encoding, also popularly known as dummy encoding, is a widely used technique for converting categorical data into a numerical format.</p>
<p>It's particularly suitable for nominal categorical features, where the categories have no inherent order or ranking.</p>
<h3 id="heading-how-it-works-1">How it works</h3>
<p>One-hot encoding transforms each label (or category) in a categorical feature into a binary column.</p>
<p>Each binary column corresponds to a specific category and indicates the presence (1) or absence (0) of that category in the original feature.</p>
<p>For example, consider a categorical feature "Color" with three labels: "Red," "Green," and "Blue." One-hot encoding would create three binary columns like this:</p>
<ul>
<li><p>"Red" → [1, 0, 0]</p>
</li>
<li><p>"Green" → [0, 1, 0]</p>
</li>
<li><p>"Blue" → [0, 0, 1]</p>
</li>
</ul>
<h3 id="heading-code-implementation-1">Code Implementation</h3>
<p>Let us look at the code implementation for One-Hot Encoding.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Color'</span>: [<span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>, <span class="hljs-string">'Blue'</span>, <span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>]})

<span class="hljs-comment"># Perform one-hot encoding</span>
encoded_data = pd.get_dummies(data, columns=[<span class="hljs-string">'Color'</span>])
</code></pre>
<p>The output will look like below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706537791318/4d84f1e0-a5f9-4063-b231-72acdc4932ed.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-advantages-of-one-hot-encoding">Advantages of One-Hot Encoding</h3>
<p>The primary advantage of one-hot encoding is that it maintains the distinctiveness of labels and prevents any unintended ordinality.</p>
<p>Each label becomes a separate feature, and the presence or absence of a category is explicitly represented.</p>
<h3 id="heading-when-to-use">When to Use?</h3>
<p>One-hot encoding is an appropriate choice when:</p>
<ul>
<li><p>Dealing with nominal data with no meaningful order among labels.</p>
</li>
<li><p>Maintaining the distinction between categories (or labels) is crucial, and no ordinality must be introduced.</p>
</li>
<li><p>It handles missing values gracefully: the absence of a category results in all zeros across the one-hot encoded columns.</p>
</li>
</ul>
<h3 id="heading-challenges-with-one-hot-encoding">Challenges with one-hot encoding</h3>
<ol>
<li><h4 id="heading-dummy-variable-trap">Dummy Variable Trap 💡</h4>
</li>
</ol>
<p>Be aware of the "dummy variable trap," where multicollinearity can occur if one column can be predicted from the others.</p>
<p>To avoid this, you can safely drop one of the one-hot encoded columns, reducing the dimensionality by one. You can set <code>drop_first=True</code> in the <code>get_dummies</code> function as shown below.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Color'</span>: [<span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>, <span class="hljs-string">'Blue'</span>, <span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>]})

<span class="hljs-comment"># Perform one-hot encoding</span>
encoded_data = pd.get_dummies(data, columns=[<span class="hljs-string">'Color'</span>], drop_first=<span class="hljs-literal">True</span>)
</code></pre>
<p>Output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538220635/31418b98-033e-40c9-b3f4-f3b77071afd1.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><h4 id="heading-curse-of-dimensionality">Curse of Dimensionality</h4>
</li>
</ol>
<p>One-hot encoding can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.</p>
<ol start="3">
<li><h4 id="heading-multicollinearity">Multicollinearity</h4>
</li>
</ol>
<p>In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.</p>
<ol start="4">
<li><h4 id="heading-data-sparsity">Data Sparsity</h4>
</li>
</ol>
<p>When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.</p>
<ol start="3">
<li><h2 id="heading-binary-encoding">Binary Encoding</h2>
</li>
</ol>
<p>Binary encoding is a versatile technique for encoding categorical features, especially when dealing with high-cardinality data.</p>
<p>It combines the benefits of one-hot and label encoding while reducing dimensionality.</p>
<h3 id="heading-how-it-works-2">How it works</h3>
<p>Binary encoding works by converting each category into binary code and representing it as a sequence of binary digits (<strong>0</strong>s and <strong>1</strong>s).</p>
<p>Each binary digit is then placed in a separate column, effectively creating a set of binary columns for each category.</p>
<p>The encoding process is as follows:</p>
<ol>
<li><p>Assign a unique integer to each category, similar to label encoding.</p>
</li>
<li><p>Convert the integer to binary code.</p>
</li>
<li><p>Create a set of binary columns to represent the binary code.</p>
</li>
</ol>
<p>For example, consider a categorical feature "Country" with categories "USA," "Canada," and "UK."</p>
<p>Binary encoding would involve assigning unique integers to each country (e.g., "USA" -&gt; 1, "Canada" -&gt; 2, "UK" -&gt; 3) and then converting these integers to binary code. The binary digits (0s and 1s) are then placed in separate binary columns:</p>
<ul>
<li><p>"USA" → 1 → 001 → [0, 0, 1]</p>
</li>
<li><p>"Canada" → 2 → 010 → [0, 1, 0]</p>
</li>
<li><p>"UK" → 3 → 011 → [0, 1, 1]</p>
</li>
</ul>
<h3 id="heading-code-implementation-2">Code Implementation</h3>
<p>Let us go through an example in Python.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Country'</span>: [<span class="hljs-string">'USA'</span>, <span class="hljs-string">'Canada'</span>, <span class="hljs-string">'UK'</span>, <span class="hljs-string">'USA'</span>, <span class="hljs-string">'UK'</span>]})

<span class="hljs-comment"># Initialize the binary encoder</span>
encoder = ce.BinaryEncoder(cols=[<span class="hljs-string">'Country'</span>])

<span class="hljs-comment"># Fit and transform the data</span>
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>The output is below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538397980/24d2095f-d86c-4356-abf7-2ca0395ad209.png" alt class="image--center mx-auto" /></p>
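<p>To see what the encoder is doing under the hood, the three steps described earlier can be reproduced manually. This is only an illustrative sketch following the article's example mapping; the library's exact integer assignment and column order may differ:</p>

```python
import pandas as pd

data = pd.DataFrame({'Country': ['USA', 'Canada', 'UK', 'USA', 'UK']})

# Step 1: assign a unique integer to each category
codes = {'USA': 1, 'Canada': 2, 'UK': 3}

# Step 2: convert each integer to a fixed-width binary string
binary = data['Country'].map(lambda c: format(codes[c], '03b'))

# Step 3: place each binary digit in its own column
for i in range(3):
    data[f'Country_{i}'] = binary.str[i].astype(int)

print(data)
```

<p>For example, "UK" receives the integer 3, whose 3-bit binary form is <code>011</code>, giving the columns [0, 1, 1].</p>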
<h3 id="heading-advantages">Advantages</h3>
<ul>
<li><p>It combines the advantages of both one-hot encoding and label encoding, efficiently converting categorical data into a binary format.</p>
</li>
<li><p>It is memory-efficient and mitigates the curse of dimensionality.</p>
</li>
<li><p>Finally, it is easy to implement and interpret.</p>
</li>
</ul>
<h3 id="heading-when-to-use-1">When to Use?</h3>
<p>Binary encoding is a suitable choice when:</p>
<ul>
<li><p>Dealing with high-cardinality categorical features (features with a large number of unique categories).</p>
</li>
<li><p>You want to reduce the dimensionality compared to one-hot encoding, especially for features with many unique categories.</p>
</li>
</ul>
<ol start="4">
<li><h2 id="heading-ordinal-encoding">Ordinal Encoding</h2>
</li>
</ol>
<p>As the name suggests, Ordinal Encoding encodes the categories in an ordinal feature by mapping them to integer values in ascending order of rank.</p>
<h3 id="heading-how-it-works-3">How it Works</h3>
<p>The process of ordinal encoding involves mapping each category to a unique integer, typically based on their order or rank.</p>
<p>Consider an ordinal feature "Education Level" with categories: "High School," "Associate's Degree," "Bachelor's Degree," "Master's Degree," and "PhD".</p>
<p>Ordinal encoding will assign integer values as follows:</p>
<ul>
<li><p>"High School" → 0</p>
</li>
<li><p>"Associate's Degree" → 1</p>
</li>
<li><p>"Bachelor's Degree" → 2</p>
</li>
<li><p>"Master's Degree" → 3</p>
</li>
<li><p>"PhD" → 4</p>
</li>
</ul>
<p>These integer values reflect the ordinal relationship between the education levels.</p>
<h3 id="heading-code-implementation-3">Code Implementation</h3>
<p>Here's how we implement Ordinal Encoding in Python.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">"Education Level"</span>: [<span class="hljs-string">"High School"</span>, <span class="hljs-string">"Bachelor's Degree"</span>, <span class="hljs-string">"Master's Degree"</span>, <span class="hljs-string">"PhD"</span>, <span class="hljs-string">"Associate's Degree"</span>]})

<span class="hljs-comment"># Define the ordinal encoding mapping</span>
education_mapping = {
    <span class="hljs-string">'High School'</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"Associate's Degree"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"Bachelor's Degree"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"Master's Degree"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">'PhD'</span>: <span class="hljs-number">4</span>
}

<span class="hljs-comment"># Perform ordinal encoding</span>
encoder = ce.OrdinalEncoder(mapping=[{<span class="hljs-string">'col'</span>: <span class="hljs-string">'Education Level'</span>, <span class="hljs-string">'mapping'</span>: education_mapping}])
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>Output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538635525/91153028-e373-491c-b75a-8258d39e3c2b.png" alt class="image--center mx-auto" /></p>
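<p>If you prefer to stay within scikit-learn, the same result can be obtained by passing the desired order to <code>OrdinalEncoder</code> via its <code>categories</code> parameter (a sketch):</p>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The position of each level in this list becomes its encoded value
levels = ['High School', "Associate's Degree", "Bachelor's Degree",
          "Master's Degree", 'PhD']

data = pd.DataFrame({'Education Level': ['High School', 'PhD', "Master's Degree"]})

encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(data)
print(encoded.ravel())  # Output: [0. 4. 3.]
```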
<h3 id="heading-advantages-1">Advantages</h3>
<ul>
<li><p>It captures and preserves the ordinal relationships between categories, which can be valuable for certain types of analyses.</p>
</li>
<li><p>It reduces the dimensionality of the dataset compared to one-hot encoding.</p>
</li>
<li><p>It provides a numerical representation of the data, making it suitable for many machine learning algorithms.</p>
</li>
</ul>
<h3 id="heading-when-to-use-ordinal-encoding">When to Use Ordinal Encoding</h3>
<p>Ordinal encoding is an appropriate choice when:</p>
<ul>
<li><p>Dealing with categorical features that exhibit a clear and meaningful order or ranking.</p>
</li>
<li><p>Preserving the ordinal relationship among categories is essential for your analysis or model.</p>
</li>
<li><p>You want to convert the data into a numerical format while maintaining the inherent order of the categories.</p>
</li>
</ul>
<ol start="5">
<li><h2 id="heading-frequency-encoding-or-count-encoding">Frequency Encoding or Count Encoding</h2>
</li>
</ol>
<p>Frequency encoding, also known as count encoding, is a technique that encodes categorical features based on the frequency of each category in the dataset.</p>
<p>This method assigns each category a numerical value representing how often it occurs. It's a straightforward approach that can be effective in certain scenarios.</p>
<p>Categories that appear more frequently receive higher values, while less common categories receive lower values. This provides a numerical representation of the categories based on their prevalence.</p>
<h3 id="heading-how-it-works-4">How it works</h3>
<p>The process involves mapping each category to its frequency or count within the dataset.</p>
<p>Consider a categorical feature "City" with categories "New York," "Los Angeles," "Chicago," and "San Francisco." If "New York" appears 50 times, "Los Angeles" 30 times, "Chicago" 20 times, and "San Francisco" 10 times, frequency encoding will assign values as follows:</p>
<ul>
<li><p>"New York" → 50</p>
</li>
<li><p>"Los Angeles" → 30</p>
</li>
<li><p>"Chicago" → 20</p>
</li>
<li><p>"San Francisco" → 10</p>
</li>
</ul>
<blockquote>
<p>💡 NOTE</p>
<p>Frequency or Count Encoding is especially effective where the frequency of categories in a feature has a significant impact.</p>
<p>It should not be applied to ordinal categorical features.</p>
</blockquote>
<h3 id="heading-code-implementation-4">Code Implementation</h3>
<p>The implementation here is pretty straightforward.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'City'</span>: [<span class="hljs-string">'New York'</span>, <span class="hljs-string">'Los Angeles'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'New York'</span>, <span class="hljs-string">'Los Angeles'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'New York'</span>, <span class="hljs-string">'New York'</span>]})

<span class="hljs-comment"># frequency encoding</span>
frequency_encoding = data[<span class="hljs-string">'City'</span>].value_counts().to_dict()
data[<span class="hljs-string">'Encoded_City'</span>] = data[<span class="hljs-string">'City'</span>].map(frequency_encoding)
</code></pre>
<p>Output below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706540513342/07d8aa28-2ee3-40ba-aadd-8a9f402ea929.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-advantages-of-frequency-encoding">Advantages of Frequency Encoding</h3>
<p>Frequency encoding offers the following advantages:</p>
<ul>
<li><p>It encodes categorical data in a straightforward and interpretable way, preserving the count information.</p>
</li>
<li><p>Particularly useful when the frequency of categories is a relevant feature for the problem you're solving.</p>
</li>
<li><p>It reduces dimensionality compared to one-hot encoding, which can be beneficial in high-cardinality scenarios.</p>
</li>
</ul>
<h3 id="heading-when-to-use-frequency-encoding">When to Use Frequency Encoding</h3>
<p>Frequency encoding is an appropriate choice when:</p>
<ul>
<li><p>Analyzing categorical features where the frequency of each category is relevant information for your model.</p>
</li>
<li><p>Reducing the dimensionality of the dataset compared to one-hot encoding while preserving the information about category frequency.</p>
</li>
</ul>
<ol start="6">
<li><h2 id="heading-target-encoding-or-mean-encoding">Target Encoding or Mean Encoding</h2>
</li>
</ol>
<p>Target encoding, also known as Mean Encoding, is a powerful technique that encodes categorical features using information from the target variable.</p>
<p>It assigns a numerical value to each category based on the mean of the target variable within that category.</p>
<p>Target encoding is particularly useful in classification problems. It captures how likely each category is to result in the target variable taking a specific value.</p>
<h3 id="heading-how-target-encoding-works">How Target Encoding Works</h3>
<p>The process of target encoding involves mapping each category to the mean of the target variable for data points within that category. This encoding method provides a direct relationship between the categorical feature and the target variable.</p>
<p>Consider a categorical feature "Region" with categories "North," "South," "East," and "West." If we're dealing with a binary classification problem where the target variable is "Churn" (0 for no churn, 1 for churn), target encoding might assign values as follows:</p>
<ul>
<li><p>"North" → Mean of "Churn" for data points in the "North" category</p>
</li>
<li><p>"South" → Mean of "Churn" for data points in the "South" category</p>
</li>
<li><p>"East" → Mean of "Churn" for data points in the "East" category</p>
</li>
<li><p>"West" → Mean of "Churn" for data points in the "West" category</p>
</li>
</ul>
<h3 id="heading-code-implementation-5">Code Implementation</h3>
<p>Here's a Python code example for target encoding using the <code>category_encoders</code> library:</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Region'</span>: [<span class="hljs-string">'North'</span>, <span class="hljs-string">'South'</span>, <span class="hljs-string">'East'</span>, <span class="hljs-string">'West'</span>, <span class="hljs-string">'North'</span>, <span class="hljs-string">'South'</span>], 
                     <span class="hljs-string">'Churn'</span>: [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>]})

<span class="hljs-comment"># Perform target encoding</span>
encoder = ce.TargetEncoder(cols=[<span class="hljs-string">'Region'</span>])
encoded_data = encoder.fit_transform(data, data[<span class="hljs-string">'Churn'</span>])
</code></pre>
<p>Output is shared below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706539064729/eb4320b9-990f-47ff-8547-161c03ed8e98.png" alt class="image--center mx-auto" /></p>
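<p>Under the hood, the raw (unsmoothed) version of this encoding is just a groupby mean. The sketch below shows that version; note that <code>category_encoders</code>' <code>TargetEncoder</code> additionally smooths each category mean toward the global mean, so its output will differ from these raw values:</p>

```python
import pandas as pd

data = pd.DataFrame({'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
                     'Churn': [0, 1, 0, 1, 0, 1]})

# Mean of the target within each category
category_means = data.groupby('Region')['Churn'].mean()
data['Encoded_Region'] = data['Region'].map(category_means)
print(data['Encoded_Region'].tolist())  # Output: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```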
<h3 id="heading-best-practices">Best Practices</h3>
<p>When using target encoding, consider the following best practices:</p>
<ul>
<li><p>Be cautious about potential data leakage, as the mean of the target variable is used in the encoding process. Ensure you're not using information from the test or validation set when encoding.</p>
</li>
<li><p>Use cross-validation or other techniques to prevent overfitting and improve the robustness of target encoding.</p>
</li>
</ul>
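<p>The cross-validation advice above is often implemented as out-of-fold target encoding: each row is encoded using category means computed only from the other folds, so no row ever sees its own target value. A minimal sketch:</p>

```python
import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({'Region': ['North', 'South', 'East', 'West'] * 5,
                     'Churn': [0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
                               0, 1, 0, 1, 0, 1, 1, 1, 0, 0]})

global_mean = data['Churn'].mean()
data['Encoded_Region'] = global_mean  # fallback for categories unseen in a fold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(data):
    # Category means computed on the training fold only
    fold_means = data.iloc[train_idx].groupby('Region')['Churn'].mean()
    encoded = data.iloc[val_idx]['Region'].map(fold_means).fillna(global_mean)
    data.loc[data.index[val_idx], 'Encoded_Region'] = encoded.values
```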
<h3 id="heading-advantages-of-target-encoding">Advantages of Target Encoding</h3>
<p>Target encoding offers several advantages:</p>
<ul>
<li><p>It captures the relationship between the categorical feature and the target variable, making it useful in classification problems.</p>
</li>
<li><p>It provides a direct and interpretable way to encode categorical features.</p>
</li>
<li><p>It reduces dimensionality compared to one-hot encoding while preserving valuable information about category-specific behavior.</p>
</li>
</ul>
<h3 id="heading-when-to-use-target-encoding">When to Use Target Encoding</h3>
<p>Target encoding is an appropriate choice when:</p>
<ul>
<li><p>Working with categorical features and a categorical target variable in classification problems.</p>
</li>
<li><p>You want to capture the relationship between the categorical feature and the target variable, helping the model make predictions based on category-specific behavior.</p>
</li>
</ul>
<ol start="7">
<li><h2 id="heading-feature-hashing-or-hashing-trick">Feature Hashing or Hashing Trick</h2>
</li>
</ol>
<p>A rather under-appreciated encoding technique, Feature Hashing, also known as the Hashing Trick, is a method used to encode high-cardinality categorical features efficiently.</p>
<p>It works by applying a hash function to the categorical data, reducing the dimensionality of the feature while still providing a numerical representation.</p>
<blockquote>
<p>💡 Feature hashing is particularly useful when dealing with large datasets with many unique categories.</p>
</blockquote>
<h3 id="heading-how-feature-hashing-works">How Feature Hashing Works</h3>
<p>The feature hashing process involves applying a hash function to the categorical data, which maps each category to a fixed number of numerical columns.</p>
<p>The hash function distributes the categories across these columns, and each category contributes to the values of multiple columns.</p>
<h3 id="heading-code-implementation-6">Code Implementation</h3>
<p>Let's implement this in Python.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Product Category'</span>: [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-string">'E'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>]})

<span class="hljs-comment"># Perform feature hashing with three columns</span>
encoder = ce.HashingEncoder(cols=[<span class="hljs-string">'Product Category'</span>], n_components=<span class="hljs-number">3</span>)
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>Output is shared below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706539418895/7cf94d9a-7333-412e-8400-b02b3ec129b6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-when-to-use-feature-hashing">When to Use Feature Hashing</h3>
<p>Feature hashing is an appropriate choice when:</p>
<ul>
<li><p>Dealing with high-cardinality categorical features that have too many unique categories to handle using one-hot encoding or other techniques.</p>
</li>
<li><p>Reducing the dimensionality of the dataset while retaining the essential information from the categorical feature.</p>
</li>
<li><p>Memory and computational resources are limited, making it challenging to work with a high number of binary columns.</p>
</li>
</ul>
<h2 id="heading-concluding-thoughts">Concluding thoughts</h2>
<p>This concludes our tour of the most useful encoding techniques for categorical variables in your data science and machine learning tasks.</p>
<p>Encoding data features is a crucial step in any machine learning pipeline and I hope that this article serves as a ready reference for all your upcoming projects.</p>
<p>Each technique has its strengths and is best suited for specific scenarios. Make sure to refer to the <strong>"When to Use"</strong> section for each encoding technique to apply the right feature encoding technique to your dataset.</p>
<p>The reference code is compiled for you in the notebook below.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/utkarshkant/Helpful-Python/blob/master/Encoding_Categorical_Features.ipynb">https://github.com/utkarshkant/Helpful-Python/blob/master/Encoding_Categorical_Features.ipynb</a></div>
<p> </p>
<hr />
<p>Hope you enjoyed this!</p>
<p>Feel free to reach out for any queries or feedback below or on my socials.</p>
<blockquote>
<ul>
<li><p><a target="_blank" href="https://www.linkedin.com/in/utkarsh-kant/">LinkedIn</a></p>
</li>
<li><p><a target="_blank" href="https://twitter.com/kantschants">X</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/@kantschants4139">YouTube</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/utkarshkant">Github</a></p>
</li>
</ul>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Assumptions of Linear Regression - Ace the Most Asked Interview Question]]></title><description><![CDATA[Introduction
Linear Regression is one of the most popular statistical models and machine learning algorithms. Considered the holy grail in the world of Data Science and Machine Learning.
It is one of the first (if not the first) algorithms that is th...]]></description><link>https://kantcodes.com/assumptions-of-linear-regression</link><guid isPermaLink="true">https://kantcodes.com/assumptions-of-linear-regression</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Linear Regression]]></category><category><![CDATA[interview]]></category><category><![CDATA[statistics]]></category><category><![CDATA[probability]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Fri, 11 Aug 2023 04:33:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1691655440955/d240154a-ceb7-47a7-80a7-dc76d400802e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p><strong>Linear Regression</strong> is one of the most popular statistical models and machine learning algorithms, considered the holy grail in the world of Data Science and Machine Learning.</p>
<p>It is one of the first (if not the first) algorithms taught in ML schools and courses alike.</p>
<p>However, one of the most important aspects that a lot of tutorials skip is that Linear Regression cannot be applied to all datasets alike. There are certain mandates that a dataset and its distribution must follow for Linear Regression to be successfully modeled to it.</p>
<p>These are popularly also known as the <strong>Assumptions of Linear Regression</strong>.</p>
<blockquote>
<p>💡 The assumptions of the Linear Regression model are a favorite interview question for Data Scientist and Machine Learning Engineer positions.</p>
</blockquote>
<p>In this article, we will not only list the different assumptions of a linear regression model but also discuss why they are so, and the rationale behind each of them.</p>
<blockquote>
<p>The prerequisite for this discussion is a good understanding of the Linear Regression algorithm itself.</p>
</blockquote>
<p>So let’s go! 🚀</p>
<h1 id="heading-a-quick-review-of-linear-regression">A quick review of Linear Regression</h1>
<p>We know that the Linear Regression model aims at establishing the <strong>best-fit line</strong> between the dependent and independent features of a dataset as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691657033586/174f5aad-4386-4422-a432-64afce961754.png" alt class="image--center mx-auto" /></p>
<p><em>Figure:</em> <code>y = 3 + 5x + np.random.rand(100, 1)</code></p>
<p>The Linear Regression model is defined as follows.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691676491989/582bf158-ade0-4260-91b5-324e228d37c6.png" alt class="image--center mx-auto" /></p>
<p>Now, let us discuss the assumptions of the Linear Regression model.</p>
<h1 id="heading-assumption-of-linear-regression-model">Assumption of Linear Regression Model</h1>
<p>The assumptions of Linear Regression are as follows:</p>
<ol>
<li><p><strong>Linearity</strong></p>
</li>
<li><p><strong>Homoscedasticity or Constant Error Variance</strong></p>
</li>
<li><p><strong>Independent Error Terms or No Autocorrelation</strong></p>
</li>
<li><p><strong>Normality of Residuals</strong></p>
</li>
<li><p><strong>No or Negligible Multi-collinearity</strong></p>
</li>
<li><p><strong>Exogeneity</strong></p>
</li>
</ol>
<blockquote>
<p>💡 NOTE</p>
<p>Different sources and textbooks might list a different number of assumptions of a linear regression model. And they are all correct.</p>
<p>However, <strong>the 6 assumptions that we will discuss today shall cover all of the different assumptions</strong>.</p>
<p>Many textbooks break individual assumptions into several finer-grained ones and can therefore list around 10 different assumptions.</p>
</blockquote>
<p>⭐ Think of these assumptions as guidelines: a dataset that satisfies them is highly suitable for a Linear Regression model.</p>
<p>Alright! Let’s discuss each of these assumptions in detail.</p>
<h2 id="heading-1-linearity">1. Linearity</h2>
<p>This essentially means that <strong>there must be a linear relationship between the dependent and the independent features</strong> of a dataset.</p>
<p>And this is fairly intuitive as the best-fit line of a linear regression model is a straight line, which is most suitable for linear data distribution.</p>
<p>Compare the two different distributions below:</p>
<ul>
<li><p><strong>Data is linearly distributed</strong></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691678534672/07ca5ee5-46b6-47d0-ab90-b1ab52902242.png" alt class="image--center mx-auto" /></p>
<p>  <em>Figure:</em> <code>y = 3 + 5x + np.random.rand(100, 1)</code></p>
</li>
<li><p><strong>Data is non-linearly distributed</strong></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691678638630/c6ccea0a-07a9-4b1e-801e-6ffe1e0f98c0.png" alt class="image--center mx-auto" /></p>
<p>  <em>Figure:</em> <code>y = 3 + 50x^2 + np.random.rand(100, 1)</code></p>
</li>
</ul>
<p>Comparing the two distributions, it is clear that the linear regression model is a better fit for the linearly distributed data.</p>
<h3 id="heading-how-to-detect-linearity-between-dependent-amp-independent-features">How to detect linearity between dependent &amp; independent features?</h3>
<p>Well, one way is to plot the data and detect it visually. However, in real-world scenarios, it may not be so simple to detect linearity in data.</p>
<p>The <strong>Likelihood Ratio (LR) Test</strong> is a good test for establishing linearity.</p>
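<p>As a quick first screen (a simple sketch on synthetic data, not a substitute for the formal LR test), the Pearson correlation between x and y already flags strong linear association:</p>

```python
def pearson_r(xs, ys):
    # Pearson correlation: +1 / -1 indicates a perfect linear relationship
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

x = [i / 10 for i in range(100)]
y_linear = [3 + 5 * xi for xi in x]            # same form as the linear figure above
y_quadratic = [3 + 50 * xi ** 2 for xi in x]   # same form as the non-linear figure

r_lin = pearson_r(x, y_linear)      # essentially 1.0
r_quad = pearson_r(x, y_quadratic)  # below 1 despite the monotone trend
```

<p>A correlation close to 1 (or -1) supports linearity, but a monotone curve can still score high, which is why the formal test and residual plots remain important.</p>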
<h2 id="heading-2-homoscedasticity-or-constant-error-variance">2. Homoscedasticity or Constant Error Variance</h2>
<p>The second assumption of linear regression is <strong>Homoscedasticity</strong>.</p>
<p>It means that the residuals (or error terms) should have constant variance across the range of predicted values; in other words, the error terms must be evenly spread, as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691679237886/8967aa2d-8fab-43d4-8d69-4b460aaac922.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: The residuals for a linearly distributed dataset have constant variance.</em></p>
<p>There are instances where the residuals are not evenly spread along the axis, and this condition is known as <strong>Heteroscedasticity</strong>. A few examples are shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691679292932/577dc895-06e7-4f9a-9993-d803b68cf61f.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Homoscedasticity vs Heteroscedasticity [</em><a target="_blank" href="https://d35fo82fjcw0y8.cloudfront.net/2016/06/03210521/homoscedasticity.png"><em>Source</em></a><em>]</em></p>
<p>When there is Heteroscedasticity in the data, the standard errors cannot be relied upon; hence, it is a violation of the assumptions of Linear Regression.</p>
<h3 id="heading-how-to-detect-heteroscedasticity-in-data">How to detect Heteroscedasticity in data?</h3>
<p>Apart from visually detecting it, there are statistical tests for determining Heteroscedasticity, the popular ones are:</p>
<ol>
<li><p><strong>Goldfeld-Quandt test</strong></p>
</li>
<li><p><strong>Breusch-Pagan test</strong></p>
</li>
</ol>
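<p>Here is a hedged sketch of the Goldfeld-Quandt intuition on simulated residuals (statsmodels ships ready-made versions of both tests, e.g. <code>het_goldfeldquandt</code> and <code>het_breuschpagan</code>):</p>

```python
import random
import statistics

random.seed(0)
x = [i / 100 for i in range(200)]
# Simulated residuals whose spread grows with x (i.e., heteroscedastic)
resid = [random.gauss(0, 0.1 + xi) for xi in x]

# Goldfeld-Quandt idea: order by x, drop the middle, compare tail variances
low_var = statistics.variance(resid[:80])
high_var = statistics.variance(resid[-80:])
ratio = high_var / low_var  # a ratio far above 1 signals heteroscedasticity
```

<p>For homoscedastic residuals the two tail variances would be similar and the ratio would hover near 1; the formal test turns this ratio into an F-statistic.</p>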
<h3 id="heading-how-to-remove-heteroscedasticity-in-data">How to remove Heteroscedasticity in data?</h3>
<p>There are certain ways to remove Heteroscedasticity from your data, some of them are:</p>
<ol>
<li><p><strong>White’s standard errors</strong>: Heteroscedasticity-robust (White’s) standard errors adjust the estimated standard errors so that inference remains valid under heteroscedasticity; the downside is that the confidence intervals around the coefficients of the independent features become wider.</p>
</li>
<li><p><strong>Weighted least squares</strong>: Weights each observation, typically by the inverse of its error variance, so that noisier observations contribute less to the fit. Choosing good weights can involve some trial and error.</p>
</li>
<li><p><strong>Log transformations</strong>: Many times a curved distribution can be converted into a linear distribution (i.e., a straight line) by simply applying the log function to it. Other transformations may work out as well.</p>
</li>
</ol>
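<p>As a small illustration of the log-transformation idea (on synthetic data): an exponential trend becomes an exact straight line after taking logs:</p>

```python
import math

x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]   # exponential growth: a curved trend

log_y = [math.log(yi) for yi in y]     # after the log transform: exactly linear in x
diffs = [b - a for a, b in zip(log_y, log_y[1:])]  # constant slope of 0.5
```

<p>Constant successive differences in <code>log_y</code> confirm the transformed series is linear, so a linear model (and its homoscedasticity assumption) is far more plausible on the log scale.</p>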
<h2 id="heading-3-independent-error-terms-or-no-autocorrelation">3. <strong>Independent Error Terms or No Autocorrelation</strong></h2>
<p>Here, the assumption states that each residual term is unrelated to the residual terms occurring before or after it. A good example of this is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680567582/39bd4c0b-f352-4819-b713-b60e97a5b607.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: The residuals for a linearly distributed dataset are independent of each other.</em></p>
<blockquote>
<p>💡 NOTE<br /><strong>Autocorrelation</strong> is the relation of the data series with itself, where the error term of the next data record is related to the residual of the previous data record.</p>
</blockquote>
<p>It is most often found in time-series data and not so prevalent in regular cross-sectional datasets. An example of a time-series distribution is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680482514/47c6443e-c1dd-4bff-abcb-814ba9b65c77.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Autocorrelation in time series data helps forecast future outcomes.</em></p>
<p>Therefore, it is not something you will encounter very often; however, when present, it violates the assumptions of linear regression.</p>
<p>With autocorrelation in the data, the standard error of the output becomes unreliable.</p>
<h3 id="heading-how-to-detect-autocorrelation">How to detect autocorrelation?</h3>
<p>There are a few tests for detecting autocorrelation in a dataset. Here are a few:</p>
<ol>
<li><p>ACF &amp; PACF plots</p>
</li>
<li><p>Durbin-Watson test</p>
</li>
</ol>
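<p>The Durbin-Watson statistic is simple enough to compute by hand. It ranges from 0 to 4: values near 2 suggest no autocorrelation, near 0 strong positive autocorrelation, and near 4 strong negative autocorrelation. A sketch on synthetic residuals:</p>

```python
def durbin_watson(residuals):
    # DW = sum of squared successive differences / sum of squared residuals
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(r ** 2 for r in residuals)
    return num / den

# Alternating residuals: strong negative autocorrelation -> DW near 4
alternating = [(-1) ** i for i in range(100)]
# Slowly drifting residuals: strong positive autocorrelation -> DW near 0
drifting = [i / 100 for i in range(100)]
```

<p>statsmodels exposes the same computation as <code>durbin_watson</code> in <code>statsmodels.stats.stattools</code>.</p>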
<h2 id="heading-4-normality-of-residuals">4. <strong>Normality of Residuals</strong></h2>
<p>This assumption states that the residuals (errors) of the model must be normally distributed.</p>
<p>If the normality of errors is violated and the number of records is small, then the standard errors in output are affected. That impacts the best-fit line of the model.</p>
<blockquote>
<p>💡 NOTE<br />This assumption generally is considered a weak assumption for Linear Regression models and slight (or greater) violations can be neglected while modeling. This is particularly true for large datasets.</p>
</blockquote>
<h3 id="heading-how-to-detect-normality-in-errors">How to detect normality in errors?</h3>
<p>There are multiple visual and statistical tests for detecting normality in error terms. Some of the popular ones are:</p>
<ol>
<li><p><strong>Histogram</strong></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680225933/ef4fdbb6-005d-49e6-8096-77d42d90dbc3.png" alt class="image--center mx-auto" /></p>
<p> <em>Figure: Residuals are normally distributed [</em><a target="_blank" href="https://cdn.aptech.com/www/uploads/2017/05/econ_tutorial_ols_resid_normality_1.png"><em>Source</em></a><em>]</em></p>
</li>
<li><p><strong>Q-Q Plot</strong></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680185473/40c74b01-4eaf-4dff-8aa7-2da3d74e9d5b.png" alt class="image--center mx-auto" /></p>
<p> <em>Figure: Q-Q plot for normally distributed errors [</em><a target="_blank" href="https://i.stack.imgur.com/NpI0O.png"><em>Source</em></a><em>]</em></p>
</li>
<li><p><strong>Shapiro-Wilk test</strong></p>
</li>
<li><p><strong>Kolmogorov-Smirnov test</strong></p>
</li>
<li><p><strong>Anderson-Darling test</strong></p>
</li>
</ol>
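<p>The formal tests above are available in scipy.stats (e.g. <code>shapiro</code>, <code>kstest</code>, <code>anderson</code>). As a dependency-free rough check (illustrative only), sample skewness near zero is consistent with, though does not prove, normality:</p>

```python
import random
import statistics

random.seed(1)
resid = [random.gauss(0, 1) for _ in range(1000)]

mean = statistics.fmean(resid)
sd = statistics.pstdev(resid)
# Sample skewness: close to 0 when the residuals are normally distributed
skew = sum((r - mean) ** 3 for r in resid) / (len(resid) * sd ** 3)
```

<p>Strongly skewed residuals (large positive or negative values here) would show up as an asymmetric histogram and a bowed Q-Q plot, like the figures above.</p>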
<h3 id="heading-how-to-bring-normality-in-errors">How to bring normality in errors?</h3>
<p>As mentioned above, this is a weak assumption and can be neglected in many cases as well.</p>
<p>However, some ways to bring normality in residuals are:</p>
<ol>
<li><p>Mathematical transformations like log transformations etc.</p>
</li>
<li><p>Standardization or normalization of the dataset</p>
</li>
<li><p>Adding more data reduces the need for normally distributed error terms</p>
</li>
</ol>
<h2 id="heading-5-no-multi-collinearity">5. <strong>No Multi-collinearity</strong></h2>
<p>Multi-collinearity occurs when 2 or more independent features of a dataset are correlated with each other.</p>
<p>Consider a house price dataset with multiple variables about the property and price being the target variable. There is a high chance that the features 'floor area' and 'land dimensions' are highly correlated since the area is a direct multiple of individual dimensions.</p>
<p>Now, this is a problem for the regression model since what it effectively is trying to do is isolate the individual effects of each feature on the target variable. This is represented by the weights of each feature as shown below.</p>
<p>$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$$</p><p>Therefore, it is highly recommended to verify that there is no collinearity between the individual features within a dataset.</p>
<h3 id="heading-how-does-this-affect-our-model">How does this affect our model?</h3>
<p>It disturbs the best-fit line by impacting the individual coefficients of the variables, which then becomes unreliable.</p>
<h3 id="heading-how-to-detect-multicollinearity">How to detect multicollinearity?</h3>
<ol>
<li><p>Calculating correlation (ρ) between each feature in the dataset.</p>
</li>
<li><p>Variance Inflation Factor (VIF)</p>
</li>
</ol>
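<p>Mirroring the house-price example above (feature names and numbers are hypothetical), here is a sketch of how a near-deterministic relationship between two features produces a very large VIF:</p>

```python
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
floor_area = [random.uniform(50, 200) for _ in range(100)]
# A second feature almost fully determined by the first (hypothetical proxy)
land_size = [1.4 * a + random.gauss(0, 5) for a in floor_area]

r = pearson_r(floor_area, land_size)
# With two predictors, VIF = 1 / (1 - R^2), and here R^2 = r^2
vif = 1 / (1 - r ** 2)  # values above ~5-10 are a common red flag
```

<p>With more than two predictors, the R² in the VIF formula comes from regressing each feature on all the others; statsmodels provides <code>variance_inflation_factor</code> for that general case.</p>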
<h3 id="heading-how-to-remove-multicollinearity">How to remove multicollinearity?</h3>
<ol>
<li><p>Simply removing one of the correlated variables.</p>
</li>
<li><p>Merging them into a single feature can prevent multicollinearity.</p>
<blockquote>
<p>⚠️ <strong>CAUTION!</strong></p>
<p>Merging correlated features into a single feature will only work if the new feature actually has real-world existence or impacts the target variable equally.</p>
</blockquote>
</li>
</ol>
<h2 id="heading-6-exogeneity-or-no-endogeneity">6. <strong>Exogeneity (or</strong> No Endogeneity)</h2>
<p>Exogeneity or no omitted variable bias is the final assumption on our list.</p>
<p>But let’s first understand what omitted variable bias actually is.</p>
<p>If a variable that impacts the target has been omitted from the model, its effect leaks into the error term, causing omitted variable bias, or Endogeneity.</p>
<p>For example, consider the following model.</p>
<p>$$UsedCarPrice_i = \beta_0 + \beta_1(DistanceTravelled)_i + \epsilon_i$$</p><p>Here, the price of a used car is modeled by the distance it has already covered. However, the year of manufacturing impacts both the target variable (Y), the price of the used car, and the X variable, the distance traveled: the older the car, the more likely it has traveled greater distances.</p>
<p>This is a clear case of omitted variable bias and it is undesirable for accurate modeling.</p>
<blockquote>
<p>💡 NOTE<br />Exogeneity in a model means that every feature that impacts the target variable (Y) is captured among the model features (X), so that the error term is uncorrelated with the regressors.</p>
</blockquote>
<h1 id="heading-summary">Summary</h1>
<p>So this was our discussion on the Assumptions of Linear Regression. This is one of the favorite questions of Data Scientist interviewers and now you know how to ace it!</p>
<p>Here is a quick summary of the same.</p>
<ol>
<li><p><strong>Linearity</strong>: There must be a linear relationship between the dependent and independent variables.</p>
</li>
<li><p><strong>Homoscedasticity or Constant Error Variance</strong>: The variance of the errors is constant across all levels of the independent variables.</p>
</li>
<li><p><strong>Independent Error Terms or No Autocorrelation</strong>: The error terms are not correlated with one another.</p>
</li>
<li><p><strong>Normality of Residuals</strong>: The residuals or errors follow a normal distribution.</p>
</li>
<li><p><strong>No multicollinearity</strong>: There exists no correlation between the different independent variables.</p>
</li>
<li><p><strong>Exogeneity (No Endogeneity)</strong>: There must be no relationship between the independent variables and the errors.</p>
</li>
</ol>
<p>Keep this list handy when you prepare for your interviews.</p>
<hr />
<p>Hope you enjoyed this! Feel free to leave your feedback and queries below.</p>
]]></content:encoded></item><item><title><![CDATA[Paraphrase with Transformer Models like T5, BART, Pegasus - Ultimate Guide]]></title><description><![CDATA[Introduction
Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.
With the rise of artificial in...]]></description><link>https://kantcodes.com/paraphrasing-with-transformer-t5-bart-pegasus</link><guid isPermaLink="true">https://kantcodes.com/paraphrasing-with-transformer-t5-bart-pegasus</guid><category><![CDATA[AI]]></category><category><![CDATA[natural language processing]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Thu, 20 Jul 2023 08:46:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1689842686881/71c39fcb-30e8-4219-adaf-60843dbb3f81.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.</p>
<p>With the rise of artificial intelligence (AI), transformer models have emerged as powerful tools for automating and enhancing the paraphrasing process.</p>
<h2 id="heading-understanding-paraphrasing">Understanding Paraphrasing</h2>
<p>As per Oxford, <strong>Paraphrasing</strong> means <em>"to express the meaning of (something written or spoken) using different words, especially to achieve greater clarity"</em>.</p>
<p>Let's look at the below example:</p>
<blockquote>
<p>Original sentence: "The cat is sitting on the mat."<br />Paraphrased sentence: "The mat has a cat sitting on it."</p>
</blockquote>
<p>Both sentences, while constructed differently, convey the same meaning and context. This is paraphrasing.</p>
<h2 id="heading-whats-inside">What's inside 🔍</h2>
<p>In this article, we will explore the world of effective &amp; intelligent paraphrasing with transformer models. We'll dive into the underlying concepts of transformers and their advantages over conventional methods.</p>
<p>Additionally, we'll discuss popular transformer models such as BART, T5, and Pegasus that have been specifically designed for paraphrasing tasks.</p>
<p>By the end of this article, you'll have a comprehensive understanding of how transformer models are revolutionizing paraphrasing, and empowering individuals and industries with their transformative capabilities.</p>
<p>And more importantly, how you can build a nifty transformer for yourself.</p>
<p>Let's embark on this journey to unlock the power of AI in effective paraphrasing! 🚀</p>
<p><strong>NOTE</strong>: This article focuses on applications rather than theory; refer to this article to understand how transformers work internally.</p>
<h2 id="heading-transformer-models-for-paraphrasing">Transformer Models for Paraphrasing</h2>
<p>In the realm of paraphrasing, transformer models offer significant advantages over traditional approaches.</p>
<p>Unlike previous methods that relied heavily on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers employ self-attention mechanisms.</p>
<p>This enables them to focus on relevant words and phrases, facilitating a deeper understanding of the underlying semantics.</p>
<p>With their ability to capture long-range dependencies and contextual information through attention mechanisms, transformers have revolutionized various language-related tasks, including paraphrasing.</p>
<h2 id="heading-popular-transformer-models-for-paraphrasing">Popular Transformer Models for Paraphrasing</h2>
<p>Several popular transformer models have been specifically developed for paraphrasing tasks.</p>
<p>All these transformers can be found in the Huggingface Library. Let's explore:</p>
<h3 id="heading-1-bart-bidirectional-and-auto-regressive-transformer">1. BART (Bidirectional and Auto-Regressive Transformer)</h3>
<p>BART is a powerful transformer model by Facebook AI.</p>
<p>It has been trained using denoising autoencoder objectives and is renowned for its ability to generate high-quality paraphrases.</p>
<p>BART has been trained extensively on large-scale datasets and excels in various NLP tasks, especially paraphrasing.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/facebook/bart-base">https://huggingface.co/facebook/bart-base</a></p>
<h3 id="heading-2-t5-text-to-text-transfer-transformer">2. T5 (Text-To-Text Transfer Transformer)</h3>
<p>T5, developed by Google Research, is a versatile transformer model pre-trained using a text-to-text framework.</p>
<p>While its primary focus is on a wide range of NLP tasks, including translation and summarization, T5 can also be fine-tuned for paraphrasing.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/t5-base">https://huggingface.co/t5-base</a></p>
<h3 id="heading-3-pegasus-paraphrase">3. Pegasus Paraphrase</h3>
<p>Pegasus Paraphrase is specifically trained for executing paraphrasing tasks.</p>
<p>Built upon the Pegasus architecture (originally built for text summarization), it leverages the power of transformer models to generate accurate and contextually appropriate paraphrases.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/tuner007/pegasus_paraphrase">https://huggingface.co/tuner007/pegasus_paraphrase</a></p>
<h2 id="heading-paraphrasing-with-transformers">Paraphrasing with Transformers</h2>
<p>Now let us look at how to paraphrase content with these special transformers and also compare their outputs.</p>
<p>Let's first paraphrase a sentence and then extend that to paraphrase long-form content, which is our main goal.</p>
<h3 id="heading-paraphrasing-a-sentence">Paraphrasing a Sentence</h3>
<p>Let us paraphrase a few random sentences.</p>
<blockquote>
<p>"She was a storm, not the kind you run from, but the kind you chase." - R.H. Sin, Whiskey Words &amp; a Shovel III</p>
<p>"She wasn't looking for a knight, she was looking for a sword." - Atticus</p>
<p>"In the end, we only regret the chances we didn't take." - Unknown</p>
<p>"I dreamt I am running on sand in the night" - Yours truly ;)</p>
<p>"Long long ago, there lived a king and a queen. For a long time, they had no children." - Random text on the internet</p>
<p>"I am typing the best article on paraphrasing with Transformers." - You know who!</p>
</blockquote>
<h4 id="heading-bart">BART</h4>
<p>Here is the code to paraphrase the above sentences with BART.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> BartTokenizer, BartForConditionalGeneration

<span class="hljs-comment"># Load pre-trained BART model and tokenizer</span>
model_name = <span class="hljs-string">'facebook/bart-base'</span>
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

<span class="hljs-comment"># Set up input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>Running the above code, we get the following output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind that you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking at a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: In the end, we only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. For a long time, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am typing the best article on paraphrasing with Transformers.
</code></pre>
<p>We see that BART is not super effective at paraphrasing sentences. Let's try the next transformer.</p>
<h4 id="heading-t5-text-to-text-transfer-transformer">T5 (Text-to-Text Transfer Transformer)</h4>
<p>Here is the code to paraphrase the above sentences with T5.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> T5Tokenizer, T5ForConditionalGeneration

<span class="hljs-comment"># Load pre-trained T5 Base model and tokenizer</span>
tokenizer = T5Tokenizer.from_pretrained(<span class="hljs-string">"t5-base"</span>, model_max_length=<span class="hljs-number">1024</span>)
model = T5ForConditionalGeneration.from_pretrained(<span class="hljs-string">"t5-base"</span>)

<span class="hljs-comment"># Set up input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>And here's the output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking for a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. Long long ago, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: Today I am typing the best article on paraphrasing with Transformers.
</code></pre>
<p>As we can see, T5's output differs slightly from BART's, but shows no significant improvement.</p>
<h4 id="heading-pegasus-paraphrase">Pegasus Paraphrase</h4>
<p>Finally, let's go over the code for Pegasus Paraphrase.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> PegasusTokenizer, PegasusForConditionalGeneration

<span class="hljs-comment"># load pre-trained Pegasus Paraphrase model and tokenizer</span>
tokenizer = PegasusTokenizer.from_pretrained(<span class="hljs-string">"tuner007/pegasus_paraphrase"</span>)
model = PegasusForConditionalGeneration.from_pretrained(<span class="hljs-string">"tuner007/pegasus_paraphrase"</span>)

<span class="hljs-comment"># input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>Here's the output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She was looking for a sword, not a knight.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I ran on the sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: They had no children for a long time.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am writing the best article on the subject.
</code></pre>
<p>We can observe a significant improvement in the output with Pegasus Paraphrase.</p>
<p>Comparing the outputs of all three transformer models, we can declare Pegasus Paraphrase the clear winner.</p>
<h3 id="heading-paraphrasing-a-paragraph">Paraphrasing a Paragraph</h3>
<p>With our testing out of the way, we've finalized Pegasus Paraphrase as our choice of transformer for this task.</p>
<p>Now let's see how we can paraphrase paragraphs and long chunks of texts with it.</p>
<p>Theoretically, there are three main ways to paraphrase whole paragraphs.</p>
<h4 id="heading-1-adjusting-the-input-length"><strong>1. Adjusting the input length</strong></h4>
<p>By default, the maximum input length for Pegasus Paraphrase is set to a certain number of tokens. If the input paragraph exceeds this limit, it might be truncated, leading to incomplete paraphrasing.</p>
<p>Here we split the longer text into smaller chunks and run them through the model individually, then combine the paraphrased results afterward.</p>
<h4 id="heading-2-use-a-sliding-window-approach"><strong>2. Use a sliding window approach</strong></h4>
<p>Here we take a fixed-sized window and slide it over the input paragraph, generating paraphrases for each window. This way, we ensure that the entire paragraph is covered, albeit with overlapping segments.</p>
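<p>As a minimal sketch of the idea (the window and stride sizes here are illustrative, not tuned values), overlapping windows can be generated like this:</p>

```python
# A minimal sketch of the sliding-window idea: break a long text into
# overlapping windows of words. Window and stride sizes are illustrative.
def sliding_windows(text, window_size=20, stride=10):
    words = text.split()
    windows = []
    for start in range(0, len(words), stride):
        chunk = words[start:start + window_size]
        if chunk:
            windows.append(" ".join(chunk))
        # stop once a window has reached the end of the text
        if start + window_size >= len(words):
            break
    return windows
```

<p>Each window would then be passed through the paraphrase model individually, and the overlapping outputs merged afterward.</p>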
<h4 id="heading-3-optimizing-the-beam-search"><strong>3. Optimizing the Beam Search</strong></h4>
<p>Beam search is a decoding algorithm that helps in generating diverse outputs from the model.</p>
<p>By default, the model uses beam search with a beam width of 4. We can try to increase the beam width to encourage more exploration and potentially improve the quality of paraphrased outputs for longer texts.</p>
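<p>To see why a wider beam explores more candidates, here is a toy beam search over a hypothetical next-token score table. This is not Pegasus's actual decoder, just the shape of the algorithm:</p>

```python
import math

# Hypothetical per-step log-probabilities for a 3-token vocabulary.
# In a real model, these would come from the decoder at every step.
LOG_PROBS = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

def beam_search(steps, beam_width):
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, lp in LOG_PROBS.items():
                candidates.append((seq + [token], score + lp))
        # Keep only the `beam_width` highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_seq, best_score = beam_search(steps=2, beam_width=4)[0]
```

<p>A width of 1 reduces to greedy decoding; widening the beam keeps more partial hypotheses alive at each step, which is what <code>num_beams</code> controls in <code>model.generate</code>.</p>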
<p>If none of these approaches gives us satisfactory results, we can look at fine-tuning the model, but that's a discussion for another time.</p>
<p>In my research and experimentation, I've found that 'Adjusting the input length' gives us the best output. So let's go ahead and implement that.</p>
<p>For a view on challenges with other methods, take a look at the experimentation notebook here.</p>
<p>{insert link to notebook}</p>
<p>Let's paraphrase a paragraph from 'The Hound of the Baskervilles', one of the most popular <em>Sherlock Holmes</em> stories by <em>Sir Arthur Conan Doyle</em>.</p>
<blockquote>
<p>"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."</p>
</blockquote>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> PegasusForConditionalGeneration, PegasusTokenizer

<span class="hljs-comment"># Load the Pegasus Paraphrase model and tokenizer</span>
model_name = <span class="hljs-string">"tuner007/pegasus_paraphrase"</span>
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

<span class="hljs-comment"># function to paraphrase long texts by adjusting the input length</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">paraphrase_paragraph</span>(<span class="hljs-params">text</span>):</span>

    <span class="hljs-comment"># Split the text into sentences</span>
    sentences = text.split(<span class="hljs-string">"."</span>)
    paraphrases = []

    <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
        <span class="hljs-comment"># Clean up sentences</span>

        <span class="hljs-comment"># remove extra whitespace</span>
        sentence = sentence.strip()

        <span class="hljs-comment"># filter out empty sentences</span>
        <span class="hljs-keyword">if</span> len(sentence) == <span class="hljs-number">0</span>:
            <span class="hljs-keyword">continue</span>

        <span class="hljs-comment"># Tokenize the sentence</span>
        inputs = tokenizer.encode_plus(sentence, return_tensors=<span class="hljs-string">"pt"</span>, truncation=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">512</span>)

        input_ids = inputs[<span class="hljs-string">"input_ids"</span>]
        attention_mask = inputs[<span class="hljs-string">"attention_mask"</span>]

        <span class="hljs-comment"># paraphrase</span>
        paraphrase = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=<span class="hljs-number">4</span>,
            max_length=<span class="hljs-number">100</span>,
            early_stopping=<span class="hljs-literal">True</span>
        )[<span class="hljs-number">0</span>]
        paraphrased_text = tokenizer.decode(paraphrase, skip_special_tokens=<span class="hljs-literal">True</span>)

        paraphrases.append(paraphrased_text)

    <span class="hljs-comment"># Combine the paraphrases</span>
    combined_paraphrase = <span class="hljs-string">" "</span>.join(paraphrases)

    <span class="hljs-keyword">return</span> combined_paraphrase

<span class="hljs-comment"># Example usage</span>
text = <span class="hljs-string">"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."</span>
paraphrase = paraphrase_paragraph(text)
print(paraphrase)
</code></pre>
<p>Here we've split the paragraph into smaller chunks (individual sentences), paraphrased each chunk, and then combined the individual outputs back into a paragraph.</p>
<p>And below is the output.</p>
<blockquote>
<p>As Sir Henry and I sat at breakfast, the sunlight flooded in through the high windows, causing watery patches of color from the coats of arms. The dark panelling glowed like bronze in the golden rays, and it was hard to see that it was the chamber which had struck such a gloom into our souls the evening before. The evening before, Sir Henry's nerves were still handled and he came to breakfast, his cheeks flushed from the excitement of the early chase.</p>
</blockquote>
<h2 id="heading-concluding-thoughts">Concluding thoughts</h2>
<p>Throughout this article, we have explored the world of effective paraphrasing with transformer models, and we have seen how to build a paraphraser with Transformer models from Hugging Face.</p>
<p>Transformer models have brought about a paradigm shift in paraphrasing, empowering individuals and industries with their transformative capabilities. By harnessing the power of transformer models, we can unlock new possibilities in effective communication, content creation, academic writing, and language translation.</p>
<p>As the field of transformer-based paraphrasing continues to evolve, there are exciting opportunities for further exploration and adoption of these technologies.</p>
<p>Researchers and practitioners are encouraged to delve deeper into fine-tuning strategies, data augmentation techniques, and evaluation methodologies to advance the state-of-the-art in paraphrase generation.</p>
<p>Additionally, the ethical implications of using transformer models for paraphrasing should be considered. Careful attention should be given to biases and fairness to ensure equitable and responsible deployment of these technologies.</p>
<p>Let me know your thoughts and any feedback in the comments.</p>
<p>Until next time ... Ciao!</p>
]]></content:encoded></item><item><title><![CDATA[How to split your dataset into train, test, and validation sets?]]></title><description><![CDATA[Introduction
If you’ve been using the train_test_split method by sklearn to create the train, test, and validation datasets, then I know your pain.


Splitting datasets into the test, train, and validation datasets

While sklearn certainly provides u...]]></description><link>https://kantcodes.com/split-dataset-into-train-test-validation</link><guid isPermaLink="true">https://kantcodes.com/split-dataset-into-train-test-validation</guid><category><![CDATA[data analysis]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 25 Apr 2023 14:52:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oAa4BY3b-vo/upload/2db868572d3eca404d50d33da9b89a9a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>If you’ve been using the <code>train_test_split</code> method by <code>sklearn</code> to create the train, test, and validation datasets, then I know your pain.</p>
<blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*NQaN71ejH_eTUxhRLwiJcA.png" alt /></p>
<p>Splitting datasets into the test, train, and validation datasets</p>
</blockquote>
<p>While <code>sklearn</code> certainly provides us with a way to achieve our objective, it is a long-drawn-out procedure: we have to repeat the process twice, adjusting the split ratio at every step.</p>
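<p>To make the pain concrete, here is a sketch of that two-step <code>sklearn</code> procedure on a toy dataset. Note how the second split's ratio must be recomputed against the remainder:</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset: 10 rows, one target column (illustrative data)
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
X, y = df.drop(columns="y"), df["y"]

# Step 1: carve out the test set (20% of the original data)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remainder into train and validation sets.
# To keep 20% of the ORIGINAL data as validation, the ratio must be
# adjusted relative to the remainder: 0.2 / 0.8 = 0.25.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
```

<p>Two calls, one manual ratio adjustment, and the <code>X</code>/<code>y</code> separation still done by hand.</p>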
<p><strong>But rejoice,</strong> <code>fast_ml</code> <strong>is here!</strong></p>
<p>It offers a straightforward, to-the-point method to obtain the three datasets with a single line of code.</p>
<p>It is the <code>train_valid_test_split</code> method!</p>
<p>It not only splits the data as we require but also separates the dependent variable <code>y</code> from the independent variables <code>X</code> in the same line of code.</p>
<h2 id="heading-code-walkthrough">Code walkthrough</h2>
<p>Let’s check out how it’s done (<a target="_blank" href="https://github.com/utkarshkant/25-short-code-snippets_Python">notebook</a>)!</p>
<p><strong><em>Step 1:</em></strong> Install the <code>fast_ml</code> library and import the necessary packages and methods</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426224027/00c5633e-53c8-4594-b459-8b5c51602e45.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p><strong><em>Step 2:</em></strong> Load the dataset into a pandas data frame.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426872886/796064bf-fb72-4eb2-96e4-f79b988dfd5f.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p><strong><em>Step 3:</em></strong> Split the dataset</p>
<p>Once the data is loaded and ready to split, simply call the <code>train_valid_test_split</code> method and pass the dataset with the supporting parameters as below.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426970820/a485c4a9-65d7-4d09-922f-020cdc5f6cf7.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p>The datasets have been successfully split into train, test, and validation datasets. 🎉</p>
<blockquote>
<p><strong>💡 NOTE</strong><br />The split datasets retain their original index and resetting it is an optional step.</p>
</blockquote>
<p>You can now proceed with your modeling.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thanks to the team at <code>fast_ml</code>, the long-drawn-out task of splitting our dataset into independent and dependent features and then into training, testing, and validation datasets has been condensed into a single line of code. ⚡</p>
<p>You can find this notebook here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/utkarshkant/25-short-code-snippets_Python/blob/master/train_valid_test_split.ipynb">https://github.com/utkarshkant/25-short-code-snippets_Python/blob/master/train_valid_test_split.ipynb</a></div>
<p> </p>
<p>Let me know how you liked this quick article in the comments below, and feel free to reach out!</p>
]]></content:encoded></item><item><title><![CDATA[Data Made Easy: A Comprehensive Guide for Beginners]]></title><description><![CDATA[Introduction
Data is all around us, from the information we process every day to the data collected by businesses to make informed decisions.
Businesses today are thriving on the data that they have collected over the years. This data is then utilize...]]></description><link>https://kantcodes.com/data-complete-guide</link><guid isPermaLink="true">https://kantcodes.com/data-complete-guide</guid><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 18 Apr 2023 05:02:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/KgLtFCgfC28/upload/dc0a5ddb105b0a78e2a18e0b1c345ccf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Data is all around us, from the information we process every day to the data collected by businesses to make informed decisions.</p>
<p>Businesses today are thriving on the data that they have collected over the years. This data is then utilized intelligently to make informed business decisions.</p>
<p>But understanding the fundamentals of data itself and then utilizing it can be a daunting task, especially for beginners.</p>
<p>That's where this comprehensive guide comes in. We'll break down the concept of data at its most fundamental level, giving you the tools and techniques you need to handle it like a pro.</p>
<p>So let's dive in and make data easy!</p>
<h2 id="heading-so-what-is-data">So, what is data?</h2>
<blockquote>
<p>💡 All <strong>information</strong> essentially can be classified as <strong>data</strong>.</p>
</blockquote>
<p>It can come in multiple different forms, shapes, and sizes. It can be in the form of numbers, text, images, videos, and much more.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681124897398/212b48ef-4c74-44e8-a8dc-f909ea673db4.png" alt class="image--center mx-auto" /></p>
<p>Defining data</p>
</blockquote>
<h2 id="heading-isnt-data-a-simple-concept-why-should-we-learn-more-about-it">Isn’t data a simple concept? Why should we learn more about it?</h2>
<p>By now we know that all information is data. And from our discussion on statistics, we also know that</p>
<blockquote>
<p>💡 Data lies at the heart of any analytical solution. Therefore, <strong>without data, there is no statistic</strong>. And without statistics, there is no analysis.</p>
</blockquote>
<p>The first and most crucial step of solving any problem, be it statistics, analytics, data science, machine learning, etc., is to understand the data at hand.</p>
<h2 id="heading-different-types-of-data">Different types of data</h2>
<p>We spoke about the different forms of data. There are also a few different ways of classifying it, each serving a specific purpose.</p>
<p>Let’s go over the most popular types of data and see how they are classified.</p>
<p>The two major types of data are:</p>
<h3 id="heading-1-unstructured-data"><strong>1 — Unstructured data</strong></h3>
<blockquote>
<p>💡 As the name suggests, this type of data cannot be organized into a structure or a data model.</p>
</blockquote>
<p>Some of the popular examples are images, heatmaps, videos, spatial data, graph data, text documents, etc.</p>
<p>Unstructured data is not easily identifiable or interpretable by either humans or machines; machines need specialized techniques to process it.</p>
<p>By now, you must have realized that this type of data is a bad fit for traditional relational (SQL) databases.</p>
<h3 id="heading-2-structured-data"><strong>2 — Structured data</strong></h3>
<blockquote>
<p>💡 On the other hand, this type of data can be organized into a defined structure (as the name suggests)</p>
</blockquote>
<p>These are more commonly used in industrial settings, and one of the most common forms of structured data is the 2-dimensional data structure, that is, the humble table, also known as <strong>rectangular data</strong>.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125033936/219402a5-808b-4797-9720-0395c72bd75d.png" alt class="image--center mx-auto" /></p>
<p>A simple table capturing exam results of different students</p>
</blockquote>
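<p>As a quick sketch, a table like the one above maps directly to a pandas DataFrame (the names and marks here are made up for illustration):</p>

```python
import pandas as pd

# Illustrative rectangular data: each row is a record (a student),
# each column is a feature (name, marks, grade).
results = pd.DataFrame({
    "Name": ["Asha", "Ben", "Carlos"],
    "Marks": [88, 67, 45],
    "Grade": ["A", "B", "D"],
})

print(results.shape)  # (3, 3): 3 rows by 3 columns
```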
<p>I'm sure you can think of enough use cases from your own life where you have used Excel spreadsheets to store some information.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125060228/ff9cb65b-6f18-4e27-8836-df908418ce1a.png" alt class="image--center mx-auto" /></p>
<p>Another popular example of rectangular data is the <strong>Titanic dataset</strong> [<a target="_blank" href="https://www.kaggle.com/c/titanic/data">Source</a>]</p>
</blockquote>
<p>Structured data is further classified into a few different types of data. They are:</p>
<ol>
<li><p>Categorical data</p>
</li>
<li><p>Numerical data</p>
</li>
</ol>
<p>And even the above types of data can be classified further into different data types. Let's look at a complete breakdown before proceeding with each type of data.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125110224/b193acad-2dfd-4819-b2f7-8f8829943336.png" alt class="image--center mx-auto" /></p>
<p>Different types of data</p>
</blockquote>
<h4 id="heading-21-categorical-data">2.1 — Categorical data</h4>
<blockquote>
<p>💡 The type of data that can be categorized [Genius <em>🕵️‍♂️].</em></p>
</blockquote>
<p>Now consider the dataset of students’ exam results. Depending on the grade, all students with grades other than <strong><em>F</em></strong> are deemed to have passed the examination.</p>
<p>So we add another column with the Passing status of each student. The column <strong><em>Pass/Fail</em></strong> has only one of the two entries, it can either be a <strong><em>Pass</em></strong> or a <strong><em>Fail</em></strong>.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F0eeb2c21-50d6-40fc-bffe-c9fc6442bdc0%2F1_WoqvQUDCRVumDhNVyoWZIQ.png?id=1fb5a3d7-7036-4372-8654-37de90e9f423&amp;table=block" alt="Exam results of different students" /></p>
<p>Exam results of different students</p>
</blockquote>
<p>Similarly, there are many instances where the entry in each row is one of the few available options. For example:</p>
<ul>
<li><p><strong>Binary data</strong>: <em>True</em> or <em>False</em>, <em>Yes</em> or <em>No</em>, <em>0</em> or <em>1</em></p>
</li>
<li><p><strong>Exam grades</strong>: <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>, and <em>F</em></p>
</li>
<li><p><strong>Laptop brands</strong>: Asus, Lenovo, Macbook, Dell, IBM, etc.</p>
</li>
</ul>
<blockquote>
<p>⚠️ The entry for a categorical data record can only be one of the available options. For example, it can be either True or False, but not both.</p>
</blockquote>
<p>Now there are 2 important types of categorical data as well, and they are:</p>
<h5 id="heading-211-nominal-data"><strong>2.1.1 — Nominal data</strong></h5>
<blockquote>
<p>💡 Type of categorical data that has <strong>no internal order</strong> or precedence amongst the different categories. The categories <strong>cannot be ranked</strong> one over the other.</p>
</blockquote>
<p>For example, in binary data like True or False, Male or Female, one category is not more important than the other.</p>
<blockquote>
<p>⚠️ There can be some <strong>exceptions</strong> here as well, refer to the upcoming exercise section of this article for an explanation.</p>
</blockquote>
<p>Another example would be subjects like English, Mathematics, Science, History, etc. As long as they carry equal weightage, one cannot have more importance than the other.</p>
<h5 id="heading-212-ordinal-data"><strong>2.1.2 — Ordinal data</strong></h5>
<p>Here the categories have an inherent order, and that order matters.</p>
<p>For example, grades in exam results can be ordered as <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>, &amp; <em>F</em>, from higher rank to lower. Another simple ranking could be in the cloth sizes, which may range from <em>XS</em>, <em>S</em>, <em>M</em>, <em>L</em>, to <em>XL</em>.</p>
<blockquote>
<p>⚠️ <strong>SPOILER ALERT! 🤖</strong> The knowledge of Nominal and Ordinal datatypes becomes very critical during encoding for machine learning problems.</p>
</blockquote>
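<p>In pandas, the nominal/ordinal distinction can be made explicit with an ordered <code>Categorical</code>. A small sketch using the cloth-sizes example:</p>

```python
import pandas as pd

# Ordinal data: clothing sizes have a meaningful internal order
sizes = pd.Categorical(
    ["M", "XS", "L", "S", "XL"],
    categories=["XS", "S", "M", "L", "XL"],
    ordered=True,
)

# The declared order permits comparisons that would be
# meaningless for nominal data
print(sizes.min(), sizes.max())  # XS XL
```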
<h4 id="heading-22-numerical-data">2.2 — Numerical data</h4>
<blockquote>
<p>💡 Unlike categorical data, which takes one of a fixed set of labels, <strong>numerical data</strong> is measured on a numeric scale and can have any numerical value.</p>
</blockquote>
<p>It can be either integers or real numbers. For example, the students’ marks in exams, the speed of a car, the length of a video, height, weight, etc.</p>
<p>Again, there are two major types of Numerical data, and they are:</p>
<h5 id="heading-221-discrete-data"><strong>2.2.1 — Discrete data</strong></h5>
<blockquote>
<p>💡 When the data records can be counted &amp; expressed only in <strong>whole numbers</strong>, it is called <strong>discrete data</strong>.</p>
</blockquote>
<p>For example, the number of children in a class, the number of cars owned by a person, the number of working days in a month, and many more.</p>
<h5 id="heading-222-continuous-data"><strong>2.2.2 — Continuous data</strong></h5>
<blockquote>
<p>💡 When the data records can take infinitely many values, expressed as <strong>real numbers</strong> to arbitrarily many decimal places, the data is known as continuous data.</p>
</blockquote>
<p>For example, exact height, weight, and constants like <strong>π</strong> cannot be recorded with complete accuracy in a fixed number of decimal places.</p>
<h2 id="heading-some-special-data-types">Some special data types</h2>
<p>Apart from the above, there are a few other important data types that you should know.</p>
<h3 id="heading-1-time-series"><strong>1. Time series</strong></h3>
<blockquote>
<p>💡 Anything measured over time is <strong>time series</strong>.</p>
</blockquote>
<p>For example, daily or monthly stock prices, daily weather, hourly sea level, speed of a vehicle at every minute, etc.</p>
<p>More often than not, time series is structured data.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fe586a869-7c3e-4139-a342-0d45bc8ca62d%2F1_i8AjCOStPZaCewhKgEUWkg.png?id=0da04146-9fbb-4349-8f81-7c2b1147bfe1&amp;table=block" alt="An example of time-series data. Records of product demand, precipitation, &amp; temperature over the years." /></p>
<p>An example of time-series data. Records of product demand, precipitation, &amp; temperature over the years.</p>
</blockquote>
<h3 id="heading-2-text-data"><strong>2. Text data</strong></h3>
<blockquote>
<p>💡 Text data usually consists of documents containing words, sentences, and paragraphs of free-flowing text.</p>
</blockquote>
<p>It can be in any language. And is mostly unstructured.</p>
<p>A good example is the product reviews on Amazon, which can be utilized for sentiment classification. Or email contents that enable machine learning algorithms to detect spam emails from the rest.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fdb8b2aac-0529-48a1-b8ab-dbf51261aeee%2F1_DlRjl8PNH7-m8CrLPIIt7Q.png?id=9c22080f-418e-460c-a31d-db6b5f26f341&amp;table=block" alt="After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, &amp; Promotions based on the text data in the email contents." /></p>
<p>After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, &amp; Promotions based on the text data in the email contents.</p>
</blockquote>
<h3 id="heading-3-image-andamp-video-data"><strong>3. Image &amp; Video data</strong></h3>
<blockquote>
<p>💡 This is the graphic or pictorial data like images or drawings.</p>
</blockquote>
<p>This finds a great use case in object detection, self-driving cars, etc. It is a form of unstructured data as well.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fadec1039-159b-441b-b505-b53172f68a73%2F0_tizHOacaczfaol7u.jpg?id=749d1100-a217-4315-a867-a1dacbde1b91&amp;table=block" alt="Object detection from live footage." /></p>
<p>Object detection from live footage.</p>
</blockquote>
<h3 id="heading-4-audio-data">4. Audio data</h3>
<blockquote>
<p>💡 Any information recorded in the audio format is data.</p>
</blockquote>
<p>Another popular data format is audio, which is widely used in machine learning applications. Apps like <strong><em>Shazam</em></strong> are a great example.</p>
<p>Be it a song, a speech, or an audiobook, any information recorded in the audio format can be used as data.</p>
<h2 id="heading-assignment">Assignment</h2>
<p>Now that we have quickly understood so many different concepts, let’s strengthen our understanding with these fun exercises.</p>
<h3 id="heading-assignment-1"><strong>Assignment 1</strong></h3>
<p>Let’s classify each variable in the Titanic dataset into its correct data type.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fa2f6872e-2d9e-487d-b9d7-278749de2e50%2FUntitled.png?id=f78c9c8b-eaab-4b74-b6d8-5dfaab22f298&amp;table=block" alt="Titanic dataset" /></p>
<p>Titanic dataset [<a target="_blank" href="https://www.kaggle.com/c/titanic/data">Source</a>]</p>
</blockquote>
<ul>
<li><p><strong><em>PassengerId</em></strong>: The unique Id for each passenger. This is numerical data and is discrete.</p>
</li>
<li><p><strong><em>Survived</em></strong>: Passenger survived or not, 0 = No, 1 = Yes. This is categorical data and nominal.</p>
</li>
<li><p><strong><em>Pclass</em></strong>: This is the ticket type, class 1 = 1st, 2 = 2nd, 3 = 3rd. This is categorical data and ordinal.</p>
</li>
<li><p><strong><em>Name</em></strong>: Name of the passenger. This is text data.</p>
</li>
<li><p><strong><em>Sex</em></strong>: Gender of the passenger. This is binary categorical data and ordinal.</p>
</li>
</ul>
<blockquote>
<p>⚠️ <strong>IMPORTANT</strong></p>
<p>In some cases, even data records like sex or gender can be ordinal, and this is one such case. This is because the captain of the ship explicitly issued an order for women and children to be saved first. As a result, the survival rate for women was three times higher than for men [<a target="_blank" href="https://www.newscientist.com/article/dn22119-sinking-the-titanic-women-and-children-first-myth/">Source</a>]. Therefore, while modeling, the algorithm can give a slightly higher preference to females while predicting the survival status.</p>
</blockquote>
<p>Similarly, we can analyze the rest of the variables of this dataset. I will leave this exercise for you to complete.</p>
<h3 id="heading-assignment-2"><strong>Assignment 2</strong></h3>
<p>Consider a video being streamed on YouTube. Multiple different data points are being recorded in real time simultaneously.</p>
<p>Some of them are video, audio, images, resolutions, time stamps, total people watching at each timestamp, likes, dislikes, text data from the continuous chat, number of comments, transactions, engagement, and much more.</p>
<p>Your task is to classify each feature being recorded into its correct class. Do share your observations in the comments section.</p>
<h2 id="heading-summary">Summary</h2>
<p>So let’s summarize what we discussed today.</p>
<ol>
<li><p>Data is everywhere and every recorded information is data</p>
</li>
<li><p>Data lies at the heart of any statistical analysis</p>
</li>
<li><p>Different types of data, a quick breakdown is below.</p>
</li>
</ol>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F28c5a044-c1b7-4048-afc0-10e535052ad3%2F1_8yjqvgia2bFEpkbgc2hyGA.png?id=381caa63-4fa1-4cfe-a827-b44d7206f8b3&amp;table=block" alt="Different types of data" /></p>
<p>Different types of data</p>
</blockquote>
<h2 id="heading-how-a-machine-reads-data">How does a machine read data?</h2>
<p>Finally, having understood the foundations and the different types of data, let's look at how machines read and interpret data, as opposed to humans.</p>
<p>Foundationally, no matter the type of data, the machines can only ingest 0s and 1s. Therefore, for us to train a machine learning model on our data, we must convert it into 0s and 1s.</p>
<p>Be it images, text, audio, or any other data type, everything has to be converted into 0s and 1s (or numeric) before feeding it to the machine.</p>
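<p>For instance, even a short piece of text reaches the machine as numbers and, ultimately, bits:</p>

```python
# Every character maps to a number (its Unicode code point),
# and every number has a binary representation of 0s and 1s.
text = "Hi"
codes = [ord(ch) for ch in text]          # [72, 105]
bits = [format(c, "08b") for c in codes]  # ['01001000', '01101001']
print(codes, bits)
```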
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681126358562/626263b5-2cc5-4ad5-a2db-bd6ff580c051.png" alt class="image--center mx-auto" /></p>
<p>Converting image to 0s and 1s</p>
</blockquote>
<p>NOTE: We will look at multiple examples in the coming discussions where we build machine-learning models on different types of data.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I am certain that this discussion will help you better understand your data at a more fundamental level, which will refine your analysis.</p>
<p>Feel free to share your feedback or queries in the comments below.</p>
]]></content:encoded></item></channel></rss>