<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[KantCodes.com]]></title><description><![CDATA[Data Scientist by the day . . . Batman by Knight!]]></description><link>https://kantcodes.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 11:04:59 GMT</lastBuildDate><atom:link href="https://kantcodes.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The Complete Guide to Encoding Categorical Features]]></title><description><![CDATA[Introduction
In the world of data analysis and machine learning, data comes in all shapes and sizes.
Categorical data is one of the most common forms of data that you will encounter in your data science journey. It represents discrete, distinct categ...]]></description><link>https://kantcodes.com/complete-guide-to-encoding-categorical-features</link><guid isPermaLink="true">https://kantcodes.com/complete-guide-to-encoding-categorical-features</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[exploratory data analysis]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 30 Jan 2024 06:00:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GnvurwJsKaY/upload/e3a471123a78381ccc51458ec9ab8820.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the world of data analysis and machine learning, data comes in all shapes and sizes.</p>
<p><strong>Categorical data</strong> is one of the most common forms of data that you will encounter in your data science journey. It represents discrete, distinct categories or labels, and it's an essential part of many real-world datasets.</p>
<p>In this article, we will discuss the best techniques to encode categorical features in great detail along with their code implementations. We will also discuss the best practices and how to select the right encoding technique.</p>
<p>The objective of this article is to serve as a ready reference for whenever you wish to encode categorical features in your dataset.</p>
<h2 id="heading-why-do-we-need-to-encode-categorical-features">Why do we need to Encode Categorical Features?</h2>
<p>Many machine learning algorithms require numerical input.</p>
<p>Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.</p>
<h2 id="heading-types-of-categorical-features">Types of Categorical Features</h2>
<p>Categorical features are encoded based on their type and function. They can be broadly divided into two categories: <strong>Nominal</strong> and <strong>Ordinal</strong>.</p>
<p><a target="_blank" href="https://kantschants.com/data-complete-guide"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125110224/b193acad-2dfd-4819-b2f7-8f8829943336.png?auto=compress,format&amp;format=webp" alt /></a></p>
<ol>
<li><h3 id="heading-nominal-categorical-features">Nominal Categorical Features</h3>
</li>
</ol>
<p>Nominal features are those where the categories have no inherent order or ranking.</p>
<p>For example, the colors of cars (red, blue, green) are nominal because there's no natural order to them.</p>
<ol start="2">
<li><h3 id="heading-ordinal-categorical-features">Ordinal Categorical Features</h3>
</li>
</ol>
<p>Ordinal features are those where the categories have a meaningful order or rank.</p>
<p>Think of education levels (high school, bachelor's, master's, Ph.D.), for which there is a clear ranking.</p>
<blockquote>
<p>Learn more about categorical data &amp; other types of data from the below resource.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://kantschants.com/data-complete-guide#heading-21-categorical-data">https://kantschants.com/data-complete-guide#heading-21-categorical-data</a></div>
<p> </p>
</blockquote>
<h2 id="heading-challenges-with-categorical-features">Challenges with Categorical Features</h2>
<p>Categorical data brings its own set of challenges when it comes to data analysis and machine learning. Here are some key challenges:</p>
<ul>
<li><p><strong>Numerical Requirement</strong>: Many machine learning algorithms require numerical input. Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.</p>
</li>
<li><p><strong>Curse of Dimensionality</strong>: One-hot encoding, a common technique, can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.</p>
</li>
<li><p><strong>Multicollinearity</strong>: In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.</p>
</li>
<li><p><strong>Data Sparsity</strong>: When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.</p>
</li>
</ul>
<h2 id="heading-what-we-will-cover-today">What will we cover today?</h2>
<p>The encoding techniques we will discuss today are listed below:</p>
<ol>
<li><p>Label Encoding</p>
</li>
<li><p>One-hot Encoding</p>
</li>
<li><p>Binary Encoding</p>
</li>
<li><p>Ordinal Encoding</p>
</li>
<li><p>Frequency Encoding or Count Encoding</p>
</li>
<li><p>Target Encoding or Mean Encoding</p>
</li>
<li><p>Feature Hashing or Hashing Trick</p>
</li>
</ol>
<p>Let us discuss each in detail.</p>
<ol>
<li><h2 id="heading-label-encoding">Label Encoding</h2>
</li>
</ol>
<p>Label encoding is one of the fundamental techniques for converting categorical data into a numerical format.</p>
<p>It is a simple yet effective method that assigns a unique integer to each category in a feature.</p>
<h3 id="heading-how-it-works">How it works</h3>
<p>Imagine a feature 'Size' that has the following labels: 'Small', 'Medium', and 'Large'. This is an ordinal categorical feature as there is an inherent order in the labels.</p>
<p>We can encode these labels as follows:</p>
<ul>
<li><p>Small → 0</p>
</li>
<li><p>Medium → 1</p>
</li>
<li><p>Large → 2</p>
</li>
</ul>
<h3 id="heading-code-implementation">Code Implementation</h3>
<p>Let us look at the code implementation for Label Encoding.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> LabelEncoder

<span class="hljs-comment"># Sample data</span>
data = [<span class="hljs-string">"Small"</span>, <span class="hljs-string">"Medium"</span>, <span class="hljs-string">"Large"</span>, <span class="hljs-string">"Medium"</span>, <span class="hljs-string">"Small"</span>]
print(data)     <span class="hljs-comment"># Output: ['Small', 'Medium', 'Large', 'Medium', 'Small']</span>

<span class="hljs-comment"># Initialize the label encoder</span>
label_encoder = LabelEncoder()

<span class="hljs-comment"># Fit and transform the data</span>
<span class="hljs-comment"># NOTE: LabelEncoder assigns integers in alphabetical order of the labels</span>
<span class="hljs-comment"># ('Large' -&gt; 0, 'Medium' -&gt; 1, 'Small' -&gt; 2), not by ordinal rank</span>
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  <span class="hljs-comment"># Output: [2 1 0 1 2]</span>
</code></pre>
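<p>Note that scikit-learn's <code>LabelEncoder</code> assigns integers in alphabetical order of the labels, so it will not reproduce the intended Small → 0, Medium → 1, Large → 2 ranking. If you need the integers to follow a specific order, a plain dictionary map is a simple alternative (a minimal sketch):</p>

```python
import pandas as pd

# Sample data
data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Explicit mapping that follows the intended rank
size_mapping = {'Small': 0, 'Medium': 1, 'Large': 2}
data['Encoded_Size'] = data['Size'].map(size_mapping)
print(data['Encoded_Size'].tolist())  # Output: [0, 1, 2, 1, 0]
```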
<h3 id="heading-when-to-use-label-encoding">When to Use Label Encoding?</h3>
<p>Label encoding is a suitable choice for:</p>
<ul>
<li><p>Ordinal data or features with a clear and meaningful order.</p>
</li>
<li><p>Cases where you want to avoid increasing the dimensionality of the dataset.</p>
</li>
</ul>
<ol start="2">
<li><h2 id="heading-one-hot-encoding-or-dummy-encoding">One-Hot Encoding or Dummy Encoding</h2>
</li>
</ol>
<p>One-hot encoding, also popularly known as dummy encoding, is a widely used technique for converting categorical data into a numerical format.</p>
<p>It's particularly suitable for nominal categorical features, where the categories have no inherent order or ranking.</p>
<h3 id="heading-how-it-works-1">How it works</h3>
<p>One-hot encoding transforms each label (or category) in a categorical feature into a binary column.</p>
<p>Each binary column corresponds to a specific category and indicates the presence (1) or absence (0) of that category in the original feature.</p>
<p>For example, consider a categorical feature "Color" with three labels: "Red," "Green," and "Blue." One-hot encoding would create three binary columns like this:</p>
<ul>
<li><p>"Red" → [1, 0, 0]</p>
</li>
<li><p>"Green" → [0, 1, 0]</p>
</li>
<li><p>"Blue" → [0, 0, 1]</p>
</li>
</ul>
<h3 id="heading-code-implementation-1">Code Implementation</h3>
<p>Let us look at the code implementation for One-Hot Encoding.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Color'</span>: [<span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>, <span class="hljs-string">'Blue'</span>, <span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>]})

<span class="hljs-comment"># Perform one-hot encoding</span>
encoded_data = pd.get_dummies(data, columns=[<span class="hljs-string">'Color'</span>])
</code></pre>
<p>The output will look like below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706537791318/4d84f1e0-a5f9-4063-b231-72acdc4932ed.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-advantages-of-one-hot-encoding">Advantages of One-Hot Encoding</h3>
<p>The primary advantage of one-hot encoding is that it maintains the distinctiveness of labels and prevents any unintended ordinality.</p>
<p>Each label becomes a separate feature, and the presence or absence of a category is explicitly represented.</p>
<h3 id="heading-when-to-use">When to Use?</h3>
<p>One-hot encoding is an appropriate choice when:</p>
<ul>
<li><p>Dealing with nominal data with no meaningful order among labels.</p>
</li>
<li><p>Maintaining the distinction between categories (or labels) is crucial, and no ordinality must be introduced.</p>
</li>
<li><p>It handles missing values gracefully: the absence of a category results in all zeros across the one-hot encoded columns.</p>
</li>
</ul>
<h3 id="heading-challenges-with-one-hot-encoding">Challenges with one-hot encoding</h3>
<ol>
<li><h4 id="heading-dummy-variable-trap">Dummy Variable Trap 💡</h4>
</li>
</ol>
<p>Be aware of the "dummy variable trap," where multicollinearity can occur if one column can be predicted from the others.</p>
<p>To avoid this, you can safely drop one of the one-hot encoded columns, reducing the dimensionality by one. You can set <code>drop_first=True</code> in the <code>get_dummies</code> function as shown below.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Color'</span>: [<span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>, <span class="hljs-string">'Blue'</span>, <span class="hljs-string">'Red'</span>, <span class="hljs-string">'Green'</span>]})

<span class="hljs-comment"># Perform one-hot encoding</span>
encoded_data = pd.get_dummies(data, columns=[<span class="hljs-string">'Color'</span>], drop_first=<span class="hljs-literal">True</span>)
</code></pre>
<p>Output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538220635/31418b98-033e-40c9-b3f4-f3b77071afd1.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><h4 id="heading-curse-of-dimensionality">Curse of Dimensionality</h4>
</li>
</ol>
<p>One-hot encoding can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.</p>
<ol start="3">
<li><h4 id="heading-multicollinearity">Multicollinearity</h4>
</li>
</ol>
<p>In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.</p>
<ol start="4">
<li><h4 id="heading-data-sparsity">Data Sparsity</h4>
</li>
</ol>
<p>When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.</p>
<ol start="3">
<li><h2 id="heading-binary-encoding">Binary Encoding</h2>
</li>
</ol>
<p>Binary encoding is a versatile technique for encoding categorical features, especially when dealing with high-cardinality data.</p>
<p>It combines the benefits of one-hot and label encoding while reducing dimensionality.</p>
<h3 id="heading-how-it-works-2">How it works</h3>
<p>Binary encoding works by converting each category into binary code and representing it as a sequence of binary digits (<strong>0</strong>s and <strong>1</strong>s).</p>
<p>Each binary digit is then placed in a separate column, effectively creating a set of binary columns for each category.</p>
<p>The encoding process is as follows:</p>
<ol>
<li><p>Assign a unique integer to each category, similar to label encoding.</p>
</li>
<li><p>Convert the integer to binary code.</p>
</li>
<li><p>Create a set of binary columns to represent the binary code.</p>
</li>
</ol>
<p>For example, consider a categorical feature "Country" with categories "USA," "Canada," and "UK."</p>
<p>Binary encoding would involve assigning unique integers to each country (e.g., "USA" -&gt; 1, "Canada" -&gt; 2, "UK" -&gt; 3) and then converting these integers to binary code. The binary digits (0s and 1s) are then placed in separate binary columns:</p>
<ul>
<li><p>"USA" → 1 → 001 → [0, 0, 1]</p>
</li>
<li><p>"Canada" → 2 → 010 → [0, 1, 0]</p>
</li>
<li><p>"UK" → 3 → 011 → [0, 1, 1]</p>
</li>
</ul>
<h3 id="heading-code-implementation-2">Code Implementation</h3>
<p>Let us go through an example in Python.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Country'</span>: [<span class="hljs-string">'USA'</span>, <span class="hljs-string">'Canada'</span>, <span class="hljs-string">'UK'</span>, <span class="hljs-string">'USA'</span>, <span class="hljs-string">'UK'</span>]})

<span class="hljs-comment"># Initialize the binary encoder</span>
encoder = ce.BinaryEncoder(cols=[<span class="hljs-string">'Country'</span>])

<span class="hljs-comment"># Fit and transform the data</span>
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>The output is below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538397980/24d2095f-d86c-4356-abf7-2ca0395ad209.png" alt class="image--center mx-auto" /></p>
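<p>To see what the encoder is doing under the hood, the three steps described earlier can be reproduced manually. This is only an illustrative sketch following the article's example mapping; the library's exact integer assignment and column order may differ:</p>

```python
import pandas as pd

data = pd.DataFrame({'Country': ['USA', 'Canada', 'UK', 'USA', 'UK']})

# Step 1: assign a unique integer to each category
codes = {'USA': 1, 'Canada': 2, 'UK': 3}

# Step 2: convert each integer to a fixed-width binary string
binary = data['Country'].map(lambda c: format(codes[c], '03b'))

# Step 3: place each binary digit in its own column
for i in range(3):
    data[f'Country_{i}'] = binary.str[i].astype(int)

print(data)
```

<p>For example, "UK" receives the integer 3, whose 3-bit binary form is <code>011</code>, giving the columns [0, 1, 1].</p>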
<h3 id="heading-advantages">Advantages</h3>
<ul>
<li><p>It combines the advantages of both one-hot encoding and label encoding, efficiently converting categorical data into a binary format.</p>
</li>
<li><p>It is memory-efficient and mitigates the curse of dimensionality.</p>
</li>
<li><p>Finally, it is easy to implement and interpret.</p>
</li>
</ul>
<h3 id="heading-when-to-use-1">When to Use?</h3>
<p>Binary encoding is a suitable choice when:</p>
<ul>
<li><p>Dealing with high-cardinality categorical features (features with a large number of unique categories).</p>
</li>
<li><p>You want to reduce the dimensionality compared to one-hot encoding, especially for features with many unique categories.</p>
</li>
</ul>
<ol start="4">
<li><h2 id="heading-ordinal-encoding">Ordinal Encoding</h2>
</li>
</ol>
<p>As the name suggests, Ordinal Encoding encodes the categories in an ordinal feature by mapping them to integer values in ascending order of rank.</p>
<h3 id="heading-how-it-works-3">How it Works</h3>
<p>The process of ordinal encoding involves mapping each category to a unique integer, typically based on their order or rank.</p>
<p>Consider an ordinal feature "Education Level" with categories: "High School," "Associate's Degree," "Bachelor's Degree," "Master's Degree," and "PhD".</p>
<p>Ordinal encoding will assign integer values as follows:</p>
<ul>
<li><p>"High School" → 0</p>
</li>
<li><p>"Associate's Degree" → 1</p>
</li>
<li><p>"Bachelor's Degree" → 2</p>
</li>
<li><p>"Master's Degree" → 3</p>
</li>
<li><p>"PhD" → 4</p>
</li>
</ul>
<p>These integer values reflect the ordinal relationship between the education levels.</p>
<h3 id="heading-code-implementation-3">Code Implementation</h3>
<p>Here's how we implement Ordinal Encoding in Python.</p>
<pre><code class="lang-python"><span class="hljs-comment"># necessary imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">"Education Level"</span>: [<span class="hljs-string">"High School"</span>, <span class="hljs-string">"Bachelor's Degree"</span>, <span class="hljs-string">"Master's Degree"</span>, <span class="hljs-string">"PhD"</span>, <span class="hljs-string">"Associate's Degree"</span>]})

<span class="hljs-comment"># Define the ordinal encoding mapping</span>
education_mapping = {
    <span class="hljs-string">'High School'</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"Associate's Degree"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"Bachelor's Degree"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"Master's Degree"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">'PhD'</span>: <span class="hljs-number">4</span>
}

<span class="hljs-comment"># Perform ordinal encoding</span>
encoder = ce.OrdinalEncoder(mapping=[{<span class="hljs-string">'col'</span>: <span class="hljs-string">'Education Level'</span>, <span class="hljs-string">'mapping'</span>: education_mapping}])
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>Output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706538635525/91153028-e373-491c-b75a-8258d39e3c2b.png" alt class="image--center mx-auto" /></p>
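<p>If you prefer to stay within scikit-learn, the same result can be obtained by passing the desired order to <code>OrdinalEncoder</code> via its <code>categories</code> parameter (a sketch):</p>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The position of each level in this list becomes its encoded value
levels = ['High School', "Associate's Degree", "Bachelor's Degree",
          "Master's Degree", 'PhD']

data = pd.DataFrame({'Education Level': ['High School', 'PhD', "Master's Degree"]})

encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(data)
print(encoded.ravel())  # Output: [0. 4. 3.]
```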
<h3 id="heading-advantages-1">Advantages</h3>
<ul>
<li><p>It captures and preserves the ordinal relationships between categories, which can be valuable for certain types of analyses.</p>
</li>
<li><p>It reduces the dimensionality of the dataset compared to one-hot encoding.</p>
</li>
<li><p>It provides a numerical representation of the data, making it suitable for many machine learning algorithms.</p>
</li>
</ul>
<h3 id="heading-when-to-use-ordinal-encoding">When to Use Ordinal Encoding</h3>
<p>Ordinal encoding is an appropriate choice when:</p>
<ul>
<li><p>Dealing with categorical features that exhibit a clear and meaningful order or ranking.</p>
</li>
<li><p>Preserving the ordinal relationship among categories is essential for your analysis or model.</p>
</li>
<li><p>You want to convert the data into a numerical format while maintaining the inherent order of the categories.</p>
</li>
</ul>
<ol start="5">
<li><h2 id="heading-frequency-encoding-or-count-encoding">Frequency Encoding or Count Encoding</h2>
</li>
</ol>
<p>Frequency encoding, also known as count encoding, is a technique that encodes categorical features based on the frequency of each category in the dataset.</p>
<p>This method assigns each category a numerical value representing how often it occurs. It's a straightforward approach that can be effective in certain scenarios.</p>
<p>Categories that appear more frequently receive higher values, while less common categories receive lower values. This provides a numerical representation of the categories based on their prevalence.</p>
<h3 id="heading-how-it-works-4">How it works</h3>
<p>The process involves mapping each category to its frequency or count within the dataset.</p>
<p>Consider a categorical feature "City" with categories "New York," "Los Angeles," "Chicago," and "San Francisco." If "New York" appears 50 times, "Los Angeles" 30 times, "Chicago" 20 times, and "San Francisco" 10 times, frequency encoding will assign values as follows:</p>
<ul>
<li><p>"New York" → 50</p>
</li>
<li><p>"Los Angeles" → 30</p>
</li>
<li><p>"Chicago" → 20</p>
</li>
<li><p>"San Francisco" → 10</p>
</li>
</ul>
<blockquote>
<p>💡 NOTE</p>
<p>Frequency or Count Encoding is especially effective where the frequency of categories in a feature has a significant impact.</p>
<p>It should not be applied to ordinal categorical features.</p>
</blockquote>
<h3 id="heading-code-implementation-4">Code Implementation</h3>
<p>The implementation here is pretty straightforward.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'City'</span>: [<span class="hljs-string">'New York'</span>, <span class="hljs-string">'Los Angeles'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'New York'</span>, <span class="hljs-string">'Los Angeles'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'Chicago'</span>, <span class="hljs-string">'New York'</span>, <span class="hljs-string">'New York'</span>]})

<span class="hljs-comment"># frequency encoding</span>
frequency_encoding = data[<span class="hljs-string">'City'</span>].value_counts().to_dict()
data[<span class="hljs-string">'Encoded_City'</span>] = data[<span class="hljs-string">'City'</span>].map(frequency_encoding)
</code></pre>
<p>Output below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706540513342/07d8aa28-2ee3-40ba-aadd-8a9f402ea929.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-advantages-of-frequency-encoding">Advantages of Frequency Encoding</h3>
<p>Frequency encoding offers the following advantages:</p>
<ul>
<li><p>It encodes categorical data in a straightforward and interpretable way, preserving the count information.</p>
</li>
<li><p>Particularly useful when the frequency of categories is a relevant feature for the problem you're solving.</p>
</li>
<li><p>It reduces dimensionality compared to one-hot encoding, which can be beneficial in high-cardinality scenarios.</p>
</li>
</ul>
<h3 id="heading-when-to-use-frequency-encoding">When to Use Frequency Encoding</h3>
<p>Frequency encoding is an appropriate choice when:</p>
<ul>
<li><p>Analyzing categorical features where the frequency of each category is relevant information for your model.</p>
</li>
<li><p>Reducing the dimensionality of the dataset compared to one-hot encoding while preserving the information about category frequency.</p>
</li>
</ul>
<ol start="6">
<li><h2 id="heading-target-encoding-or-mean-encoding">Target Encoding or Mean Encoding</h2>
</li>
</ol>
<p>Target encoding, also known as Mean Encoding, is a powerful technique that encodes categorical features using information from the target variable.</p>
<p>It assigns a numerical value to each category based on the mean of the target variable within that category.</p>
<p>Target encoding is particularly useful in classification problems. It captures how likely each category is to result in the target variable taking a specific value.</p>
<h3 id="heading-how-target-encoding-works">How Target Encoding Works</h3>
<p>The process of target encoding involves mapping each category to the mean of the target variable for data points within that category. This encoding method provides a direct relationship between the categorical feature and the target variable.</p>
<p>Consider a categorical feature "Region" with categories "North," "South," "East," and "West." If we're dealing with a binary classification problem where the target variable is "Churn" (0 for no churn, 1 for churn), target encoding might assign values as follows:</p>
<ul>
<li><p>"North" → Mean of "Churn" for data points in the "North" category</p>
</li>
<li><p>"South" → Mean of "Churn" for data points in the "South" category</p>
</li>
<li><p>"East" → Mean of "Churn" for data points in the "East" category</p>
</li>
<li><p>"West" → Mean of "Churn" for data points in the "West" category</p>
</li>
</ul>
<h3 id="heading-code-implementation-5">Code Implementation</h3>
<p>Here's a Python code example for target encoding using the <code>category_encoders</code> library:</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Region'</span>: [<span class="hljs-string">'North'</span>, <span class="hljs-string">'South'</span>, <span class="hljs-string">'East'</span>, <span class="hljs-string">'West'</span>, <span class="hljs-string">'North'</span>, <span class="hljs-string">'South'</span>], 
                     <span class="hljs-string">'Churn'</span>: [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>]})

<span class="hljs-comment"># Perform target encoding</span>
encoder = ce.TargetEncoder(cols=[<span class="hljs-string">'Region'</span>])
encoded_data = encoder.fit_transform(data, data[<span class="hljs-string">'Churn'</span>])
</code></pre>
<p>Output is shared below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706539064729/eb4320b9-990f-47ff-8547-161c03ed8e98.png" alt class="image--center mx-auto" /></p>
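<p>Under the hood, the raw (unsmoothed) version of this encoding is just a groupby mean. The sketch below shows that version; note that <code>category_encoders</code>' <code>TargetEncoder</code> additionally smooths each category mean toward the global mean, so its output will differ from these raw values:</p>

```python
import pandas as pd

data = pd.DataFrame({'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
                     'Churn': [0, 1, 0, 1, 0, 1]})

# Mean of the target within each category
category_means = data.groupby('Region')['Churn'].mean()
data['Encoded_Region'] = data['Region'].map(category_means)
print(data['Encoded_Region'].tolist())  # Output: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```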
<h3 id="heading-best-practices">Best Practices</h3>
<p>When using target encoding, consider the following best practices:</p>
<ul>
<li><p>Be cautious about potential data leakage, as the mean of the target variable is used in the encoding process. Ensure you're not using information from the test or validation set when encoding.</p>
</li>
<li><p>Use cross-validation or other techniques to prevent overfitting and improve the robustness of target encoding.</p>
</li>
</ul>
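<p>The cross-validation advice above is often implemented as out-of-fold target encoding: each row is encoded using category means computed only from the other folds, so no row ever sees its own target value. A minimal sketch:</p>

```python
import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({'Region': ['North', 'South', 'East', 'West'] * 5,
                     'Churn': [0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
                               0, 1, 0, 1, 0, 1, 1, 1, 0, 0]})

global_mean = data['Churn'].mean()
data['Encoded_Region'] = global_mean  # fallback for categories unseen in a fold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(data):
    # Category means computed on the training fold only
    fold_means = data.iloc[train_idx].groupby('Region')['Churn'].mean()
    encoded = data.iloc[val_idx]['Region'].map(fold_means).fillna(global_mean)
    data.loc[data.index[val_idx], 'Encoded_Region'] = encoded.values
```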
<h3 id="heading-advantages-of-target-encoding">Advantages of Target Encoding</h3>
<p>Target encoding offers several advantages:</p>
<ul>
<li><p>It captures the relationship between the categorical feature and the target variable, making it useful in classification problems.</p>
</li>
<li><p>It provides a direct and interpretable way to encode categorical features.</p>
</li>
<li><p>It reduces dimensionality compared to one-hot encoding while preserving valuable information about category-specific behavior.</p>
</li>
</ul>
<h3 id="heading-when-to-use-target-encoding">When to Use Target Encoding</h3>
<p>Target encoding is an appropriate choice when:</p>
<ul>
<li><p>Working with categorical features and a categorical target variable in classification problems.</p>
</li>
<li><p>You want to capture the relationship between the categorical feature and the target variable, helping the model make predictions based on category-specific behavior.</p>
</li>
</ul>
<ol start="7">
<li><h2 id="heading-feature-hashing-or-hashing-trick">Feature Hashing or Hashing Trick</h2>
</li>
</ol>
<p>A rather under-appreciated encoding technique, Feature Hashing, also known as the Hashing Trick, is a method used to encode high-cardinality categorical features efficiently.</p>
<p>It works by applying a hash function to the categorical data, reducing the dimensionality of the feature while still providing a numerical representation.</p>
<blockquote>
<p>💡 Feature hashing is particularly useful when dealing with large datasets with many unique categories.</p>
</blockquote>
<h3 id="heading-how-feature-hashing-works">How Feature Hashing Works</h3>
<p>The feature hashing process involves applying a hash function to the categorical data, which maps each category to a fixed number of numerical columns.</p>
<p>The hash function distributes the categories across these columns, and each category contributes to the values of multiple columns.</p>
<h3 id="heading-code-implementation-6">Code Implementation</h3>
<p>Let's implement this in Python.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> category_encoders <span class="hljs-keyword">as</span> ce
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Sample data</span>
data = pd.DataFrame({<span class="hljs-string">'Product Category'</span>: [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-string">'E'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>]})

<span class="hljs-comment"># Perform feature hashing with three columns</span>
encoder = ce.HashingEncoder(cols=[<span class="hljs-string">'Product Category'</span>], n_components=<span class="hljs-number">3</span>)
encoded_data = encoder.fit_transform(data)
</code></pre>
<p>Output is shared below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706539418895/7cf94d9a-7333-412e-8400-b02b3ec129b6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-when-to-use-feature-hashing">When to Use Feature Hashing</h3>
<p>Feature hashing is an appropriate choice when:</p>
<ul>
<li><p>Dealing with high-cardinality categorical features that have too many unique categories to handle using one-hot encoding or other techniques.</p>
</li>
<li><p>Reducing the dimensionality of the dataset while retaining the essential information from the categorical feature.</p>
</li>
<li><p>Memory and computational resources are limited, making it challenging to work with a high number of binary columns.</p>
</li>
</ul>
<h2 id="heading-concluding-thoughts">Concluding thoughts</h2>
<p>This concludes our tour of the most useful encoding techniques for categorical variables in your data science and machine learning tasks.</p>
<p>Encoding data features is a crucial step in any machine learning pipeline and I hope that this article serves as a ready reference for all your upcoming projects.</p>
<p>Each technique has its strengths and is best suited for specific scenarios. Make sure to refer to the <strong>"When to Use"</strong> section for each encoding technique to apply the right feature encoding technique to your dataset.</p>
<p>The reference code is compiled for you in the notebook below.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/utkarshkant/Helpful-Python/blob/master/Encoding_Categorical_Features.ipynb">https://github.com/utkarshkant/Helpful-Python/blob/master/Encoding_Categorical_Features.ipynb</a></div>
<p> </p>
<hr />
<p>Hope you enjoyed this!</p>
<p>Feel free to reach out for any queries or feedback below or on my socials.</p>
<blockquote>
<ul>
<li><p><a target="_blank" href="https://www.linkedin.com/in/utkarsh-kant/">LinkedIn</a></p>
</li>
<li><p><a target="_blank" href="https://twitter.com/kantschants">X</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/@kantschants4139">YouTube</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/utkarshkant">Github</a></p>
</li>
</ul>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Assumptions of Linear Regression - Ace the Most Asked Interview Question]]></title><description><![CDATA[Introduction
Linear Regression is one of the most popular statistical models and machine learning algorithms. Considered the holy grail in the world of Data Science and Machine Learning.
It is one of the first (if not the first) algorithms that is th...]]></description><link>https://kantcodes.com/assumptions-of-linear-regression</link><guid isPermaLink="true">https://kantcodes.com/assumptions-of-linear-regression</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Linear Regression]]></category><category><![CDATA[interview]]></category><category><![CDATA[statistics]]></category><category><![CDATA[probability]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Fri, 11 Aug 2023 04:33:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1691655440955/d240154a-ceb7-47a7-80a7-dc76d400802e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p><strong>Linear Regression</strong> is one of the most popular statistical models and machine learning algorithms, considered the holy grail in the world of Data Science and Machine Learning.</p>
<p>It is one of the first (if not the first) algorithms taught in ML schools and courses alike.</p>
<p>However, one of the most important aspects that a lot of tutorials skip is that Linear Regression cannot be applied to all datasets alike. There are certain mandates that a dataset and its distribution must follow for Linear Regression to be successfully modeled to it.</p>
<p>These are popularly also known as the <strong>Assumptions of Linear Regression</strong>.</p>
<blockquote>
<p>💡 The assumptions of the Linear Regression model are a favorite interview question for Data Scientist and Machine Learning Engineer positions.</p>
</blockquote>
<p>In this article, we will not only list the different assumptions of a linear regression model but also discuss why they are so, and the rationale behind each of them.</p>
<blockquote>
<p>The prerequisite for this discussion is a good understanding of the Linear Regression algorithm itself.</p>
</blockquote>
<p>So let’s go! 🚀</p>
<h1 id="heading-a-quick-review-of-linear-regression">A quick review of Linear Regression</h1>
<p>We know that the Linear Regression model aims at establishing the <strong>best-fit line</strong> between the dependent and independent features of a dataset as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691657033586/174f5aad-4386-4422-a432-64afce961754.png" alt class="image--center mx-auto" /></p>
<p><em>Figure:</em> <code>y = 3 + 5x + np.random.rand(100, 1)</code></p>
<p>The Linear Regression model is defined as follows.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691676491989/582bf158-ade0-4260-91b5-324e228d37c6.png" alt class="image--center mx-auto" /></p>
<p>Now, let us discuss the assumptions of the Linear Regression model.</p>
<h1 id="heading-assumption-of-linear-regression-model">Assumption of Linear Regression Model</h1>
<p>The assumptions of Linear Regression are as follows:</p>
<ol>
<li><p><strong>Linearity</strong></p>
</li>
<li><p><strong>Homoscedasticity or Constant Error Variance</strong></p>
</li>
<li><p><strong>Independent Error Terms or No Autocorrelation</strong></p>
</li>
<li><p><strong>Normality of Residuals</strong></p>
</li>
<li><p><strong>No or Negligible Multi-collinearity</strong></p>
</li>
<li><p><strong>Exogeneity</strong></p>
</li>
</ol>
<blockquote>
<p>💡 NOTE</p>
<p>Different sources and textbooks might list a different number of assumptions of a linear regression model. And they are all correct.</p>
<p>However, <strong>the 6 assumptions that we will discuss today shall cover all of the different assumptions</strong>.</p>
<p>Many textbooks break individual assumptions into several finer-grained ones and can therefore list around 10 different assumptions.</p>
</blockquote>
<p>⭐ Think of these assumptions as guidelines: a dataset that satisfies them is highly suitable for a Linear Regression model.</p>
<p>Alright! Let’s discuss each of these assumptions in detail.</p>
<h2 id="heading-1-linearity">1. Linearity</h2>
<p>This essentially means that <strong>there must be a linear relationship between the dependent and the independent features</strong> of a dataset.</p>
<p>And this is fairly intuitive as the best-fit line of a linear regression model is a straight line, which is most suitable for linear data distribution.</p>
<p>Compare the two different distributions below:</p>
<ul>
<li><p><strong>Data is linearly distributed</strong></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691678534672/07ca5ee5-46b6-47d0-ab90-b1ab52902242.png" alt class="image--center mx-auto" /></p>
<p>  <em>Figure:</em> <code>y = 3 + 5x + np.random.rand(100, 1)</code></p>
</li>
<li><p><strong>Data is non-linearly distributed</strong></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691678638630/c6ccea0a-07a9-4b1e-801e-6ffe1e0f98c0.png" alt class="image--center mx-auto" /></p>
<p>  <em>Figure:</em> <code>y = 3 + 50x^2 + np.random.rand(100, 1)</code></p>
</li>
</ul>
<p>Comparing the two distributions, it is clear that the linear regression model is a better fit for the linearly distributed data.</p>
<h3 id="heading-how-to-detect-linearity-between-dependent-amp-independent-features">How to detect linearity between dependent &amp; independent features?</h3>
<p>Well, one way is to plot the data and detect it visually. However, in real-world scenarios, it may not be so simple to detect linearity in data.</p>
<p>The <strong>Likelihood Ratio (LR) Test</strong> is a good test for establishing linearity.</p>
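<p>As a quick first screen (a simple sketch on synthetic data, not a substitute for the formal LR test), the Pearson correlation between x and y already flags strong linear association:</p>

```python
def pearson_r(xs, ys):
    # Pearson correlation: +1 / -1 indicates a perfect linear relationship
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

x = [i / 10 for i in range(100)]
y_linear = [3 + 5 * xi for xi in x]            # same form as the linear figure above
y_quadratic = [3 + 50 * xi ** 2 for xi in x]   # same form as the non-linear figure

r_lin = pearson_r(x, y_linear)      # essentially 1.0
r_quad = pearson_r(x, y_quadratic)  # below 1 despite the monotone trend
```

<p>A correlation close to 1 (or -1) supports linearity, but a monotone curve can still score high, which is why the formal test and residual plots remain important.</p>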
<h2 id="heading-2-homoscedasticity-or-constant-error-variance">2. Homoscedasticity or Constant Error Variance</h2>
<p>The second assumption of linear regression is <strong>Homoscedasticity</strong>.</p>
<p>It means that the residuals (or error terms) should have constant variance across the range of predicted values; in other words, the error terms must be evenly spread, as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691679237886/8967aa2d-8fab-43d4-8d69-4b460aaac922.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: The residuals for a linearly distributed dataset have constant variance.</em></p>
<p>There are instances where the residuals are not evenly spread along the axis, and this condition is known as <strong>Heteroscedasticity</strong>. A few examples are shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691679292932/577dc895-06e7-4f9a-9993-d803b68cf61f.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Homoscedasticity vs Heteroscedasticity [</em><a target="_blank" href="https://d35fo82fjcw0y8.cloudfront.net/2016/06/03210521/homoscedasticity.png"><em>Source</em></a><em>]</em></p>
<p>When there is Heteroscedasticity in the data, the standard errors cannot be relied upon; hence, it is a violation of the assumptions of Linear Regression.</p>
<h3 id="heading-how-to-detect-heteroscedasticity-in-data">How to detect Heteroscedasticity in data?</h3>
<p>Apart from visually detecting it, there are statistical tests for determining Heteroscedasticity, the popular ones are:</p>
<ol>
<li><p><strong>Goldfeld-Quandt test</strong></p>
</li>
<li><p><strong>Breusch-Pagan test</strong></p>
</li>
</ol>
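<p>Here is a hedged sketch of the Goldfeld-Quandt intuition on simulated residuals (statsmodels ships ready-made versions of both tests, e.g. <code>het_goldfeldquandt</code> and <code>het_breuschpagan</code>):</p>

```python
import random
import statistics

random.seed(0)
x = [i / 100 for i in range(200)]
# Simulated residuals whose spread grows with x (i.e., heteroscedastic)
resid = [random.gauss(0, 0.1 + xi) for xi in x]

# Goldfeld-Quandt idea: order by x, drop the middle, compare tail variances
low_var = statistics.variance(resid[:80])
high_var = statistics.variance(resid[-80:])
ratio = high_var / low_var  # a ratio far above 1 signals heteroscedasticity
```

<p>For homoscedastic residuals the two tail variances would be similar and the ratio would hover near 1; the formal test turns this ratio into an F-statistic.</p>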
<h3 id="heading-how-to-remove-heteroscedasticity-in-data">How to remove Heteroscedasticity in data?</h3>
<p>There are certain ways to remove Heteroscedasticity from your data, some of them are:</p>
<ol>
<li><p><strong>White’s standard errors</strong>: Heteroscedasticity-robust (White’s) standard errors adjust the estimated standard errors so that inference remains valid under heteroscedasticity; the downside is that the confidence intervals around the coefficients of the independent features become wider.</p>
</li>
<li><p><strong>Weighted least squares</strong>: Weights each observation, typically by the inverse of its error variance, so that noisier observations contribute less to the fit. Choosing good weights can involve some trial and error.</p>
</li>
<li><p><strong>Log transformations</strong>: Many times a curved distribution can be converted into a linear distribution (i.e., a straight line) by simply applying the log function to it. Other transformations may work out as well.</p>
</li>
</ol>
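<p>As a small illustration of the log-transformation idea (on synthetic data): an exponential trend becomes an exact straight line after taking logs:</p>

```python
import math

x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]   # exponential growth: a curved trend

log_y = [math.log(yi) for yi in y]     # after the log transform: exactly linear in x
diffs = [b - a for a, b in zip(log_y, log_y[1:])]  # constant slope of 0.5
```

<p>Constant successive differences in <code>log_y</code> confirm the transformed series is linear, so a linear model (and its homoscedasticity assumption) is far more plausible on the log scale.</p>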
<h2 id="heading-3-independent-error-terms-or-no-autocorrelation">3. <strong>Independent Error Terms or No Autocorrelation</strong></h2>
<p>Here, the assumption states that each residual term is unrelated to the residual terms occurring before or after it. A good example of this is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680567582/39bd4c0b-f352-4819-b713-b60e97a5b607.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: The residuals for a linearly distributed dataset are independent of each other.</em></p>
<blockquote>
<p>💡 NOTE<br /><strong>Autocorrelation</strong> is the relation of the data series with itself, where the error term of the next data record is related to the residual of the previous data record.</p>
</blockquote>
<p>It is most often found in time-series data and not so prevalent in regular cross-sectional datasets. An example of a time-series distribution is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680482514/47c6443e-c1dd-4bff-abcb-814ba9b65c77.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Autocorrelation in time series data helps forecast future outcomes.</em></p>
<p>Therefore, it is not something you will encounter very often; however, when present, it violates the assumptions of linear regression.</p>
<p>With autocorrelation in the data, the standard error of the output becomes unreliable.</p>
<h3 id="heading-how-to-detect-autocorrelation">How to detect autocorrelation?</h3>
<p>There are a few tests for detecting autocorrelation in a dataset. Here are a few:</p>
<ol>
<li><p>ACF &amp; PACF plots</p>
</li>
<li><p>Durbin-Watson test</p>
</li>
</ol>
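<p>The Durbin-Watson statistic is simple enough to compute by hand. It ranges from 0 to 4: values near 2 suggest no autocorrelation, near 0 strong positive autocorrelation, and near 4 strong negative autocorrelation. A sketch on synthetic residuals:</p>

```python
def durbin_watson(residuals):
    # DW = sum of squared successive differences / sum of squared residuals
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(r ** 2 for r in residuals)
    return num / den

# Alternating residuals: strong negative autocorrelation -> DW near 4
alternating = [(-1) ** i for i in range(100)]
# Slowly drifting residuals: strong positive autocorrelation -> DW near 0
drifting = [i / 100 for i in range(100)]
```

<p>statsmodels exposes the same computation as <code>durbin_watson</code> in <code>statsmodels.stats.stattools</code>.</p>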
<h2 id="heading-4-normality-of-residuals">4. <strong>Normality of Residuals</strong></h2>
<p>This assumption states that the residuals (errors) of the model must be normally distributed.</p>
<p>If the normality of errors is violated and the number of records is small, then the standard errors in output are affected. That impacts the best-fit line of the model.</p>
<blockquote>
<p>💡 NOTE<br />This assumption generally is considered a weak assumption for Linear Regression models and slight (or greater) violations can be neglected while modeling. This is particularly true for large datasets.</p>
</blockquote>
<h3 id="heading-how-to-detect-normality-in-errors">How to detect normality in errors?</h3>
<p>There are multiple visual and statistical tests for detecting normality in error terms. Some of the popular ones are:</p>
<ol>
<li><p><strong>Histogram</strong></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680225933/ef4fdbb6-005d-49e6-8096-77d42d90dbc3.png" alt class="image--center mx-auto" /></p>
<p> <em>Figure: Residuals are normally distributed [</em><a target="_blank" href="https://cdn.aptech.com/www/uploads/2017/05/econ_tutorial_ols_resid_normality_1.png"><em>Source</em></a><em>]</em></p>
</li>
<li><p><strong>Q-Q Plot</strong></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691680185473/40c74b01-4eaf-4dff-8aa7-2da3d74e9d5b.png" alt class="image--center mx-auto" /></p>
<p> <em>Figure: Q-Q plot for normally distributed errors [</em><a target="_blank" href="https://i.stack.imgur.com/NpI0O.png"><em>Source</em></a><em>]</em></p>
</li>
<li><p><strong>Shapiro-Wilk test</strong></p>
</li>
<li><p><strong>Kolmogorov-Smirnov test</strong></p>
</li>
<li><p><strong>Anderson-Darling test</strong></p>
</li>
</ol>
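<p>The formal tests above are available in scipy.stats (e.g. <code>shapiro</code>, <code>kstest</code>, <code>anderson</code>). As a dependency-free rough check (illustrative only), sample skewness near zero is consistent with, though does not prove, normality:</p>

```python
import random
import statistics

random.seed(1)
resid = [random.gauss(0, 1) for _ in range(1000)]

mean = statistics.fmean(resid)
sd = statistics.pstdev(resid)
# Sample skewness: close to 0 when the residuals are normally distributed
skew = sum((r - mean) ** 3 for r in resid) / (len(resid) * sd ** 3)
```

<p>Strongly skewed residuals (large positive or negative values here) would show up as an asymmetric histogram and a bowed Q-Q plot, like the figures above.</p>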
<h3 id="heading-how-to-bring-normality-in-errors">How to bring normality in errors?</h3>
<p>As mentioned above, this is a weak assumption and can be neglected in many cases as well.</p>
<p>However, some ways to bring normality in residuals are:</p>
<ol>
<li><p>Mathematical transformations like log transformations etc.</p>
</li>
<li><p>Standardization or normalization of the dataset</p>
</li>
<li><p>Adding more data reduces the need for normally distributed error terms</p>
</li>
</ol>
<h2 id="heading-5-no-multi-collinearity">5. <strong>No Multi-collinearity</strong></h2>
<p>Multi-collinearity occurs when 2 or more independent features of a dataset are correlated with each other.</p>
<p>Consider a house price dataset with multiple variables about the property and price being the target variable. There is a high chance that the features 'floor area' and 'land dimensions' are highly correlated since the area is a direct multiple of individual dimensions.</p>
<p>Now, this is a problem for the regression model since what it effectively is trying to do is isolate the individual effects of each feature on the target variable. This is represented by the weights of each feature as shown below.</p>
<p>$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$$</p><p>Therefore, it is highly recommended to verify that there is no collinearity between the individual features within a dataset.</p>
<h3 id="heading-how-does-this-affect-our-model">How does this affect our model?</h3>
<p>It disturbs the best-fit line by impacting the individual coefficients of the variables, which then becomes unreliable.</p>
<h3 id="heading-how-to-detect-multicollinearity">How to detect multicollinearity?</h3>
<ol>
<li><p>Calculating correlation (ρ) between each feature in the dataset.</p>
</li>
<li><p>Variance Inflation Factor (VIF)</p>
</li>
</ol>
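<p>Mirroring the house-price example above (feature names and numbers are hypothetical), here is a sketch of how a near-deterministic relationship between two features produces a very large VIF:</p>

```python
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
floor_area = [random.uniform(50, 200) for _ in range(100)]
# A second feature almost fully determined by the first (hypothetical proxy)
land_size = [1.4 * a + random.gauss(0, 5) for a in floor_area]

r = pearson_r(floor_area, land_size)
# With two predictors, VIF = 1 / (1 - R^2), and here R^2 = r^2
vif = 1 / (1 - r ** 2)  # values above ~5-10 are a common red flag
```

<p>With more than two predictors, the R² in the VIF formula comes from regressing each feature on all the others; statsmodels provides <code>variance_inflation_factor</code> for that general case.</p>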
<h3 id="heading-how-to-remove-multicollinearity">How to remove multicollinearity?</h3>
<ol>
<li><p>Simply removing one of the correlated variables.</p>
</li>
<li><p>Merging them into a single feature can prevent multicollinearity.</p>
<blockquote>
<p>⚠️ <strong>CAUTION!</strong></p>
<p>Merging correlated features into a single feature will only work if the new feature actually has real-world existence or impacts the target variable equally.</p>
</blockquote>
</li>
</ol>
<h2 id="heading-6-exogeneity-or-no-endogeneity">6. <strong>Exogeneity (or</strong> No Endogeneity)</h2>
<p>Exogeneity or no omitted variable bias is the final assumption on our list.</p>
<p>But let’s first understand what omitted variable bias actually is.</p>
<p>If a variable that impacts the target has been omitted from the model, its effect leaks into the error term, causing omitted variable bias, or Endogeneity.</p>
<p>For example, consider the following model.</p>
<p>$$UsedCarPrice_i = \beta_0 + \beta_1(DistanceTravelled)_i + \epsilon_i$$</p><p>Here, the price of a used car is modeled by the distance it has already covered. However, the year of manufacturing impacts both the target variable (Y), the price of the used car, and the X variable, the distance traveled: the older the car, the more likely it has traveled greater distances.</p>
<p>This is a clear case of omitted variable bias and it is undesirable for accurate modeling.</p>
<blockquote>
<p>💡 NOTE<br />Exogeneity in a model means that every feature that impacts the target variable (Y) is captured among the model features (X), so that the error term is uncorrelated with the regressors.</p>
</blockquote>
<h1 id="heading-summary">Summary</h1>
<p>So this was our discussion on the Assumptions of Linear Regression. This is one of the favorite questions of Data Scientist interviewers and now you know how to ace it!</p>
<p>Here is a quick summary of the same.</p>
<ol>
<li><p><strong>Linearity</strong>: There must be a linear relationship between the dependent and independent variables.</p>
</li>
<li><p><strong>Homoscedasticity or Constant Error Variance</strong>: The variance of the errors is constant across all levels of the independent variables.</p>
</li>
<li><p><strong>Independent Error Terms or No Autocorrelation</strong>: The error terms are not correlated with one another.</p>
</li>
<li><p><strong>Normality of Residuals</strong>: The residuals or errors follow a normal distribution.</p>
</li>
<li><p><strong>No multicollinearity</strong>: There exists no correlation between the different independent variables.</p>
</li>
<li><p><strong>Exogeneity (No Endogeneity)</strong>: There must be no relationship between the independent variables and the errors.</p>
</li>
</ol>
<p>Keep this list handy when you prepare for your interviews.</p>
<hr />
<p>Hope you enjoyed this! Feel free to leave your feedback and queries below.</p>
]]></content:encoded></item><item><title><![CDATA[Paraphrase with Transformer Models like T5, BART, Pegasus - Ultimate Guide]]></title><description><![CDATA[Introduction
Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.
With the rise of artificial in...]]></description><link>https://kantcodes.com/paraphrasing-with-transformer-t5-bart-pegasus</link><guid isPermaLink="true">https://kantcodes.com/paraphrasing-with-transformer-t5-bart-pegasus</guid><category><![CDATA[AI]]></category><category><![CDATA[natural language processing]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Thu, 20 Jul 2023 08:46:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1689842686881/71c39fcb-30e8-4219-adaf-60843dbb3f81.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.</p>
<p>With the rise of artificial intelligence (AI), transformer models have emerged as powerful tools for automating and enhancing the paraphrasing process.</p>
<h2 id="heading-understanding-paraphrasing">Understanding Paraphrasing</h2>
<p>As per Oxford, <strong>Paraphrasing</strong> means <em>"to express the meaning of (something written or spoken) using different words, especially to achieve greater clarity"</em>.</p>
<p>Let's look at the below example:</p>
<blockquote>
<p>Original sentence: "The cat is sitting on the mat."<br />Paraphrased sentence: "The mat has a cat sitting on it."</p>
</blockquote>
<p>Both sentences, while constructed differently, convey the same meaning and context. This is paraphrasing.</p>
<h2 id="heading-whats-inside">What's inside 🔍</h2>
<p>In this article, we will explore the world of effective &amp; intelligent paraphrasing with transformer models. We'll dive into the underlying concepts of transformers and their advantages over conventional methods.</p>
<p>Additionally, we'll discuss popular transformer models such as BART, T5, and Pegasus that have been specifically designed for paraphrasing tasks.</p>
<p>By the end of this article, you'll have a comprehensive understanding of how transformer models are revolutionizing paraphrasing, and empowering individuals and industries with their transformative capabilities.</p>
<p>And more importantly, how you can build a nifty transformer for yourself.</p>
<p>Let's embark on this journey to unlock the power of AI in effective paraphrasing! 🚀</p>
<p><strong>NOTE</strong>: This article focuses on applications rather than theory; refer to this article to understand how transformers work internally.</p>
<h2 id="heading-transformer-models-for-paraphrasing">Transformer Models for Paraphrasing</h2>
<p>In the realm of paraphrasing, transformer models offer significant advantages over traditional approaches.</p>
<p>Unlike previous methods that relied heavily on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers employ self-attention mechanisms.</p>
<p>This enables them to focus on relevant words and phrases, facilitating a deeper understanding of the underlying semantics.</p>
<p>With their ability to capture long-range dependencies and contextual information through attention mechanisms, transformers have revolutionized various language-related tasks, including paraphrasing.</p>
<h2 id="heading-popular-transformer-models-for-paraphrasing">Popular Transformer Models for Paraphrasing</h2>
<p>Several popular transformer models have been specifically developed for paraphrasing tasks.</p>
<p>All these transformers can be found in the Huggingface Library. Let's explore:</p>
<h3 id="heading-1-bart-bidirectional-and-auto-regressive-transformer">1. BART (Bidirectional and Auto-Regressive Transformer)</h3>
<p>BART is a powerful transformer model by Facebook AI.</p>
<p>It has been trained using denoising autoencoder objectives and is renowned for its ability to generate high-quality paraphrases.</p>
<p>BART has been trained extensively on large-scale datasets and excels in various NLP tasks, especially paraphrasing.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/facebook/bart-base">https://huggingface.co/facebook/bart-base</a></p>
<h3 id="heading-2-t5-text-to-text-transfer-transformer">2. T5 (Text-To-Text Transfer Transformer)</h3>
<p>T5, developed by Google Research, is a versatile transformer model pre-trained using a text-to-text framework.</p>
<p>While its primary focus is on a wide range of NLP tasks, including translation and summarization, T5 can also be fine-tuned for paraphrasing.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/t5-base">https://huggingface.co/t5-base</a></p>
<h3 id="heading-3-pegasus-paraphrase">3. Pegasus Paraphrase</h3>
<p>Pegasus Paraphrase is specifically trained for executing paraphrasing tasks.</p>
<p>Built upon the Pegasus architecture (originally built for text summarization), it leverages the power of transformer models to generate accurate and contextually appropriate paraphrases.</p>
<p>Source: <a target="_blank" href="https://huggingface.co/tuner007/pegasus_paraphrase">https://huggingface.co/tuner007/pegasus_paraphrase</a></p>
<h2 id="heading-paraphrasing-with-transformers">Paraphrasing with Transformers</h2>
<p>Now let us look at how to paraphrase content with these special transformers and also compare their outputs.</p>
<p>Let's first paraphrase a sentence and then extend that to paraphrase long-form content, which is our main goal.</p>
<h3 id="heading-paraphrasing-a-sentence">Paraphrasing a Sentence</h3>
<p>Let us paraphrase a few random sentences.</p>
<blockquote>
<p>"She was a storm, not the kind you run from, but the kind you chase." - R.H. Sin, Whiskey Words &amp; a Shovel III</p>
<p>"She wasn't looking for a knight, she was looking for a sword." - Atticus</p>
<p>"In the end, we only regret the chances we didn't take." - Unknown</p>
<p>"I dreamt I am running on sand in the night" - Yours truly ;)</p>
<p>"Long long ago, there lived a king and a queen. For a long time, they had no children." - Random text on the internet</p>
<p>"I am typing the best article on paraphrasing with Transformers." - You know who!</p>
</blockquote>
<h4 id="heading-bart">BART</h4>
<p>Here is the code to paraphrase the above sentences with BART.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> BartTokenizer, BartForConditionalGeneration

<span class="hljs-comment"># Load pre-trained BART model and tokenizer</span>
model_name = <span class="hljs-string">'facebook/bart-base'</span>
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

<span class="hljs-comment"># Set up input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>Running the above code, we get the following output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind that you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking at a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: In the end, we only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. For a long time, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am typing the best article on paraphrasing with Transformers.
</code></pre>
<p>We see that BART is not super effective at paraphrasing sentences. Let's try the next transformer.</p>
<h4 id="heading-t5-text-to-text-transfer-transformer">T5 (Text-to-Text Transfer Transformer)</h4>
<p>Here is the code to paraphrase the above sentences with T5.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> T5Tokenizer, T5ForConditionalGeneration

<span class="hljs-comment"># Load pre-trained T5 Base model and tokenizer</span>
tokenizer = T5Tokenizer.from_pretrained(<span class="hljs-string">"t5-base"</span>, model_max_length=<span class="hljs-number">1024</span>)
model = T5ForConditionalGeneration.from_pretrained(<span class="hljs-string">"t5-base"</span>)

<span class="hljs-comment"># Set up input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>And here's the output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking for a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. Long long ago, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: Today I am typing the best article on paraphrasing with Transformers.
</code></pre>
<p>As we can see, T5's output differs slightly from BART's, but shows no significant improvement.</p>
<h4 id="heading-pegasus-paraphrase">Pegasus Paraphrase</h4>
<p>Finally, let's go over the code for Pegasus Paraphrase.</p>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> PegasusTokenizer, PegasusForConditionalGeneration

<span class="hljs-comment"># load pre-trained Pegasus Paraphrase model and tokenizer</span>
tokenizer = PegasusTokenizer.from_pretrained(<span class="hljs-string">"tuner007/pegasus_paraphrase"</span>)
model = PegasusForConditionalGeneration.from_pretrained(<span class="hljs-string">"tuner007/pegasus_paraphrase"</span>)

<span class="hljs-comment"># input sentences</span>
sentences = [
    <span class="hljs-string">"She was a storm, not the kind you run from, but the kind you chase."</span>,
    <span class="hljs-string">"She wasn't looking for a knight, she was looking for a sword."</span>,
    <span class="hljs-string">"In the end, we only regret the chances we didn't take."</span>,
    <span class="hljs-string">"I dreamt I am running on sand in the night"</span>,
    <span class="hljs-string">"Long long ago, there lived a king and a queen. For a long time, they had no children."</span>,
    <span class="hljs-string">"I am typing the best article on paraphrasing with Transformers."</span>
]

<span class="hljs-comment"># Paraphrase the sentences</span>
<span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
    <span class="hljs-comment"># Tokenize the input sentence</span>
    input_ids = tokenizer.encode(sentence, return_tensors=<span class="hljs-string">'pt'</span>)

    <span class="hljs-comment"># Generate paraphrased sentence</span>
    paraphrase_ids = model.generate(input_ids, num_beams=<span class="hljs-number">5</span>, max_length=<span class="hljs-number">100</span>, early_stopping=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Decode and print the paraphrased sentence</span>
    paraphrase = tokenizer.decode(paraphrase_ids[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">f"Original: <span class="hljs-subst">{sentence}</span>"</span>)
    print(<span class="hljs-string">f"Paraphrase: <span class="hljs-subst">{paraphrase}</span>"</span>)
    print()
</code></pre>
<p>Here's the output.</p>
<pre><code class="lang-markdown">Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She was looking for a sword, not a knight.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I ran on the sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: They had no children for a long time.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am writing the best article on the subject.
</code></pre>
<p>We can observe a significant improvement in the output with Pegasus Paraphrase.</p>
<p>Comparing the outputs of all three transformer models, we can declare Pegasus Paraphrase the clear winner.</p>
<h3 id="heading-paraphrasing-a-paragraph">Paraphrasing a Paragraph</h3>
<p>With our testing out of the way, we've finalized Pegasus Paraphrase as our choice of transformer for this task.</p>
<p>Now let's see how we can paraphrase paragraphs and long chunks of texts with it.</p>
<p>Theoretically, there are three main ways to paraphrase whole paragraphs.</p>
<h4 id="heading-1-adjusting-the-input-length"><strong>1. Adjusting the input length</strong></h4>
<p>By default, the maximum input length for Pegasus Paraphrase is set to a certain number of tokens. If the input paragraph exceeds this limit, it might be truncated, leading to incomplete paraphrasing.</p>
<p>Here we split the longer text into smaller chunks and run them through the model individually, then combine the paraphrased results afterward.</p>
<h4 id="heading-2-use-a-sliding-window-approach"><strong>2. Use a sliding window approach</strong></h4>
<p>Here we take a fixed-sized window and slide it over the input paragraph, generating paraphrases for each window. This way, we ensure that the entire paragraph is covered, albeit with overlapping segments.</p>
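<p>As a minimal sketch of the idea (the window and stride sizes here are illustrative, not tuned values), overlapping windows can be generated like this:</p>

```python
# A minimal sketch of the sliding-window idea: break a long text into
# overlapping windows of words. Window and stride sizes are illustrative.
def sliding_windows(text, window_size=20, stride=10):
    words = text.split()
    windows = []
    for start in range(0, len(words), stride):
        chunk = words[start:start + window_size]
        if chunk:
            windows.append(" ".join(chunk))
        # stop once a window has reached the end of the text
        if start + window_size >= len(words):
            break
    return windows
```

<p>Each window would then be passed through the paraphrase model individually, and the overlapping outputs merged afterward.</p>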
<h4 id="heading-3-optimizing-the-beam-search"><strong>3. Optimizing the Beam Search</strong></h4>
<p>Beam search is a decoding algorithm that helps in generating diverse outputs from the model.</p>
<p>By default, the model uses beam search with a beam width of 4. We can try to increase the beam width to encourage more exploration and potentially improve the quality of paraphrased outputs for longer texts.</p>
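<p>To see why a wider beam explores more candidates, here is a toy beam search over a hypothetical next-token score table. This is not Pegasus's actual decoder, just the shape of the algorithm:</p>

```python
import math

# Hypothetical per-step log-probabilities for a 3-token vocabulary.
# In a real model, these would come from the decoder at every step.
LOG_PROBS = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

def beam_search(steps, beam_width):
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, lp in LOG_PROBS.items():
                candidates.append((seq + [token], score + lp))
        # Keep only the `beam_width` highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_seq, best_score = beam_search(steps=2, beam_width=4)[0]
```

<p>A width of 1 reduces to greedy decoding; widening the beam keeps more partial hypotheses alive at each step, which is what <code>num_beams</code> controls in <code>model.generate</code>.</p>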
<p>If none of these approaches gives us satisfactory results, we can look at fine-tuning the model, but that's a discussion for another time.</p>
<p>In my research and experimentation, I've found that 'Adjusting the input length' gives us the best output. So let's go ahead and implement that.</p>
<p>For a view on challenges with other methods, take a look at the experimentation notebook here.</p>
<p>{insert link to notebook}</p>
<p>Let's paraphrase a paragraph from 'The Hound of the Baskervilles', one of the most popular <em>Sherlock Holmes</em> stories by <em>Sir Arthur Conan Doyle</em>.</p>
<blockquote>
<p>"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."</p>
</blockquote>
<pre><code class="lang-python"><span class="hljs-comment"># imports</span>
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> PegasusForConditionalGeneration, PegasusTokenizer

<span class="hljs-comment"># Load the Pegasus Paraphrase model and tokenizer</span>
model_name = <span class="hljs-string">"tuner007/pegasus_paraphrase"</span>
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

<span class="hljs-comment"># function to paraphrase long texts by adjusting the input length</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">paraphrase_paragraph</span>(<span class="hljs-params">text</span>):</span>

    <span class="hljs-comment"># Split the text into sentences</span>
    sentences = text.split(<span class="hljs-string">"."</span>)
    paraphrases = []

    <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> sentences:
        <span class="hljs-comment"># Clean up sentences</span>

        <span class="hljs-comment"># remove extra whitespace</span>
        sentence = sentence.strip()

        <span class="hljs-comment"># filter out empty sentences</span>
        <span class="hljs-keyword">if</span> len(sentence) == <span class="hljs-number">0</span>:
            <span class="hljs-keyword">continue</span>

        <span class="hljs-comment"># Tokenize the sentence</span>
        inputs = tokenizer.encode_plus(sentence, return_tensors=<span class="hljs-string">"pt"</span>, truncation=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">512</span>)

        input_ids = inputs[<span class="hljs-string">"input_ids"</span>]
        attention_mask = inputs[<span class="hljs-string">"attention_mask"</span>]

        <span class="hljs-comment"># paraphrase</span>
        paraphrase = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=<span class="hljs-number">4</span>,
            max_length=<span class="hljs-number">100</span>,
            early_stopping=<span class="hljs-literal">True</span>
        )[<span class="hljs-number">0</span>]
        paraphrased_text = tokenizer.decode(paraphrase, skip_special_tokens=<span class="hljs-literal">True</span>)

        paraphrases.append(paraphrased_text)

    <span class="hljs-comment"># Combine the paraphrases</span>
    combined_paraphrase = <span class="hljs-string">" "</span>.join(paraphrases)

    <span class="hljs-keyword">return</span> combined_paraphrase

<span class="hljs-comment"># Example usage</span>
text = <span class="hljs-string">"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."</span>
paraphrase = paraphrase_paragraph(text)
print(paraphrase)
</code></pre>
<p>Here we've split the paragraph into smaller chunks (individual sentences), paraphrased each chunk, and then combined the individual outputs back into a paragraph.</p>
<p>And below is the output.</p>
<blockquote>
<p>As Sir Henry and I sat at breakfast, the sunlight flooded in through the high windows, causing watery patches of color from the coats of arms. The dark panelling glowed like bronze in the golden rays, and it was hard to see that it was the chamber which had struck such a gloom into our souls the evening before. The evening before, Sir Henry's nerves were still handled and he came to breakfast, his cheeks flushed from the excitement of the early chase.</p>
</blockquote>
<h2 id="heading-concluding-thoughts">Concluding thoughts</h2>
<p>Throughout this article, we have explored the world of effective paraphrasing with transformer models, and we have seen how to build a paraphraser with Transformer models from Hugging Face.</p>
<p>Transformer models have brought about a paradigm shift in paraphrasing, empowering individuals and industries with their transformative capabilities. By harnessing the power of transformer models, we can unlock new possibilities in effective communication, content creation, academic writing, and language translation.</p>
<p>As the field of transformer-based paraphrasing continues to evolve, there are exciting opportunities for further exploration and adoption of these technologies.</p>
<p>Researchers and practitioners are encouraged to delve deeper into fine-tuning strategies, data augmentation techniques, and evaluation methodologies to advance the state-of-the-art in paraphrase generation.</p>
<p>Additionally, the ethical implications of using transformer models for paraphrasing should be considered. Careful attention should be given to biases and fairness to ensure equitable and responsible deployment of these technologies.</p>
<p>Let me know your thoughts and any feedback in the comments.</p>
<p>Until next time ... Ciao!</p>
]]></content:encoded></item><item><title><![CDATA[How to split your dataset into train, test, and validation sets?]]></title><description><![CDATA[Introduction
If you’ve been using the train_test_split method by sklearn to create the train, test, and validation datasets, then I know your pain.


Splitting datasets into the test, train, and validation datasets

While sklearn certainly provides u...]]></description><link>https://kantcodes.com/split-dataset-into-train-test-validation</link><guid isPermaLink="true">https://kantcodes.com/split-dataset-into-train-test-validation</guid><category><![CDATA[data analysis]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 25 Apr 2023 14:52:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oAa4BY3b-vo/upload/2db868572d3eca404d50d33da9b89a9a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>If you’ve been using the <code>train_test_split</code> method by <code>sklearn</code> to create the train, test, and validation datasets, then I know your pain.</p>
<blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*NQaN71ejH_eTUxhRLwiJcA.png" alt /></p>
<p>Splitting datasets into the test, train, and validation datasets</p>
</blockquote>
<p>While <code>sklearn</code> certainly provides us with a way to achieve our objective, it is a long-drawn-out procedure: we have to repeat the process twice, adjusting the split ratio at every step.</p>
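<p>To make the pain concrete, here is a sketch of that two-step <code>sklearn</code> procedure on a toy dataset. Note how the second split's ratio must be recomputed against the remainder:</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset: 10 rows, one target column (illustrative data)
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
X, y = df.drop(columns="y"), df["y"]

# Step 1: carve out the test set (20% of the original data)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remainder into train and validation sets.
# To keep 20% of the ORIGINAL data as validation, the ratio must be
# adjusted relative to the remainder: 0.2 / 0.8 = 0.25.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
```

<p>Two calls, one manual ratio adjustment, and the <code>X</code>/<code>y</code> separation still done by hand.</p>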
<p><strong>But rejoice,</strong> <code>fast_ml</code> <strong>is here!</strong></p>
<p>It offers a straightforward, to-the-point method to obtain the three datasets with a single line of code.</p>
<p>It is the <code>train_valid_test_split</code> method!</p>
<p>It not only splits the data as we require but also separates the dependent variable <code>y</code> from the independent variables <code>X</code> in the same line of code.</p>
<h2 id="heading-code-walkthrough">Code walkthrough</h2>
<p>Let’s check out how it’s done (<a target="_blank" href="https://github.com/utkarshkant/25-short-code-snippets_Python">notebook</a>)!</p>
<p><strong><em>Step 1:</em></strong> Install the <code>fast_ml</code> library and import the necessary packages and methods</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426224027/00c5633e-53c8-4594-b459-8b5c51602e45.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p><strong><em>Step 2:</em></strong> Load the dataset into a pandas data frame.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426872886/796064bf-fb72-4eb2-96e4-f79b988dfd5f.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p><strong><em>Step 3:</em></strong> Split the dataset</p>
<p>Once the data is loaded and ready to split, simply call the <code>train_valid_test_split</code> method and pass the dataset with the supporting parameters as below.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682426970820/a485c4a9-65d7-4d09-922f-020cdc5f6cf7.png" alt class="image--center mx-auto" /></p>
</blockquote>
<p>The datasets have been successfully split into train, test, and validation datasets. 🎉</p>
<blockquote>
<p><strong>💡 NOTE</strong><br />The split datasets retain their original index and resetting it is an optional step.</p>
</blockquote>
<p>You can now proceed with your modeling.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thanks to the team at <code>fast_ml</code>, the long-drawn-out task of splitting our dataset into independent and dependent features and then into training, testing, and validation datasets has been condensed into a single line of code. ⚡</p>
<p>You can find this notebook here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/utkarshkant/25-short-code-snippets_Python/blob/master/train_valid_test_split.ipynb">https://github.com/utkarshkant/25-short-code-snippets_Python/blob/master/train_valid_test_split.ipynb</a></div>
<p> </p>
<p>Let me know how you liked this quick article in the comments below, and feel free to reach out!</p>
]]></content:encoded></item><item><title><![CDATA[Data Made Easy: A Comprehensive Guide for Beginners]]></title><description><![CDATA[Introduction
Data is all around us, from the information we process every day to the data collected by businesses to make informed decisions.
Businesses today are thriving on the data that they have collected over the years. This data is then utilize...]]></description><link>https://kantcodes.com/data-complete-guide</link><guid isPermaLink="true">https://kantcodes.com/data-complete-guide</guid><dc:creator><![CDATA[Utkarsh Kant]]></dc:creator><pubDate>Tue, 18 Apr 2023 05:02:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/KgLtFCgfC28/upload/dc0a5ddb105b0a78e2a18e0b1c345ccf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Data is all around us, from the information we process every day to the data collected by businesses to make informed decisions.</p>
<p>Businesses today are thriving on the data that they have collected over the years. This data is then utilized intelligently to make informed business decisions.</p>
<p>But understanding the fundamentals of data itself and then utilizing it can be a daunting task, especially for beginners.</p>
<p>That's where this comprehensive guide comes in. We'll break down the concept of data at its most fundamental level, giving you the tools and techniques you need to handle it like a pro.</p>
<p>So let's dive in and make data easy!</p>
<h2 id="heading-so-what-is-data">So, what is data?</h2>
<blockquote>
<p>💡 All <strong>information</strong> essentially can be classified as <strong>data</strong>.</p>
</blockquote>
<p>It can come in multiple different forms, shapes, and sizes. It can be in the form of numbers, text, images, videos, and much more.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681124897398/212b48ef-4c74-44e8-a8dc-f909ea673db4.png" alt class="image--center mx-auto" /></p>
<p>Defining data</p>
</blockquote>
<h2 id="heading-isnt-data-a-simple-concept-why-should-we-learn-more-about-it">Isn’t data a simple concept? Why should we learn more about it?</h2>
<p>By now we know that all information is data. And from our discussion on statistics, we also know that</p>
<blockquote>
<p>💡 Data lies at the heart of any analytical solution. Therefore, <strong>without data, there is no statistic</strong>. And without statistics, there is no analysis.</p>
</blockquote>
<p>The first and most crucial step of solving any problem, be it statistics, analytics, data science, machine learning, etc., is to understand the data at hand.</p>
<h2 id="heading-different-types-of-data">Different types of data</h2>
<p>We spoke about the different forms of data. There are also a few different ways of classifying it, each serving a specific purpose.</p>
<p>Let’s go over the most popular types of data and see how they are classified.</p>
<p>The two major types of data are:</p>
<h3 id="heading-1-unstructured-data"><strong>1 — Unstructured data</strong></h3>
<blockquote>
<p>💡 As the name suggests, this type of data cannot be organized into a structure or a data model.</p>
</blockquote>
<p>Some of the popular examples are images, heatmaps, videos, spatial data, graph data, text documents, etc.</p>
<p>Unstructured data is not easily identifiable or interpretable by either humans or machines; machines need specialized techniques to process it.</p>
<p>By now, you must have realized that this type of data is a bad fit for traditional relational (SQL) databases.</p>
<h3 id="heading-2-structured-data"><strong>2 — Structured data</strong></h3>
<blockquote>
<p>💡 On the other hand, this type of data can be organized into a defined structure (as the name suggests)</p>
</blockquote>
<p>These are more commonly used in industrial settings, and one of the most common forms of structured data is the 2-dimensional data structure, that is, the humble table, also known as <strong>rectangular data</strong>.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125033936/219402a5-808b-4797-9720-0395c72bd75d.png" alt class="image--center mx-auto" /></p>
<p>A simple table capturing exam results of different students</p>
</blockquote>
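<p>As a quick sketch, a table like the one above maps directly to a pandas DataFrame (the names and marks here are made up for illustration):</p>

```python
import pandas as pd

# Illustrative rectangular data: each row is a record (a student),
# each column is a feature (name, marks, grade).
results = pd.DataFrame({
    "Name": ["Asha", "Ben", "Carlos"],
    "Marks": [88, 67, 45],
    "Grade": ["A", "B", "D"],
})

print(results.shape)  # (3, 3): 3 rows by 3 columns
```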
<p>I'm sure you can think of enough use cases from your own life where you have used Excel spreadsheets to store some information.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125060228/ff9cb65b-6f18-4e27-8836-df908418ce1a.png" alt class="image--center mx-auto" /></p>
<p>Another popular example of rectangular data is the <strong>Titanic dataset</strong> [<a target="_blank" href="https://www.kaggle.com/c/titanic/data">Source</a>]</p>
</blockquote>
<p>Structured data is further classified into a few different types of data. They are:</p>
<ol>
<li><p>Categorical data</p>
</li>
<li><p>Numerical data</p>
</li>
</ol>
<p>And even the above types of data can be classified further into different data types. Let's look at a complete breakdown before proceeding with each type of data.</p>
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681125110224/b193acad-2dfd-4819-b2f7-8f8829943336.png" alt class="image--center mx-auto" /></p>
<p>Different types of data</p>
</blockquote>
<h4 id="heading-21-categorical-data">2.1 — Categorical data</h4>
<blockquote>
<p>💡 The type of data that can be categorized [Genius <em>🕵️‍♂️].</em></p>
</blockquote>
<p>Now consider the dataset of students’ exam results. Depending on the grade, all students with grades other than <strong><em>F</em></strong> are deemed to have passed the examination.</p>
<p>So we add another column with the Passing status of each student. The column <strong><em>Pass/Fail</em></strong> has only one of the two entries, it can either be a <strong><em>Pass</em></strong> or a <strong><em>Fail</em></strong>.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F0eeb2c21-50d6-40fc-bffe-c9fc6442bdc0%2F1_WoqvQUDCRVumDhNVyoWZIQ.png?id=1fb5a3d7-7036-4372-8654-37de90e9f423&amp;table=block" alt="Exam results of different students" /></p>
<p>Exam results of different students</p>
</blockquote>
<p>Similarly, there are many instances where the entry in each row is one of the few available options. For example:</p>
<ul>
<li><p><strong>Binary data</strong>: <em>True</em> or <em>False</em>, <em>Yes</em> or <em>No</em>, <em>0</em> or <em>1</em></p>
</li>
<li><p><strong>Exam grades</strong>: <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>, and <em>F</em></p>
</li>
<li><p><strong>Laptop brands</strong>: Asus, Lenovo, Macbook, Dell, IBM, etc.</p>
</li>
</ul>
<blockquote>
<p>⚠️ The entry for a categorical data record can only be one of the available options. For example, it can be either True or False, but not both.</p>
</blockquote>
<p>Now there are 2 important types of categorical data as well, and they are:</p>
<h5 id="heading-211-nominal-data"><strong>2.1.1 — Nominal data</strong></h5>
<blockquote>
<p>💡 Type of categorical data that has <strong>no internal order</strong> or precedence amongst the different categories. The categories <strong>cannot be ranked</strong> one over the other.</p>
</blockquote>
<p>For example, in binary data like True or False, Male or Female, one category is not more important than the other.</p>
<blockquote>
<p>⚠️ There can be some <strong>exceptions</strong> here as well, refer to the upcoming exercise section of this article for an explanation.</p>
</blockquote>
<p>Another example would be subjects like English, Mathematics, Science, History, etc. As long as they carry equal weightage, one cannot have more importance than the other.</p>
<h5 id="heading-212-ordinal-data"><strong>2.1.2 — Ordinal data</strong></h5>
<p>Here the categories have an inherent order, and that order matters.</p>
<p>For example, grades in exam results can be ordered as <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>, &amp; <em>F</em>, from higher rank to lower. Another simple ranking could be in the cloth sizes, which may range from <em>XS</em>, <em>S</em>, <em>M</em>, <em>L</em>, to <em>XL</em>.</p>
<blockquote>
<p>⚠️ <strong>SPOILER ALERT! 🤖</strong> The knowledge of Nominal and Ordinal datatypes becomes very critical during encoding for machine learning problems.</p>
</blockquote>
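<p>In pandas, the nominal/ordinal distinction can be made explicit with an ordered <code>Categorical</code>. A small sketch using the cloth-sizes example:</p>

```python
import pandas as pd

# Ordinal data: clothing sizes have a meaningful internal order
sizes = pd.Categorical(
    ["M", "XS", "L", "S", "XL"],
    categories=["XS", "S", "M", "L", "XL"],
    ordered=True,
)

# The declared order permits comparisons that would be
# meaningless for nominal data
print(sizes.min(), sizes.max())  # XS XL
```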
<h4 id="heading-22-numerical-data">2.2 — Numerical data</h4>
<blockquote>
<p>💡 Unlike categorical data, which takes one of a fixed set of labels, <strong>numerical data</strong> is measured on a numeric scale and can have any numerical value.</p>
</blockquote>
<p>It can be either integers or real numbers. For example, the students’ marks in exams, the speed of a car, the length of a video, height, weight, etc.</p>
<p>Again, there are two major types of Numerical data, and they are:</p>
<h5 id="heading-221-discrete-data"><strong>2.2.1 — Discrete data</strong></h5>
<blockquote>
<p>💡 When the data records can be counted &amp; expressed only in <strong>whole numbers</strong>, it is called <strong>discrete data</strong>.</p>
</blockquote>
<p>For example, the number of children in a class, the number of cars owned by a person, the number of working days in a month, and many more.</p>
<h5 id="heading-222-continuous-data"><strong>2.2.2 — Continuous data</strong></h5>
<blockquote>
<p>💡 When the data records can take infinitely many values, expressed as <strong>real numbers</strong> to arbitrarily many decimal places, the data is known as continuous data.</p>
</blockquote>
<p>For example, exact height, weight, and constants like <strong>π</strong> cannot be recorded with complete accuracy in a fixed number of decimal places.</p>
<h2 id="heading-some-special-data-types">Some special data types</h2>
<p>Apart from the above, there are a few other important data types that you should know.</p>
<h3 id="heading-1-time-series"><strong>1. Time series</strong></h3>
<blockquote>
<p>💡 Anything measured over time is <strong>time series</strong>.</p>
</blockquote>
<p>For example, daily or monthly stock prices, daily weather, hourly sea level, speed of a vehicle at every minute, etc.</p>
<p>More often than not, time series is structured data.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fe586a869-7c3e-4139-a342-0d45bc8ca62d%2F1_i8AjCOStPZaCewhKgEUWkg.png?id=0da04146-9fbb-4349-8f81-7c2b1147bfe1&amp;table=block" alt="An example of time-series data. Records of product demand, precipitation, &amp; temperature over the years." /></p>
<p>An example of time-series data. Records of product demand, precipitation, &amp; temperature over the years.</p>
</blockquote>
<h3 id="heading-2-text-data"><strong>2. Text data</strong></h3>
<blockquote>
<p>💡 Text data usually consists of documents containing words, sentences, and paragraphs of free-flowing text.</p>
</blockquote>
<p>It can be in any language. And is mostly unstructured.</p>
<p>A good example is the product reviews on Amazon, which can be utilized for sentiment classification. Or email contents that enable machine learning algorithms to detect spam emails from the rest.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fdb8b2aac-0529-48a1-b8ab-dbf51261aeee%2F1_DlRjl8PNH7-m8CrLPIIt7Q.png?id=9c22080f-418e-460c-a31d-db6b5f26f341&amp;table=block" alt="After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, &amp; Promotions based on the text data in the email contents." /></p>
<p>After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, &amp; Promotions based on the text data in the email contents.</p>
</blockquote>
<h3 id="heading-3-image-andamp-video-data"><strong>3. Image &amp; Video data</strong></h3>
<blockquote>
<p>💡 This is the graphic or pictorial data like images or drawings.</p>
</blockquote>
<p>This finds a great use case in object detection, self-driving cars, etc. It is a form of unstructured data as well.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fadec1039-159b-441b-b505-b53172f68a73%2F0_tizHOacaczfaol7u.jpg?id=749d1100-a217-4315-a867-a1dacbde1b91&amp;table=block" alt="Object detection from live footage." /></p>
<p>Object detection from live footage.</p>
</blockquote>
<h3 id="heading-4-audio-data">4. Audio data</h3>
<blockquote>
<p>💡 Any information recorded in the audio format is data.</p>
</blockquote>
<p>Another popular data format is audio, which is widely used in machine learning applications. Apps like <strong><em>Shazam</em></strong> are a great example.</p>
<p>Be it a song, a speech, or an audiobook, any information recorded in the audio format can be used as data.</p>
<h2 id="heading-assignment">Assignment</h2>
<p>Now that we have quickly understood so many different concepts, let’s strengthen our understanding with these fun exercises.</p>
<h3 id="heading-assignment-1"><strong>Assignment 1</strong></h3>
<p>Let’s classify each variable in the Titanic dataset into its correct data type.</p>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fa2f6872e-2d9e-487d-b9d7-278749de2e50%2FUntitled.png?id=f78c9c8b-eaab-4b74-b6d8-5dfaab22f298&amp;table=block" alt="Titanic dataset" /></p>
<p>Titanic dataset [<a target="_blank" href="https://www.kaggle.com/c/titanic/data">Source</a>]</p>
</blockquote>
<ul>
<li><p><strong><em>PassengerId</em></strong>: The unique Id for each passenger. This is numerical data and is discrete.</p>
</li>
<li><p><strong><em>Survived</em></strong>: Passenger survived or not, 0 = No, 1 = Yes. This is categorical data and nominal.</p>
</li>
<li><p><strong><em>Pclass</em></strong>: This is the ticket type, class 1 = 1st, 2 = 2nd, 3 = 3rd. This is categorical data and ordinal.</p>
</li>
<li><p><strong><em>Name</em></strong>: Name of the passenger. This is text data.</p>
</li>
<li><p><strong><em>Sex</em></strong>: Gender of the passenger. This is binary categorical data and ordinal.</p>
</li>
</ul>
<blockquote>
<p>⚠️ <strong>IMPORTANT</strong></p>
<p>In some cases, even data records like sex or gender can be ordinal, and this is one such case. This is because the captain of the ship explicitly issued an order for women and children to be saved first. As a result, the survival rate for women was three times higher than for men [<a target="_blank" href="https://www.newscientist.com/article/dn22119-sinking-the-titanic-women-and-children-first-myth/">Source</a>]. Therefore, while modeling, the algorithm can give a slightly higher preference to females while predicting the survival status.</p>
</blockquote>
<p>Similarly, we can analyze the rest of the variables of this dataset. I will leave this exercise for you to complete.</p>
<h3 id="heading-assignment-2"><strong>Assignment 2</strong></h3>
<p>Consider a video being streamed on YouTube. Multiple different data points are being recorded in real time simultaneously.</p>
<p>Some of them are video, audio, images, resolutions, time stamps, total people watching at each timestamp, likes, dislikes, text data from the continuous chat, number of comments, transactions, engagement, and much more.</p>
<p>Your task is to classify each feature being recorded into its correct class. Do share your observations in the comments section.</p>
<h2 id="heading-summary">Summary</h2>
<p>So let’s summarize what we discussed today.</p>
<ol>
<li><p>Data is everywhere and every recorded information is data</p>
</li>
<li><p>Data lies at the heart of any statistical analysis</p>
</li>
<li><p>Different types of data, a quick breakdown is below.</p>
</li>
</ol>
<blockquote>
<p><img src="https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F28c5a044-c1b7-4048-afc0-10e535052ad3%2F1_8yjqvgia2bFEpkbgc2hyGA.png?id=381caa63-4fa1-4cfe-a827-b44d7206f8b3&amp;table=block" alt="Different types of data" /></p>
<p>Different types of data</p>
</blockquote>
<h2 id="heading-how-a-machine-reads-data">How does a machine read data?</h2>
<p>Finally, having understood the foundations and the different types of data, let's look at how machines read and interpret data, as opposed to humans.</p>
<p>Foundationally, no matter the type of data, the machines can only ingest 0s and 1s. Therefore, for us to train a machine learning model on our data, we must convert it into 0s and 1s.</p>
<p>Be it images, text, audio, or any other data type, everything has to be converted into 0s and 1s (or numeric) before feeding it to the machine.</p>
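<p>For instance, even a short piece of text reaches the machine as numbers and, ultimately, bits:</p>

```python
# Every character maps to a number (its Unicode code point),
# and every number has a binary representation of 0s and 1s.
text = "Hi"
codes = [ord(ch) for ch in text]          # [72, 105]
bits = [format(c, "08b") for c in codes]  # ['01001000', '01101001']
print(codes, bits)
```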
<blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681126358562/626263b5-2cc5-4ad5-a2db-bd6ff580c051.png" alt class="image--center mx-auto" /></p>
<p>Converting image to 0s and 1s</p>
</blockquote>
<p>NOTE: We will look at multiple examples in the coming discussions where we build machine-learning models on different types of data.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I am certain that this discussion will help you better understand your data at a more fundamental level, which will refine your analysis.</p>
<p>Feel free to share your feedback or queries in the comments below.</p>
]]></content:encoded></item></channel></rss>