How to split your dataset into train, test, and validation sets?

In a single line of code!


AI Engineer by the day . . . Batman by Knight!

Introduction

If you’ve been using the train_test_split method by sklearn to create the train, test, and validation datasets, then I know your pain.

Splitting datasets into the test, train, and validation datasets

While sklearn certainly provides a way to achieve our objective, it is a long-drawn-out procedure: we have to call the function twice, adjusting the split ratio at each step.
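For reference, the two-step sklearn approach might look like the minimal sketch below. The toy data, the 80/10/10 target ratio, and the variable names are assumptions for illustration and are not taken from the original notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 rows, one feature, binary target.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First call: hold out 20% of the rows; 80% stays in the training set.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second call: split the held-out 20% in half, so validation and test
# each end up with 10% of the original data. Note that the ratio had to
# be recomputed (0.5 of the remainder, not 0.1 of the whole).
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))
```

The second ratio is where mistakes tend to creep in: it is expressed relative to the held-out portion, so it changes whenever the first ratio does.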

But rejoice, fast_ml is here!

It offers a straightforward and to-the-point method to achieve the three different datasets with a single line of code.

It is the train_valid_test_split method!

It not only splits the data as we require but also separates the dependent variable y from the independent variables X in the same line of code.

Code walkthrough

Let’s check out how it’s done (notebook)!

Step 1: Install the fast_ml library and import the necessary packages and methods

Step 2: Load the dataset into a pandas data frame.

Step 3: Split the dataset

Once the data is loaded and ready to split, simply call the train_valid_test_split method and pass the dataset with the supporting parameters as below.

The datasets have been successfully split into train, test, and validation datasets. 🎉

💡 NOTE
The split datasets retain their original index; resetting it is optional.
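If a clean 0-based index is wanted, pandas' reset_index can be applied to each split. The sketch below uses a small hand-made subset in place of a real split, so the frame and its name are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"feature": [10, 20, 30, 40, 50]})

# Stand-in for one split: rows picked out of order keep their old labels.
X_valid = df.iloc[[3, 0, 4]]
print(X_valid.index.tolist())  # original labels: [3, 0, 4]

# drop=True discards the old index instead of adding it as a column.
X_valid = X_valid.reset_index(drop=True)
print(X_valid.index.tolist())  # fresh labels: [0, 1, 2]
```

Keeping the original index can be handy for tracing rows back to the source data, which is one reason to treat the reset as optional.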

You can now proceed with your modeling.

Conclusion

Thanks to the team at fast_ml, the long-drawn-out task of splitting our dataset into independent and dependent features and then into training, testing, and validation datasets has been condensed into a single line of code. ⚡

You can find this notebook here:

Let me know how you liked this quick article in the comments below, and feel free to reach out!

Data Fundamentals

Part 2 of 3

This series will include articles that lay a strong foundation for any data enthusiast or working professional. This series aims at filling the gaps that a self-taught engineer may have.
