In machine learning tasks involving image data, it’s crucial to split your dataset into separate test, training, and validation sets. This splitting ensures that your model is trained on one set of data, evaluated on a different set (validation), and finally tested on a completely unseen set of data (test). Manually splitting large image datasets can be a tedious and error-prone process, especially when dealing with thousands or millions of images.
Fortunately, the Python library split-folders provides a convenient solution for automatically splitting image folders into test, training, and validation sets with customizable ratios. In this article, we’ll explore how to use this library and understand the process of splitting image folders using a practical example.
Table of Contents
Split Image Folders into Train, Validation, and Test Sets with Python’s split-folders Library
Step 1: Setting up a Virtual Environment
If you’re using a virtual environment, make sure to activate it before proceeding. To activate your virtual environment:
On Unix-based systems (e.g., macOS, Linux):
source env/bin/activate
On Windows:
env\Scripts\activate
You should see (env) preceding your command prompt, indicating that the virtual environment is active.
Step 2: Installing split-folders
With your virtual environment activated, you can now install the split-folders library. You can do this using pip, Python’s package installer:
pip install split-folders
Alternatively, if you’re using Anaconda or Miniconda, you can install the library via conda:
conda install -c conda-forge split-folders
Step 3: Understanding the split-folders Command
The split-folders command is structured as follows:
Let’s break down the different components of this command:
--output OUTPUT_DIR: This argument specifies the directory where the split datasets will be stored. In our example, we used --output dataset.
--ratio TRAIN_RATIO VAL_RATIO TEST_RATIO: This argument specifies the ratios for splitting the data into training, validation, and test sets, respectively. In our example, we used --ratio .7 .1 .2, which means 70% of the data will be used for training, 10% for validation, and 20% for testing.
--: This double hyphen separates the arguments from the input directory path.
INPUT_DIR: This is the path to the directory containing the images you want to split.
This command splits the images in the PlantVillage directory into three sets: 70% for training, 10% for validation, and 20% for testing. The split datasets are stored in the dataset directory.
Step 4: Executing the Command
When you execute the split-folders command, you should see an output similar to the following:
This output indicates that the library is copying the files from the input directory (PlantVillage) and splitting them into the specified sets.
Step 5: Exploring the Output
After the command has finished executing, you can explore the output directory (dataset in our example) and its subdirectories:
dataset
├── train
├── val
└── test
The train directory contains approximately 70% of the original images (in our example, around 1,506 files), the val directory contains approximately 10% (around 215 files), and the test directory contains approximately 20% (around 430 files).
Customizing the Command
The beauty of the split-folders command lies in its flexibility. You can customize the ratios for splitting the data according to your needs. For example, if you want a 60-20-20 split for train, validation, and test sets, respectively, you would use the following command:
This command will split the images in the my_image_folder directory into three sets: 60% for training (my_dataset/train), 20% for validation (my_dataset/val), and 20% for testing (my_dataset/test).
You can also specify the output directory name using the --output argument, as shown in the example above.
Conclusion
The split-folders library is a powerful tool for splitting image datasets into test, training, and validation sets with customizable ratios. By following the steps outlined in this article, you can easily split your image folders and prepare your data for machine learning tasks.
Explore the split-folders library further and consider using it in your own machine learning projects involving image data.