Use Datasets for Building Models
Provide training and validation data to the model build process using Pandas DataFrames or DatasetGenerator.
The model.build() method requires data to train and evaluate the machine learning model. You can provide this data in two main ways:
- List of Pandas DataFrames: The simplest way for tabular data.
- List of plexe.DatasetGenerator objects: Useful for generating synthetic data or augmenting existing datasets.
Using Pandas DataFrames
If your data is already loaded into Pandas DataFrames, pass a list containing these DataFrames to the datasets argument of model.build().
Plexe internally assigns default names (like dataset_0, dataset_1) to these DataFrames and uses them during the build process. If schemas are not provided, they are inferred from these DataFrames.
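The following is a minimal sketch of this workflow, not an excerpt from the Plexe codebase. It assumes plexe.Model accepts an intent argument and that model.build() accepts a provider string alongside datasets. The helper names build_from_dataframes and default_dataset_names are hypothetical, and the naming helper only mirrors the dataset_0, dataset_1 pattern described above.

```python
from typing import Any, List


def default_dataset_names(count: int) -> List[str]:
    """Illustrative only: positional names in the dataset_0, dataset_1 style."""
    return [f"dataset_{i}" for i in range(count)]


def build_from_dataframes(frames: List[Any], intent: str, provider: str) -> Any:
    """Build a model from a list of pandas DataFrames (hypothetical helper)."""
    if not frames:
        raise ValueError("pass at least one DataFrame")

    # Imported lazily so this sketch loads even where plexe is not installed.
    import plexe

    # Assumption: no schemas are passed, so they are inferred from the DataFrames.
    model = plexe.Model(intent=intent)
    # Assumption: build() takes the provider as a keyword argument.
    model.build(datasets=frames, provider=provider)
    return model


# Example call (needs plexe, pandas, and credentials for the chosen provider):
#   model = build_from_dataframes(
#       [train_df, validation_df],
#       intent="Predict house prices from property features",
#       provider="openai/gpt-4o-mini",  # placeholder provider string
#   )
```

The order of the list matters only in that it determines which default name each DataFrame would receive.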
Using DatasetGenerator
The DatasetGenerator class lets you define the requirements for a dataset, then generate synthetic data or augment existing data using an LLM provider.
Use Case 1: Generating purely synthetic data
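A sketch of this use case, under stated assumptions: DatasetGenerator is taken to accept description, provider, and schema arguments and to expose a generate(num_samples) method. Check these against the Plexe version you have installed. The helper name generate_synthetic_dataset is hypothetical.

```python
from typing import Any, Dict


def generate_synthetic_dataset(
    description: str,
    schema: Dict[str, type],
    provider: str,
    num_samples: int,
) -> Any:
    """Create a DatasetGenerator and fill it with synthetic rows (hypothetical helper)."""
    if num_samples <= 0:
        raise ValueError("num_samples must be a positive integer")

    # Imported lazily so this sketch loads even where plexe is not installed.
    import plexe

    # Assumption: the constructor takes description, provider and schema keywords.
    generator = plexe.DatasetGenerator(
        description=description,
        provider=provider,
        schema=schema,
    )
    # Assumption: generate() asks the LLM provider for this many rows.
    generator.generate(num_samples)
    return generator


# Example call (needs plexe and credentials for the chosen provider):
#   synthetic = generate_synthetic_dataset(
#       description="House listings with area, bedrooms and sale price",
#       schema={"area": float, "bedrooms": int, "price": float},
#       provider="openai/gpt-4o-mini",  # placeholder provider string
#       num_samples=200,
#   )
#   model.build(datasets=[synthetic], provider="openai/gpt-4o-mini")
```

The returned generator is passed to model.build() in the datasets list, in the same position a DataFrame would occupy.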
Use Case 2: Augmenting existing data
You can provide an existing DataFrame to DatasetGenerator and potentially use it as a base for generating more samples (although the exact augmentation mechanism within build still needs confirmation).
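A tentative sketch of this use case follows. It assumes DatasetGenerator accepts the existing DataFrame through a data keyword and that generate() then adds rows based on it. Both points are assumptions rather than confirmed behavior, in line with the caveat above. The helper name augment_dataset is hypothetical.

```python
from typing import Any


def augment_dataset(
    existing_df: Any,
    description: str,
    provider: str,
    extra_samples: int,
) -> Any:
    """Wrap an existing DataFrame and request additional rows (hypothetical helper)."""
    if existing_df is None:
        raise ValueError("existing_df is required for augmentation")
    if extra_samples <= 0:
        raise ValueError("extra_samples must be a positive integer")

    # Imported lazily so this sketch loads even where plexe is not installed.
    import plexe

    # Assumption: the data keyword seeds the generator with the existing rows.
    generator = plexe.DatasetGenerator(
        description=description,
        provider=provider,
        data=existing_df,
    )
    # Assumption: generate() adds rows on top of the seeded data.
    generator.generate(extra_samples)
    return generator


# Example call (needs plexe, pandas, and credentials for the chosen provider):
#   augmented = augment_dataset(
#       existing_df=train_df,
#       description="House listings with area, bedrooms and sale price",
#       provider="openai/gpt-4o-mini",  # placeholder provider string
#       extra_samples=100,
#   )
#   model.build(datasets=[augmented], provider="openai/gpt-4o-mini")
```

Inspect the generated rows before training on them, since how they are combined with the original data depends on the installed Plexe version.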
Choose the method that best fits how your data is structured and whether you need synthetic data generation capabilities.