Datasets and Schemas
How Plexe handles data inputs and defines the structure of ML models.
In Plexe, datasets and schemas work together to define what data your model consumes and what outputs it produces. Understanding how these components interact is key to creating effective models.
Datasets
Datasets provide the training data for your models. Plexe offers flexible options for working with data:
Pandas DataFrames
The most common way to provide data to Plexe is through Pandas DataFrames, which offer a tabular structure that’s ideal for machine learning:
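For example, building a model from a single DataFrame might look like the sketch below (the CSV path, intent text, and provider string are placeholders, and the exact `Model` and `build()` arguments should be checked against your installed Plexe version):

```python
import pandas as pd
import plexe

# Load tabular training data into a DataFrame
df = pd.read_csv("customer_churn.csv")  # placeholder path

# Pass one or more DataFrames to build() as a list
model = plexe.Model(intent="Predict whether a customer will churn")
model.build(
    datasets=[df],
    provider="openai/gpt-4o-mini",  # LLM provider used by the agent
)
```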
When using DataFrames:
- Plexe works with one or more DataFrames (pass them as a list)
- It automatically analyzes columns to understand data types and identify potential targets
- For multiple DataFrames, Plexe can detect relationships between them based on column names and data
DatasetGenerator
For cases where you have limited or no existing data, you can use the `DatasetGenerator` class to create synthetic training data:
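A rough sketch of constructing a generator is shown below; the argument names (`description`, `provider`, `schema`, `data`) are assumptions based on typical usage, so verify them against the `DatasetGenerator` signature in your version:

```python
from plexe import DatasetGenerator

# Describe the data you need and which LLM should synthesize it
dataset = DatasetGenerator(
    description="Customer records for a churn-prediction task",
    provider="openai/gpt-4o",  # LLM that generates the synthetic rows
    schema={"age": int, "plan": str, "monthly_spend": float, "churned": bool},
    # data=existing_df,        # optionally seed with real data to augment it
)
```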
The `DatasetGenerator`:
- Creates realistic synthetic data based on the schema and description
- Requires a provider specification for the LLM that will generate the data
- Can augment existing data by passing an existing DataFrame in the constructor
- Explicitly generates samples with the `generate()` method before model building (see the sketch below)
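Continuing the sketch above, a typical flow generates samples first and then hands the dataset to the model (the sample count is arbitrary, and whether the generator object can be passed to `build()` directly may differ between Plexe versions):

```python
import plexe

# Explicitly generate 500 synthetic samples before building
dataset.generate(500)

# Use the generated data like any other dataset
model = plexe.Model(intent="Predict whether a customer will churn")
model.build(datasets=[dataset], provider="openai/gpt-4o-mini")
```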
Schemas
Schemas define the structure and validation rules for inputs and outputs of your model. Plexe supports two ways to define schemas:
Dictionary-Based Schemas
The simplest way to define schemas is with Python dictionaries mapping field names to types:
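For example, a sketch of a dictionary-based schema for a house-price model (field names and the intent are illustrative):

```python
import plexe

# Map each field name to a plain Python type
model = plexe.Model(
    intent="Predict house prices from basic property features",
    input_schema={"square_feet": int, "bedrooms": int, "has_garden": bool},
    output_schema={"price": float},
)
```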
Dictionary schemas:
- Are intuitive and quick to define
- Support basic Python types (`int`, `float`, `str`, `bool`)
- Are automatically converted to Pydantic models internally
Pydantic Models
For more complex schemas with validation rules, descriptions, or nested structures, Pydantic models provide greater flexibility:
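A sketch of the same schema expressed as Pydantic models, assuming model classes can be passed wherever a dictionary schema is accepted (field names and constraints are illustrative):

```python
from pydantic import BaseModel, Field
import plexe

class HouseFeatures(BaseModel):
    square_feet: int = Field(gt=0, description="Total interior area in square feet")
    bedrooms: int = Field(ge=0, description="Number of bedrooms")
    has_garden: bool = Field(description="Whether the property has a garden")

class PricePrediction(BaseModel):
    price: float = Field(description="Estimated sale price in USD")

model = plexe.Model(
    intent="Predict house prices from basic property features",
    input_schema=HouseFeatures,
    output_schema=PricePrediction,
)
```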
Pydantic schemas:
- Provide rich validation (min/max values, regex patterns, etc.)
- Support field descriptions that help the ML Engineer agent understand the data
- Handle complex nested structures and optional fields
- Enable detailed documentation
Schema Inference
If you don’t provide explicit schemas but do provide datasets, Plexe will attempt to infer the schemas:
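In other words, a build call can omit the schemas entirely, as in the sketch below (the path and column names are placeholders):

```python
import pandas as pd
import plexe

# e.g. columns: square_feet, bedrooms, has_garden, price
df = pd.read_csv("houses.csv")  # placeholder path

# No input_schema or output_schema: Plexe infers both from the data and the intent
model = plexe.Model(intent="Predict house prices from property features")
model.build(datasets=[df], provider="openai/gpt-4o-mini")
```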
During schema inference:
- Plexe analyzes the DataFrame columns to determine types
- It uses the model intent and column names to identify likely target variables
- Input features and output targets are separated based on this analysis
How Plexe Uses Schemas and Datasets Internally
When you call `model.build()`:
1. Schema Analysis: Plexe examines the provided schemas (or infers them) to understand the data structure
2. Data Exploration: The ML agent explores the datasets to understand patterns, distributions, and relationships
3. Feature Engineering: The agent determines which transformations or feature engineering steps are needed
4. Model Selection: Based on the schemas and data, the agent selects appropriate ML algorithms
5. Training: The model is trained on the provided data
6. Validation: Plexe ensures the trained model adheres to the output schema
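Once the build completes, predictions come back in the shape defined by the output schema. A brief sketch, assuming `predict()` accepts a dictionary of input fields (the values are made up):

```python
# Inputs follow the input schema; the result follows the output schema
prediction = model.predict({"square_feet": 1850, "bedrooms": 3, "has_garden": True})
print(prediction)  # e.g. {"price": 412000.0}
```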
Best Practices
- Provide Explicit Schemas when possible for clarity and more accurate model building
- Use Descriptive Field Names that clearly communicate the meaning of each field
- Add Field Descriptions in Pydantic models to guide the ML agent
- Clean Your Data before passing it to Plexe for better results
- Ensure Representative Data to help Plexe build more accurate models