Datasets Reference
Detailed reference documentation for dataset handling in the Plexe Python library.
Plexe provides flexible options for working with datasets. This reference documents how to prepare, provide, and generate data for model building.
Supported Dataset Types
Plexe accepts two types of objects in the datasets parameter of model.build():
- Pandas DataFrames: For providing tabular data directly
- DatasetGenerator objects: For generating synthetic data or augmenting existing data
Using Pandas DataFrames
Pandas DataFrames are the most common way to provide data to Plexe.
Basic Usage
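A minimal sketch of the basic pattern. It assumes the library is imported as plexe and that a model is created with plexe.Model(intent=...); neither name is stated in this reference, and the column names are purely illustrative. The build call is wrapped in a function so the DataFrame can be prepared and inspected without an LLM provider configured.

```python
import pandas as pd

# Illustrative tabular data; the column names are made up for this sketch.
housing_df = pd.DataFrame(
    {
        "square_feet": [850, 1200, 1500, 2100],
        "bedrooms": [2, 3, 3, 4],
        "price": [210_000, 295_000, 340_000, 480_000],
    }
)


def build_price_model(df: pd.DataFrame):
    """Pass a single DataFrame to model.build() through the datasets parameter."""
    # Assumed import name; imported lazily so the rest of the sketch
    # runs without the library or an API key.
    import plexe

    # Assumption: the constructor accepts a natural-language intent.
    model = plexe.Model(intent="Predict the house price from size and bedroom count")
    model.build(datasets=[df])
    return model
```

Calling build_price_model(housing_df) would start a real build, so it is left uncalled here.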
Multiple DataFrames
You can provide multiple DataFrames for more complex scenarios:
When multiple DataFrames are provided, Plexe’s ML Engineer agent will attempt to determine relationships between them based on column names and data types.
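As a hedged illustration, the sketch below prepares two related DataFrames that share a customer_id column with the same data type, which is the kind of signal the agent is described as using. The tables, the intent text, and the plexe.Model constructor are assumptions made for this example.

```python
import pandas as pd

# Two hypothetical tables linked by a shared key column.
customers_df = pd.DataFrame(
    {"customer_id": [1, 2, 3], "signup_year": [2019, 2021, 2022]}
)
orders_df = pd.DataFrame(
    {
        "order_id": [10, 11, 12, 13],
        "customer_id": [1, 1, 2, 3],
        "amount": [25.0, 40.5, 12.0, 99.9],
    }
)

# Columns present in both tables; consistent names and dtypes make the
# relationship easier to detect.
shared_columns = sorted(set(customers_df.columns) & set(orders_df.columns))


def build_spend_model(frames):
    """Pass several DataFrames to model.build() in one datasets list."""
    import plexe  # assumed import name, imported lazily

    # Assumption: the constructor accepts a natural-language intent.
    model = plexe.Model(intent="Predict total order amount per customer")
    model.build(datasets=list(frames))
    return model
```

Keeping key columns identically named and typed across tables is a design choice on the caller's side; how the agent actually joins them is not specified in this reference.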
DataFrame Requirements
While Plexe is flexible, following these guidelines helps ensure optimal results:
- Clean Data: Remove or impute missing values when possible
- Appropriate Types: Ensure columns have appropriate data types
- Meaningful Names: Use descriptive column names
- Reasonable Size: Keep DataFrames under a few million rows for optimal performance
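The guidelines above can be applied with plain pandas before the data reaches Plexe. The following is a sketch with a made-up DataFrame; the imputation choices (median for numbers, an "unknown" label for categories) are one reasonable option, not a Plexe requirement.

```python
import pandas as pd

# Raw data with unclear names, missing values, and dates stored as strings.
raw_df = pd.DataFrame(
    {
        "Cust Age": [34, None, 51, 29],
        "plan": ["basic", "pro", None, "basic"],
        "signup": ["2023-01-05", "2023-02-11", "2023-02-28", "2023-03-17"],
    }
)


def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: descriptive names, no missing values, proper dtypes."""
    # Meaningful names
    cleaned = df.rename(
        columns={
            "Cust Age": "customer_age",
            "plan": "subscription_plan",
            "signup": "signup_date",
        }
    )
    # Clean data: impute missing values
    cleaned["customer_age"] = cleaned["customer_age"].fillna(
        cleaned["customer_age"].median()
    )
    cleaned["subscription_plan"] = cleaned["subscription_plan"].fillna("unknown")
    # Appropriate types
    cleaned["customer_age"] = cleaned["customer_age"].astype("int64")
    cleaned["subscription_plan"] = cleaned["subscription_plan"].astype("category")
    cleaned["signup_date"] = pd.to_datetime(cleaned["signup_date"])
    return cleaned


clean_df = prepare_dataframe(raw_df)
```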
DatasetGenerator
The DatasetGenerator class allows you to generate synthetic data or augment existing data using LLMs.
Class Definition
Parameter | Type | Description
---|---|---
description | str | Human-readable description of the dataset
provider | str | LLM provider used for synthetic data generation
schema | `Type[BaseModel] \| Dict[str, type]` | The schema the data should match, if any. Can be a Pydantic model or a dictionary.
data | pd.DataFrame | A dataset of real data on which to base the generation, if available
Generating Synthetic Data
To generate completely synthetic data:
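A minimal sketch using the description, provider, and schema parameters from the table above. It assumes the class is exposed as plexe.DatasetGenerator and that the provider is given as a string such as "openai/gpt-4o"; both are assumptions, as are the field names.

```python
# Dictionary schema: field name -> Python type (the Dict[str, type] form).
churn_schema = {
    "age": int,
    "monthly_spend": float,
    "months_active": int,
    "churned": bool,
}

churn_description = (
    "Synthetic customer records for a subscription service, one row per customer. "
    "Customers with fewer months_active are more likely to have churned."
)


def make_synthetic_generator():
    """Create a generator with no real data, so all rows are synthetic."""
    import plexe  # assumed import name, imported lazily

    return plexe.DatasetGenerator(
        description=churn_description,
        provider="openai/gpt-4o",  # illustrative value; the expected format is an assumption
        schema=churn_schema,
    )
```

The returned object would then be placed in the datasets list passed to model.build().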
Augmenting Existing Data
To augment an existing but limited dataset:
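A hedged sketch of augmentation: a small DataFrame of real rows is passed through the data parameter so generation can be based on it. The plexe.DatasetGenerator name, the provider string, and the sample data are assumptions made for illustration.

```python
import pandas as pd

# A small sample of real observations (hypothetical values).
seed_df = pd.DataFrame(
    {
        "age": [34, 51, 29],
        "monthly_spend": [59.9, 120.0, 19.5],
        "churned": [False, False, True],
    }
)


def make_augmenting_generator(real_data: pd.DataFrame):
    """Create a generator that bases new rows on an existing DataFrame."""
    import plexe  # assumed import name, imported lazily

    return plexe.DatasetGenerator(
        description="Customer churn records resembling the provided sample",
        provider="openai/gpt-4o",  # illustrative value; the expected format is an assumption
        data=real_data,
    )
```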
Generation Parameters
The generation process is controlled internally based on the description and schema provided. The description should give clear guidance about:
- The general nature of the data
- Important patterns or correlations
- Distributions of values
- Constraints beyond what’s defined in the schema
- Relationships between fields
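One way to make sure a description covers each of the points above is to draft it in parts and join them. This is only a writing aid; the dataset, the wording, and the part names are hypothetical.

```python
# One entry per point of guidance; the keys are labels for the author only.
description_parts = {
    "nature": "Customer records for a subscription video service, one row per customer.",
    "patterns": "Churn is more common among customers on the basic plan.",
    "distributions": "Ages are roughly bell-shaped around 35; monthly spend is right-skewed.",
    "constraints": "months_active never exceeds 120, and monthly_spend is zero only for trial accounts.",
    "relationships": "monthly_spend increases with plan tier, and longer tenure lowers churn risk.",
}

# The final text passed as the description argument.
description = " ".join(description_parts.values())
```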
Dataset Schema Details
When defining a schema for the DatasetGenerator, use Pydantic’s Field attributes to provide rich information:
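A sketch of such a schema using standard Pydantic features (description text plus numeric bounds). The model and its fields are hypothetical, and how much of this metadata the generator forwards to the LLM is not specified in this reference.

```python
from pydantic import BaseModel, Field, ValidationError


class CustomerRecord(BaseModel):
    """Hypothetical schema for a synthetic churn dataset."""

    age: int = Field(..., ge=18, le=95, description="Customer age in years")
    monthly_spend: float = Field(..., ge=0, description="Average monthly spend in USD")
    plan: str = Field(..., description="Subscription tier: basic, pro, or enterprise")
    churned: bool = Field(..., description="True if the customer cancelled within 90 days")


# The same model validates individual records, which is handy for
# spot-checking rows against the declared constraints.
sample = CustomerRecord(age=42, monthly_spend=59.9, plan="pro", churned=False)
```

The class itself (not an instance) is what would be passed as the schema argument, matching the Type[BaseModel] form in the parameter table.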
Combining DataFrame and Generator
You can use both types together in the datasets parameter:
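A hedged sketch of mixing the two: one real DataFrame and one generator in the same list. The plexe.Model and plexe.DatasetGenerator names, the provider string, and the data are assumptions; the generator schema is chosen to mirror the real columns so both sources describe the same kind of record.

```python
import pandas as pd

# Real observations (hypothetical values).
real_df = pd.DataFrame(
    {
        "age": [34, 51, 29],
        "monthly_spend": [59.9, 120.0, 19.5],
        "churned": [False, False, True],
    }
)

# Schema for the synthetic part, matching the real columns.
synthetic_schema = {"age": int, "monthly_spend": float, "churned": bool}


def build_with_real_and_synthetic(df: pd.DataFrame):
    """Pass a DataFrame and a DatasetGenerator together to model.build()."""
    import plexe  # assumed import name, imported lazily

    generator = plexe.DatasetGenerator(
        description="Additional churn records covering customers over 60",
        provider="openai/gpt-4o",  # illustrative value; the expected format is an assumption
        schema=synthetic_schema,
    )
    # Assumption: the constructor accepts a natural-language intent.
    model = plexe.Model(intent="Predict whether a customer will churn")
    model.build(datasets=[df, generator])
    return model
```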
Data Conversion Internals
Internally, Plexe performs several steps when working with data:
- DataFrame Validation: Ensures DataFrames have the expected structure
- Schema Inference: If not provided explicitly, infers schemas from the data
- Type Conversion: Ensures data types match schema requirements
- Data Splitting: Automatically splits data for training and validation
- Synthetic Generation: Executes when DatasetGenerator objects are provided
- Feature Engineering: The ML Engineer agent determines appropriate transformations based on the data
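To make two of these steps concrete, the sketch below shows what DataFrame validation and a train/validation split can look like in plain pandas. It is an illustration of the ideas only, not Plexe's actual implementation; the function names, checks, and the 80/20 ratio are assumptions.

```python
import pandas as pd


def validate_frame(df: pd.DataFrame) -> None:
    """Raise if the DataFrame lacks the basic structure a build would need."""
    if df.empty:
        raise ValueError("DataFrame has no rows")
    if df.columns.duplicated().any():
        raise ValueError("DataFrame has duplicate column names")


def split_frame(df: pd.DataFrame, validation_fraction: float = 0.2, seed: int = 0):
    """Randomly hold out a fraction of rows for validation."""
    validation = df.sample(frac=validation_fraction, random_state=seed)
    training = df.drop(index=validation.index)
    return training, validation


df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})
validate_frame(df)
train_df, val_df = split_frame(df)
```

Fixing the random seed keeps the split reproducible between runs, which makes validation metrics comparable.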
Schema Inference
If input/output schemas aren’t explicitly provided, but datasets are, Plexe attempts to:
- Determine data types from the DataFrame columns
- Identify the likely target (output) variable(s) based on the model intent
- Classify remaining columns as input features
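The three steps can be sketched in plain pandas as follows. This is an illustration, not Plexe's inference logic: in particular, the target columns are passed in explicitly here, whereas Plexe is described as identifying them from the model intent. The function names are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_python_type(series: pd.Series) -> type:
    """Map a column's dtype to a basic Python type."""
    dtype = series.dtype
    if ptypes.is_bool_dtype(dtype):
        return bool
    if ptypes.is_integer_dtype(dtype):
        return int
    if ptypes.is_float_dtype(dtype):
        return float
    return str  # fallback for text and anything else


def infer_schemas(df: pd.DataFrame, target_columns):
    """Split inferred column types into input and output schemas."""
    targets = set(target_columns)
    full = {name: infer_python_type(df[name]) for name in df.columns}
    input_schema = {n: t for n, t in full.items() if n not in targets}
    output_schema = {n: t for n, t in full.items() if n in targets}
    return input_schema, output_schema


df = pd.DataFrame(
    {
        "age": [34, 51],
        "monthly_spend": [59.9, 120.0],
        "plan": ["pro", "basic"],
        "churned": [False, True],
    }
)
input_schema, output_schema = infer_schemas(df, target_columns=["churned"])
```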
Best Practices
- Provide Clear Schemas: Explicit schemas help guide the model building process
- Clean Your Data: Remove irrelevant columns, handle missing values
- Use Descriptive Names: Clear column names help Plexe understand the data
- Include Domain Knowledge: Add rich descriptions to schema fields
- Combine Approaches: Use real data when available and synthetic data when needed
Performance Considerations
- Memory Usage: Large DataFrames consume more memory
- Generation Time: Synthetic data generation can take time, especially for complex schemas
- LLM Costs: Data generation involves LLM API calls, which may incur costs
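For the memory point, pandas can report a DataFrame's footprint and often reduce it by using narrower dtypes. The sketch below uses made-up data; whether Plexe preserves these dtypes internally is not stated in this reference, and downcasting floats can lose precision, so check that it is acceptable for your data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": range(1000),
        "country": ["DE", "FR", "US", "JP"] * 250,
        "score": [0.5] * 1000,
    }
)

# deep=True also counts the memory held by Python string objects.
bytes_before = int(df.memory_usage(deep=True).sum())

compact = df.copy()
compact["user_id"] = pd.to_numeric(compact["user_id"], downcast="unsigned")
compact["country"] = compact["country"].astype("category")  # few distinct values
compact["score"] = pd.to_numeric(compact["score"], downcast="float")

bytes_after = int(compact.memory_usage(deep=True).sum())
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```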
By leveraging these options for dataset handling, you can provide Plexe with the data it needs to build effective machine learning models, even in scenarios where limited data is available.