Datasets and Schemas

In Plexe, datasets and schemas work together to define what data your model consumes and what outputs it produces. Understanding how these components interact is key to creating effective models.

Datasets

Datasets provide the training data for your models. Plexe offers flexible options for working with data:

Pandas DataFrames

The most common way to provide data to Plexe is through Pandas DataFrames, which offer a tabular structure that’s ideal for machine learning:

import pandas as pd
import plexe

# Load data into a DataFrame
df = pd.read_csv("customer_data.csv")

# Create a model and build it with the DataFrame
model = plexe.Model(intent="Predict customer churn")
model.build(datasets=[df])

When using DataFrames:

Plexe works with one or more DataFrames (pass them as a list)
It automatically analyzes columns to understand data types and identify potential targets
For multiple DataFrames, Plexe can detect relationships between them based on column names and data

DatasetGenerator

For cases where you have limited or no existing data, you can use the DatasetGenerator class to create synthetic training data:

from plexe import DatasetGenerator
from pydantic import BaseModel

# Define schema for synthetic data
class CustomerData(BaseModel):
    age: int
    subscription_months: int
    monthly_spend: float
    churn: bool

# Create a generator for synthetic data
generator = DatasetGenerator(
    description="Generate customer data with churn information",
    provider="openai/gpt-4o-mini",  # Specify LLM provider for generation
    schema=CustomerData
)

# Generate data (specify number of samples)
generator.generate(num_samples=100)

# Use the generator when building the model
model = plexe.Model(intent="Predict customer churn")
model.build(datasets=[generator])

The DatasetGenerator:

Creates realistic synthetic data based on the schema and description
Requires a provider specification for the LLM that will generate the data
Can augment existing data by passing an existing DataFrame in the constructor
Explicitly generates samples with the generate() method before model building

Schemas

Schemas define the structure and validation rules for inputs and outputs of your model. Plexe supports two ways to define schemas:

Dictionary-Based Schemas

The simplest way to define schemas is with Python dictionaries mapping field names to types:

import plexe

# Create a model with dictionary schemas
model = plexe.Model(
    intent="Predict house prices",
    input_schema={
        "square_footage": float,
        "bedrooms": int,
        "bathrooms": float,
        "location": str
    },
    output_schema={
        "price": float
    }
)

Dictionary schemas:

Are intuitive and quick to define
Support basic Python types (int, float, str, bool)
Are automatically converted to Pydantic models internally

Pydantic Models

For more complex schemas with validation rules, descriptions, or nested structures, Pydantic models provide greater flexibility:

import plexe
from pydantic import BaseModel, Field
from typing import List, Optional

# Define input schema
class PropertyFeatures(BaseModel):
    square_footage: float = Field(description="Total area in square feet", gt=0)
    bedrooms: int = Field(description="Number of bedrooms", ge=0)
    bathrooms: float = Field(description="Number of bathrooms (e.g., 2.5)", ge=0)
    location: str = Field(description="Neighborhood or area")
    amenities: Optional[List[str]] = Field(default=None, description="List of property amenities")

# Define output schema
class PricePrediction(BaseModel):
    price: float = Field(description="Predicted sale price in USD", gt=0)
    confidence: Optional[float] = Field(default=None, description="Prediction confidence score (0-1)")

# Create model with Pydantic schemas
model = plexe.Model(
    intent="Predict house prices based on property features",
    input_schema=PropertyFeatures,
    output_schema=PricePrediction
)

Pydantic schemas:

Provide rich validation (min/max values, regex patterns, etc.)
Support field descriptions that help the ML Engineer agent understand the data
Handle complex nested structures and optional fields
Enable detailed documentation

Schema Inference

If you don’t provide explicit schemas but do provide datasets, Plexe will attempt to infer the schemas:

import pandas as pd
import plexe

# Load data
df = pd.read_csv("housing_data.csv")

# Create model without schemas
model = plexe.Model(intent="Predict house prices based on property features")

# Build with data - schemas will be inferred
model.build(datasets=[df])

# Access the inferred schemas
print("Inferred input schema:", model.input_schema)
print("Inferred output schema:", model.output_schema)

During schema inference:

Plexe analyzes the DataFrame columns to determine types
It uses the model intent and column names to identify likely target variables
Input features and output targets are separated based on this analysis

How Plexe Uses Schemas and Datasets Internally

When you call model.build():

Schema Analysis: Plexe examines the provided schemas (or infers them) to understand the data structure
Data Exploration: The ML agent explores the datasets to understand patterns, distributions, and relationships
Feature Engineering: The agent determines which transformations or feature engineering steps are needed
Model Selection: Based on the schemas and data, the agent selects appropriate ML algorithms
Training: The model is trained on the provided data
Validation: Plexe ensures the trained model adheres to the output schema

Best Practices

Provide Explicit Schemas when possible for clarity and more accurate model building
Use Descriptive Field Names that clearly communicate the meaning of each field
Add Field Descriptions in Pydantic models to guide the ML agent
Clean Your Data before passing to Plexe for better results
Ensure Representative Data to help Plexe build more accurate models

Introduction

Plexe Python Library

Plexe Platform

Datasets

Pandas DataFrames

DatasetGenerator

Schemas

Dictionary-Based Schemas

Pydantic Models

Schema Inference

How Plexe Uses Schemas and Datasets Internally

Best Practices

Introduction

Plexe Python Library

Plexe Platform

​Datasets

​Pandas DataFrames

​DatasetGenerator

​Schemas

​Dictionary-Based Schemas

​Pydantic Models

​Schema Inference

​How Plexe Uses Schemas and Datasets Internally

​Best Practices

Datasets

Pandas DataFrames

DatasetGenerator

Schemas

Dictionary-Based Schemas

Pydantic Models

Schema Inference

How Plexe Uses Schemas and Datasets Internally

Best Practices