How Plexe handles data inputs and defines the structure of ML models.
In Plexe, datasets and schemas work together to define what data your model consumes and what outputs it produces. Understanding how these components interact is key to creating effective models.
The most common way to provide data to Plexe is through Pandas DataFrames, which offer a tabular structure that’s ideal for machine learning:
import pandas as pd
import plexe

# Load data into a DataFrame
df = pd.read_csv("customer_data.csv")

# Create a model and build it with the DataFrame
model = plexe.Model(intent="Predict customer churn")
model.build(datasets=[df])
When using DataFrames:
Plexe works with one or more DataFrames (pass them as a list)
It automatically analyzes columns to understand data types and identify potential targets
For multiple DataFrames, Plexe can detect relationships between them based on column names and data
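To make the idea of relationship detection concrete, here is a minimal sketch of one possible heuristic: treat a column as a candidate join key when two tables share its name and at least one value. This is an illustration only, not Plexe's actual implementation. The helper `candidate_join_keys` and the sample tables are hypothetical, and rows are plain dictionaries (as `csv.DictReader` would produce) so the example runs with the standard library alone.

```python
from itertools import combinations


def candidate_join_keys(tables):
    """Map each pair of table names to the shared columns whose values overlap.

    Hypothetical helper for illustration; not part of the Plexe API.
    """
    links = {}
    for (name_a, rows_a), (name_b, rows_b) in combinations(tables.items(), 2):
        # Column names are taken from the first row of each table.
        cols_a = set(rows_a[0]) if rows_a else set()
        cols_b = set(rows_b[0]) if rows_b else set()
        keys = []
        for col in sorted(cols_a & cols_b):
            values_a = {row[col] for row in rows_a}
            values_b = {row[col] for row in rows_b}
            # A shared name alone is weak evidence; also require shared values.
            if values_a & values_b:
                keys.append(col)
        if keys:
            links[(name_a, name_b)] = keys
    return links


# Hypothetical sample data standing in for two related DataFrames.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "amount": 35.5},
]

links = candidate_join_keys({"customers": customers, "orders": orders})
print(links)
```

Consistent, descriptive column names across related tables (for example, `customer_id` in both) give any detection based on names and data a better chance of finding the link.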
For cases where you have limited or no existing data, you can use the DatasetGenerator class to create synthetic training data:
import plexe
from plexe import DatasetGenerator
from pydantic import BaseModel

# Define schema for synthetic data
class CustomerData(BaseModel):
    age: int
    subscription_months: int
    monthly_spend: float
    churn: bool

# Create a generator for synthetic data
generator = DatasetGenerator(
    description="Generate customer data with churn information",
    provider="openai/gpt-4o-mini",  # Specify LLM provider for generation
    schema=CustomerData,
)

# Generate data (specify number of samples)
generator.generate(num_samples=100)

# Use the generator when building the model
model = plexe.Model(intent="Predict customer churn")
model.build(datasets=[generator])
The DatasetGenerator:
Creates realistic synthetic data based on the schema and description
Requires a provider specification for the LLM that will generate the data
Can augment existing data when you pass an existing DataFrame to the constructor
Needs an explicit call to generate() to produce samples before model building
If you don’t provide explicit schemas but do provide datasets, Plexe will attempt to infer the schemas:
import pandas as pd
import plexe

# Load data
df = pd.read_csv("housing_data.csv")

# Create model without schemas
model = plexe.Model(intent="Predict house prices based on property features")

# Build with data - schemas will be inferred
model.build(datasets=[df])

# Access the inferred schemas
print("Inferred input schema:", model.input_schema)
print("Inferred output schema:", model.output_schema)
During schema inference:
Plexe analyzes the DataFrame columns to determine types
It uses the model intent and column names to identify likely target variables
Input features and output targets are separated based on this analysis
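The steps above can be sketched in plain Python to show what type inference and the input/output split involve. This is a simplified illustration under stated assumptions, not Plexe's implementation: the helper `infer_schemas` is hypothetical, the target column is named by hand (Plexe uses the model intent and column names to choose it), and schemas are represented as simple name-to-type dictionaries.

```python
def infer_schemas(rows, target):
    """Infer a type name per column, then split columns into input and output.

    Hypothetical helper for illustration; not part of the Plexe API.
    """
    # Collect every Python type name observed in each column.
    observed = {}
    for row in rows:
        for name, value in row.items():
            observed.setdefault(name, set()).add(type(value).__name__)

    def collapse(type_names):
        if len(type_names) == 1:
            return next(iter(type_names))
        if type_names == {"int", "float"}:
            return "float"  # mixed numeric columns widen to float
        return "str"  # assumption: fall back to str for other mixtures

    column_types = {name: collapse(types) for name, types in observed.items()}
    if target not in column_types:
        raise KeyError(f"target column {target!r} not found")

    input_schema = {n: t for n, t in column_types.items() if n != target}
    output_schema = {target: column_types[target]}
    return input_schema, output_schema


# Hypothetical rows standing in for housing_data.csv.
rows = [
    {"bedrooms": 3, "area_sqft": 1400.0, "city": "Leeds", "price": 250000.0},
    {"bedrooms": 2, "area_sqft": 900, "city": "York", "price": 180000.0},
]

input_schema, output_schema = infer_schemas(rows, target="price")
print("Input schema:", input_schema)
print("Output schema:", output_schema)
```

Inference of this kind can only be as good as the column names and values it sees, so an intent that names the target clearly (for example, "Predict house prices") helps the split land on the right column.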