Distributed Training with Ray
Leverage Ray for distributed and parallel model building with Plexe.
Plexe can utilize Ray to distribute parts of the model building process, potentially speeding up execution, especially when exploring multiple solution candidates (`max_iterations > 1`) or when individual training runs are computationally intensive.
Prerequisites
- Install Ray: If you haven’t already, install Ray. You might need specific dependencies depending on your environment (e.g., for cluster support).
- Install Plexe with Ray support: Ensure you have the necessary Plexe dependencies.
- (Optional) Ray Cluster: For true distribution, you need a running Ray cluster. You can start one locally or connect to a remote cluster. See the Ray documentation for setup instructions. If no cluster is running, Ray will typically utilize multiple cores on your local machine.
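As a quick sanity check, the following minimal sketch starts Ray locally and prints the resources it can see; it assumes nothing Plexe-specific:

```python
import ray

# Start a local Ray instance (uses this machine's cores if no cluster is running).
ray.init(ignore_reinit_error=True)

# Inspect the CPUs/GPUs Ray can schedule work on, e.g. {'CPU': 8.0, ...}.
print(ray.cluster_resources())

ray.shutdown()
```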
Enabling Distributed Execution
To enable distributed execution with Ray, set the `distributed=True` flag when initializing the `plexe.Model`.
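A minimal sketch might look as follows; the intent, dataset, and provider values are illustrative placeholders, while `distributed=True` and `max_iterations` are the settings discussed on this page:

```python
import pandas as pd
import plexe

# Placeholder training data; in practice you would load your own dataset.
training_df = pd.DataFrame(
    {"tenure_months": [1, 24, 36], "monthly_spend": [20.0, 55.0, 80.0], "churned": [1, 0, 0]}
)

# Illustrative model definition; distributed=True is the relevant setting here.
model = plexe.Model(
    intent="Predict whether a customer will churn",  # placeholder intent
    distributed=True,                                # use Ray when available
)

# max_iterations > 1 lets Ray explore multiple candidates in parallel.
model.build(
    datasets=[training_df],
    provider="openai/gpt-4o-mini",   # placeholder LLM provider string
    max_iterations=4,
)
```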
How it Works
When `distributed=True` is set:
- Executor Selection: Plexe checks whether Ray is available and initialized. If so, it uses the `RayExecutor` to run the generated training code; otherwise, it falls back to the default `ProcessExecutor`.
- Task Distribution: The `RayExecutor` submits the training code execution as a remote task (`@ray.remote`) to the Ray cluster (or local Ray instance), as illustrated in the sketch after this list.
- Parallelism: If `max_iterations` in `model.build()` is greater than 1 and the Ray cluster has sufficient resources (CPUs/GPUs), Ray can execute multiple training iterations in parallel, significantly speeding up the exploration of different model candidates.
- Results: Results (metrics, artifacts, logs, exceptions) are collected from the Ray tasks and integrated back into the Plexe model building process.
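To make the executor behaviour above concrete, here is a simplified, illustrative sketch of the submit-and-collect pattern; it is not Plexe's actual `RayExecutor` implementation, and the function names are hypothetical:

```python
import ray

@ray.remote
def run_training(training_code: str, iteration: int) -> dict:
    # In Plexe, the generated training code would be executed here and its
    # metrics, artifacts, and logs captured; a dummy result stands in for that.
    return {"iteration": iteration, "metric": 0.0, "logs": ""}

def execute_iterations(training_code: str, n_iterations: int) -> list:
    if ray.is_initialized():
        # Ray path: submit each iteration as a remote task; Ray schedules them
        # in parallel when the cluster has enough CPUs/GPUs.
        futures = [run_training.remote(training_code, i) for i in range(n_iterations)]
        return ray.get(futures)
    # Fallback path, analogous to running each iteration in a local process.
    return [{"iteration": i, "metric": 0.0, "logs": ""} for i in range(n_iterations)]
```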
The degree of speedup depends on the number of iterations, the computational cost of each training run, and the resources available in your Ray cluster (or on your local machine). Tasks involving LLM calls for planning or code generation are typically not distributed via Ray in the current implementation.
Configuration (Advanced)
While `distributed=True` is the primary switch, advanced Ray configuration (such as the cluster address or resource limits) can be managed through Ray's standard initialization methods (`ray.init(...)`) or configuration files before initializing the `plexe.Model`. Plexe also reads some Ray settings from its internal config (`plexe.config.config.ray`), which may be relevant in specific deployment scenarios, but calling `ray.init()` directly is the most common way to configure the connection.
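For example, a hedged sketch of configuring Ray yourself before building; the address and resource values are placeholders for your environment:

```python
import ray

# Start Ray locally with explicit resource limits (these arguments only apply
# when starting a new local instance, not when connecting to a cluster).
ray.init(num_cpus=8, num_gpus=0)

# Or, instead, connect to an existing cluster (placeholder address):
# ray.init(address="ray://<head-node-host>:10001")

# Construct plexe.Model(..., distributed=True) afterwards, as shown above;
# Plexe will detect and reuse the already-initialized Ray instance.
```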