Ray Train API
PyTorch Ecosystem

TorchTrainer: A Trainer for data parallel PyTorch training.
TorchConfig: Configuration for torch process group setup.
TorchXLAConfig: Configuration for torch XLA setup.
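A minimal sketch of the driver-side entry point (not taken from the Ray docs); the model, data, and hyperparameters below are toy placeholders:

```python
import torch
import torch.nn as nn

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_loop_per_worker(config):
    # Wrap the model for DDP and move it to this worker's device.
    model = prepare_model(nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for epoch in range(config["epochs"]):
        x = torch.randn(32, 4)  # toy batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```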
PyTorch

get_device: Gets the correct torch device configured for this process.
get_devices: Gets the correct torch device list configured for this process.
prepare_model: Prepares the model for distributed execution.
prepare_data_loader: Prepares DataLoader for distributed execution.
enable_reproducibility: Limits sources of nondeterministic behavior.
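A sketch of how these utilities typically combine inside a training function; it assumes the function runs under a TorchTrainer, so the process group already exists:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray.train.torch import enable_reproducibility, get_device, prepare_data_loader


def train_func():
    enable_reproducibility(seed=42)  # limit nondeterministic behavior
    device = get_device()            # device assigned to this worker

    dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
    # Adds a DistributedSampler and moves batches to `device` automatically.
    loader = prepare_data_loader(DataLoader(dataset, batch_size=16))
    for x, y in loader:
        pass  # x and y already live on `device`
```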
PyTorch Lightning

prepare_trainer: Prepare the PyTorch Lightning Trainer for distributed execution.
RayLightningEnvironment: Sets up the Lightning DDP training environment for the Ray cluster.
RayDDPStrategy: Subclass of DDPStrategy to ensure compatibility with Ray orchestration.
RayFSDPStrategy: Subclass of FSDPStrategy to ensure compatibility with Ray orchestration.
RayDeepSpeedStrategy: Subclass of DeepSpeedStrategy to ensure compatibility with Ray orchestration.
RayTrainReportCallback: A simple callback that reports checkpoints to Ray on train epoch end.
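Put together, a Lightning training function typically looks like the sketch below. This assumes the `lightning` package (use `pytorch_lightning` otherwise); `ToyModule` is a stand-in for your own LightningModule:

```python
import torch
import lightning.pytorch as pl

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


class ToyModule(pl.LightningModule):
    # Stand-in LightningModule; replace with your own model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


def train_func():
    trainer = pl.Trainer(
        max_epochs=1,
        devices="auto",
        accelerator="auto",
        strategy=RayDDPStrategy(),             # Ray-aware DDP
        plugins=[RayLightningEnvironment()],   # cluster environment for Ray
        callbacks=[RayTrainReportCallback()],  # checkpoint + report each epoch
    )
    trainer = prepare_trainer(trainer)  # validate the setup for Ray Train
    loader = torch.utils.data.DataLoader(torch.randn(64, 4), batch_size=16)
    trainer.fit(ToyModule(), train_dataloaders=loader)


ray_trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
result = ray_trainer.fit()
```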
Hugging Face Transformers

prepare_trainer: Prepare your HuggingFace Transformer Trainer for Ray Train.
RayTrainReportCallback: A simple callback to report checkpoints and metrics to Ray Train.
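A sketch of the reporting pattern, assuming an existing transformers Trainer run inside a Ray Train worker; the model and dataset here are explicit placeholders:

```python
from ray.train.huggingface.transformers import (
    RayTrainReportCallback,
    prepare_trainer,
)


def train_func():
    from transformers import Trainer, TrainingArguments

    model, train_ds = ..., ...  # placeholders: your model and dataset

    args = TrainingArguments(output_dir="out", num_train_epochs=1, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.add_callback(RayTrainReportCallback())  # metrics + checkpoints to Ray
    trainer = prepare_trainer(trainer)              # final patching for Ray Train
    trainer.train()
```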
More Frameworks
Tensorflow/Keras

TensorflowTrainer: A Trainer for data parallel Tensorflow training.
TensorflowConfig: PublicAPI (beta): This API is in beta and may change before becoming stable.
prepare_dataset_shard: A utility function that overrides the default config for a Tensorflow Dataset.
ReportCheckpointCallback: Keras callback for Ray Train reporting and checkpointing.
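A minimal sketch of the Keras flow (not from the Ray docs); the model and data are toy placeholders:

```python
import numpy as np

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer
from ray.train.tensorflow.keras import ReportCheckpointCallback


def train_func():
    import tensorflow as tf

    # Build the model inside the multi-worker strategy scope.
    with tf.distribute.MultiWorkerMirroredStrategy().scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")

    x, y = np.random.rand(64, 4), np.random.rand(64, 1)
    # Reports Keras logs and saves checkpoints to Ray Train each epoch.
    model.fit(x, y, epochs=2, callbacks=[ReportCheckpointCallback()])


trainer = TensorflowTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
result = trainer.fit()
```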
Horovod

HorovodTrainer: A Trainer for data parallel Horovod training.
HorovodConfig: Configurations for Horovod setup.
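A skeleton of the Horovod wiring; the training body itself is elided:

```python
from ray.train import ScalingConfig
from ray.train.horovod import HorovodConfig, HorovodTrainer


def train_func():
    import horovod.torch as hvd

    hvd.init()  # the Horovod environment is already set up by Ray Train
    ...         # build the model, wrap the optimizer with hvd.DistributedOptimizer


trainer = HorovodTrainer(
    train_func,
    horovod_config=HorovodConfig(),  # defaults; set Horovod options here
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```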
XGBoost

XGBoostTrainer: A Trainer for data parallel XGBoost training.
RayTrainReportCallback: XGBoost callback to save checkpoints and report metrics.
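A minimal sketch assuming the dataset-based XGBoostTrainer API, with a toy Ray Dataset; in the train-loop style, RayTrainReportCallback is instead passed to xgboost.train via callbacks=[...]:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy dataset with feature "x" and binary label "y".
train_ds = ray.data.from_items([{"x": float(i), "y": i % 2} for i in range(100)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()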
LightGBM

LightGBMTrainer: A Trainer for data parallel LightGBM training.
RayTrainReportCallback: Creates a callback that reports metrics and checkpoints the model.
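A sketch of the callback in a train-loop-style function, assuming it can be attached directly to lightgbm.train; the data is random toy data:

```python
import lightgbm as lgb
import numpy as np

from ray.train.lightgbm import RayTrainReportCallback


def train_func():
    # Toy data; in practice read this worker's shard via ray.train.get_dataset_shard.
    data = lgb.Dataset(np.random.rand(100, 4), label=np.random.randint(0, 2, 100))
    lgb.train(
        {"objective": "binary"},
        data,
        num_boost_round=10,
        callbacks=[RayTrainReportCallback()],  # reports metrics, saves checkpoints
    )
```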
Ray Train Configuration

CheckpointConfig: Configurable parameters for defining the checkpointing strategy.
DataConfig: Class responsible for configuring Train dataset preprocessing.
FailureConfig: Configuration related to failure handling of each training/tuning run.
RunConfig: Runtime configuration for training and tuning runs.
ScalingConfig: Configuration for scaling training.
SyncConfig: Configuration object for Train/Tune file syncing to RunConfig(storage_path).
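A sketch of how the configuration objects combine; the run name and storage path below are placeholders:

```python
from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig

scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
run_config = RunConfig(
    name="my-run",                              # placeholder run name
    storage_path="s3://my-bucket/ray-results",  # placeholder storage path
    checkpoint_config=CheckpointConfig(num_to_keep=2),
    failure_config=FailureConfig(max_failures=3),
)
# Both are passed to any Trainer constructor, e.g.
# TorchTrainer(..., scaling_config=scaling_config, run_config=run_config).
```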
Ray Train Utilities

Classes

Checkpoint: A reference to data persisted as a directory in local or remote storage.
TrainContext: Context containing metadata that can be accessed within Ray Train workers.

Functions

get_checkpoint: Access the latest reported checkpoint to resume from if one exists.
get_context: Get or create a singleton training context.
get_dataset_shard: Returns the ray.data.DataIterator shard for this worker.
report: Report metrics and optionally save a checkpoint.
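A sketch of these utilities inside a training function; it assumes the function runs under a Trainer with a "train" dataset attached:

```python
import tempfile

import ray.train
from ray.train import Checkpoint


def train_func():
    rank = ray.train.get_context().get_world_rank()  # e.g. to log only on rank 0

    # Resume from the latest reported checkpoint, if one exists.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            ...  # restore state from ckpt_dir

    # Iterate over this worker's shard of the "train" dataset.
    for batch in ray.train.get_dataset_shard("train").iter_batches(batch_size=32):
        ...

    with tempfile.TemporaryDirectory() as tmp:
        ...  # write model state into tmp
        ray.train.report({"loss": 0.0}, checkpoint=Checkpoint.from_directory(tmp))
```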
Ray Train Output

Result: The final result of an ML training run or a Tune trial.
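Continuing from any of the trainer sketches above, the Result object exposes the run's outcome:

```python
result = trainer.fit()    # any of the trainers above
print(result.metrics)     # last metrics passed to ray.train.report
print(result.checkpoint)  # latest reported Checkpoint, if any
print(result.path)        # where run outputs were persisted
if result.error:          # surfaced exception if the run failed
    raise result.error
```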
Ray Train Errors

SessionMisuseError: Indicates a method or function was used outside of a session.
TrainingFailedError: An error indicating that training has failed.
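A sketch of catching the failure at the driver; this assumes TrainingFailedError is importable from ray.train.base_trainer (the import path may differ across Ray versions):

```python
from ray.train.base_trainer import TrainingFailedError

try:
    result = trainer.fit()  # any trainer from the sketches above
except TrainingFailedError:
    ...  # inspect logs or re-raise; wraps the underlying worker failure
```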
Ray Train Developer APIs
Trainer Base Classes

BaseTrainer: Defines the interface for distributed training on Ray.
DataParallelTrainer: A Trainer for data parallel training.
Train Backend Base Classes

Backend: Singleton for the distributed communication backend.
BackendConfig: Parent class for configurations of a training backend.
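A hedged sketch of a custom communication backend built on these base classes; the hook bodies are placeholders and `timeout_s` is a hypothetical option:

```python
from dataclasses import dataclass

from ray.train.backend import Backend, BackendConfig


class MyBackend(Backend):
    # Hooks run on the driver around the worker group's lifecycle.
    def on_start(self, worker_group, backend_config):
        ...  # placeholder: initialize communication across workers

    def on_shutdown(self, worker_group, backend_config):
        ...  # placeholder: tear down communication state


@dataclass
class MyBackendConfig(BackendConfig):
    timeout_s: int = 30  # hypothetical option, for illustration only

    @property
    def backend_cls(self):
        return MyBackend
```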