Ray Train API#
PyTorch Ecosystem#
| `ray.train.torch.TorchTrainer` | A Trainer for data parallel PyTorch training. |
| `ray.train.torch.TorchConfig` | Configuration for torch process group setup. |
| `ray.train.torch.xla.TorchXLAConfig` | Configuration for torch XLA setup. |
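A minimal sketch of launching data parallel PyTorch training with `TorchTrainer`; the training function body is a placeholder and the worker count is an arbitrary choice:

```python
import torch

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder training logic; a real loop would build a model,
    # iterate over data, and call ray.train.report() each epoch.
    model = torch.nn.Linear(4, 1)
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```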
PyTorch#
| `ray.train.torch.get_device` | Gets the correct torch device configured for this process. |
| `ray.train.torch.get_devices` | Gets the correct torch device list configured for this process. |
| `ray.train.torch.prepare_model` | Prepares the model for distributed execution. |
| `ray.train.torch.prepare_data_loader` | Prepares `DataLoader` for distributed execution. |
| `ray.train.torch.enable_reproducibility` | Limits sources of nondeterministic behavior. |
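These utilities are meant to be called inside the training function. A sketch that combines them, assuming a toy model and a synthetic `DataLoader`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray.train.torch import get_device, prepare_data_loader, prepare_model


def train_loop_per_worker(config):
    # prepare_model() moves the model onto this worker's device and wraps it
    # in DistributedDataParallel when there is more than one worker.
    model = prepare_model(torch.nn.Linear(4, 1))

    # prepare_data_loader() adds a DistributedSampler and moves batches onto
    # the device returned by get_device().
    dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    data_loader = prepare_data_loader(DataLoader(dataset, batch_size=8))

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for features, labels in data_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    print("Finished training on device:", get_device())
```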
PyTorch Lightning#
| `ray.train.lightning.prepare_trainer` | Prepare the PyTorch Lightning Trainer for distributed execution. |
| `ray.train.lightning.RayLightningEnvironment` | Sets up the Lightning DDP training environment for a Ray cluster. |
| `ray.train.lightning.RayDDPStrategy` | Subclass of DDPStrategy to ensure compatibility with Ray orchestration. |
| `ray.train.lightning.RayFSDPStrategy` | Subclass of FSDPStrategy to ensure compatibility with Ray orchestration. |
| `ray.train.lightning.RayDeepSpeedStrategy` | Subclass of DeepSpeedStrategy to ensure compatibility with Ray orchestration. |
| `ray.train.lightning.RayTrainReportCallback` | A simple callback that reports checkpoints to Ray on train epoch end. |
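A sketch of how the Lightning integrations fit together inside the training function; `ToyModule` and the synthetic dataloader exist only to illustrate the wiring:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


class ToyModule(pl.LightningModule):
    # Minimal LightningModule used only to make the sketch self-contained.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        features, labels = batch
        loss = torch.nn.functional.mse_loss(self.layer(features), labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def train_loop_per_worker(config):
    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8
    )
    # Swap in the Ray-aware strategy, environment, and reporting callback,
    # then patch the Lightning Trainer for execution under Ray Train.
    trainer = pl.Trainer(
        max_epochs=2,
        devices="auto",
        accelerator="auto",
        strategy=RayDDPStrategy(),
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],
        enable_progress_bar=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), train_dataloaders=train_loader)
```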
Hugging Face Transformers#
| `ray.train.huggingface.transformers.prepare_trainer` | Prepare your Hugging Face Transformers Trainer for Ray Train. |
| `ray.train.huggingface.transformers.RayTrainReportCallback` | A simple callback to report checkpoints and metrics to Ray Train. |
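A sketch of preparing a Transformers `Trainer` inside the training function; `build_transformers_trainer` is a hypothetical helper standing in for the usual model, arguments, and dataset setup:

```python
from ray.train.huggingface.transformers import (
    RayTrainReportCallback,
    prepare_trainer,
)


def train_loop_per_worker(config):
    # Hypothetical helper that returns an ordinary transformers.Trainer;
    # its construction (model, TrainingArguments, datasets) is omitted.
    trainer = build_transformers_trainer(config)

    # Report metrics/checkpoints to Ray Train, then patch the Trainer
    # for distributed execution under Ray.
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    trainer.train()
```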
More Frameworks#
Tensorflow/Keras#
| `ray.train.tensorflow.TensorflowTrainer` | A Trainer for data parallel TensorFlow training. |
| `ray.train.tensorflow.TensorflowConfig` | PublicAPI (beta): this API is in beta and may change before becoming stable. |
| `ray.train.tensorflow.prepare_dataset_shard` | A utility function that overrides the default config for a TensorFlow Dataset. |
| `ray.train.tensorflow.keras.ReportCheckpointCallback` | Keras callback for Ray Train reporting and checkpointing. |
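A sketch of running Keras training through `TensorflowTrainer`, using synthetic data and reporting metrics and checkpoints back to Ray Train via `ReportCheckpointCallback`:

```python
import tensorflow as tf

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer
from ray.train.tensorflow.keras import ReportCheckpointCallback


def train_loop_per_worker(config):
    # Build the model under MultiWorkerMirroredStrategy so variables are
    # synchronized across the Ray Train workers.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")

    x = tf.random.normal((64, 4))
    y = tf.random.normal((64, 1))
    # The callback reports metrics and checkpoints to Ray Train each epoch.
    model.fit(x, y, epochs=2, callbacks=[ReportCheckpointCallback()])


trainer = TensorflowTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```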
Horovod#
| `ray.train.horovod.HorovodTrainer` | A Trainer for data parallel Horovod training. |
| `ray.train.horovod.HorovodConfig` | Configurations for Horovod setup. |
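A sketch of a Horovod training function launched through `HorovodTrainer`; the trainer sets up the Horovod ring, so the worker code only calls the usual Horovod primitives. The model and loop body are placeholders:

```python
import horovod.torch as hvd
import torch

from ray.train import ScalingConfig
from ray.train.horovod import HorovodTrainer


def train_loop_per_worker(config):
    hvd.init()
    model = torch.nn.Linear(4, 1)
    optimizer = hvd.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.01),
        named_parameters=model.named_parameters(),
    )
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    ...  # training loop omitted


trainer = HorovodTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```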
XGBoost#
| `ray.train.xgboost.XGBoostTrainer` | A Trainer for data parallel XGBoost training. |
| `ray.train.xgboost.RayTrainReportCallback` | XGBoost callback to save checkpoints and report metrics. |
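A sketch of data parallel XGBoost training, assuming the tree-based `XGBoostTrainer` constructor that takes `label_column`, `params`, and `datasets`; the dataset here is a synthetic stand-in:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Synthetic Ray Dataset; any dataset with a "target" label column works.
train_ds = ray.data.from_items(
    [{"x": float(i), "target": i % 2} for i in range(100)]
)

trainer = XGBoostTrainer(
    label_column="target",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```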
LightGBM#
| `ray.train.lightgbm.LightGBMTrainer` | A Trainer for data parallel LightGBM training. |
| `ray.train.lightgbm.RayTrainReportCallback` | Creates a callback that reports metrics and checkpoints the model. |
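LightGBM training follows the same pattern as XGBoost; a sketch with a synthetic dataset and placeholder parameters:

```python
import ray
from ray.train import ScalingConfig
from ray.train.lightgbm import LightGBMTrainer

train_ds = ray.data.from_items(
    [{"x": float(i), "target": i % 2} for i in range(100)]
)

trainer = LightGBMTrainer(
    label_column="target",
    params={"objective": "binary"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```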
Ray Train Configuration#
| `ray.train.CheckpointConfig` | Configurable parameters for defining the checkpointing strategy. |
| `ray.train.DataConfig` | Class responsible for configuring Train dataset preprocessing. |
| `ray.train.FailureConfig` | Configuration related to failure handling of each training/tuning run. |
| `ray.train.RunConfig` | Runtime configuration for training and tuning runs. |
| `ray.train.ScalingConfig` | Configuration for scaling training. |
| `ray.train.SyncConfig` | Configuration object for Train/Tune file syncing to `RunConfig(storage_path)`. |
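A sketch of how these configuration objects compose; the names and values below are arbitrary examples, and the resulting objects are then passed to a Trainer:

```python
from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig

# storage_path may be a local directory or a cloud URI.
run_config = RunConfig(
    name="my_experiment",
    storage_path="~/ray_results",
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    ),
    failure_config=FailureConfig(max_failures=3),
)
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)

# Both objects are then handed to a Trainer, e.g.
# TorchTrainer(..., run_config=run_config, scaling_config=scaling_config).
```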
Ray Train Utilities#
Classes
| `ray.train.Checkpoint` | A reference to data persisted as a directory in local or remote storage. |
| `ray.train.context.TrainContext` | Context containing metadata that can be accessed within Ray Train workers. |
Functions
| `ray.train.get_checkpoint` | Access the latest reported checkpoint to resume from if one exists. |
| `ray.train.get_context` | Get or create a singleton training context. |
| `ray.train.get_dataset_shard` | Returns the `ray.data.DataIterator` shard for this worker. |
| `ray.train.report` | Report metrics and optionally save a checkpoint. |
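A sketch of using these utilities inside a training function to resume from, and report, checkpoints; the checkpoint file name and metric values are arbitrary:

```python
import json
import os
import tempfile

from ray.train import Checkpoint, get_checkpoint, get_context, report


def train_loop_per_worker(config):
    # Resume from the latest reported checkpoint if one exists.
    checkpoint = get_checkpoint()
    start_epoch = 0
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json")) as f:
                start_epoch = json.load(f)["epoch"] + 1

    world_rank = get_context().get_world_rank()

    for epoch in range(start_epoch, 5):
        loss = 1.0 / (epoch + 1)  # placeholder metric
        with tempfile.TemporaryDirectory() as tmp_dir:
            with open(os.path.join(tmp_dir, "state.json"), "w") as f:
                json.dump({"epoch": epoch}, f)
            # Report metrics and attach the checkpoint directory.
            report(
                {"loss": loss, "rank": world_rank},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )
```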
Ray Train Output#
| `ray.train.Result` | The final result of an ML training run or a Tune trial. |
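A short sketch of inspecting the `Result` returned by `Trainer.fit()`; the trainer itself is assumed to have been constructed elsewhere:

```python
result = trainer.fit()         # `trainer` is any configured Ray Train Trainer
print(result.metrics)          # last reported metrics dictionary
print(result.checkpoint)       # latest Checkpoint, or None if none was reported
print(result.path)             # storage path of the run
print(result.error)            # the exception if training failed, else None
```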
Ray Train Errors#
| `ray.train.error.SessionMisuseError` | Indicates a method or function was used outside of a session. |
| `ray.train.base_trainer.TrainingFailedError` | An error indicating that training has failed. |
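A sketch of catching a training failure; the import path for `TrainingFailedError` is an assumption based on recent Ray versions, and `trainer` is any configured Trainer:

```python
from ray.train.base_trainer import TrainingFailedError

try:
    result = trainer.fit()  # `trainer` is assumed to exist
except TrainingFailedError as exc:
    # Raised once the retries allowed by FailureConfig are exhausted;
    # the message typically wraps the underlying worker error.
    print(f"Training failed: {exc}")
```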
Ray Train Developer APIs#
Trainer Base Classes#
| `ray.train.trainer.BaseTrainer` | Defines the interface for distributed training on Ray. |
| `ray.train.data_parallel_trainer.DataParallelTrainer` | A Trainer for data parallel training. |
Train Backend Base Classes#
| `ray.train.backend.Backend` | Singleton for distributed communication backend. |
| `ray.train.backend.BackendConfig` | Parent class for configurations of training backend. |
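A heavily hedged sketch of a custom backend, assuming the `Backend.on_start`/`on_shutdown` hooks and the `BackendConfig.backend_cls` property present in recent Ray versions; `MyBackend` and `MyBackendConfig` are hypothetical:

```python
from dataclasses import dataclass

from ray.train.backend import Backend, BackendConfig


class MyBackend(Backend):
    # Hypothetical backend that only logs lifecycle events instead of
    # setting up a real communication library.
    def on_start(self, worker_group, backend_config):
        print("Starting MyBackend")

    def on_shutdown(self, worker_group, backend_config):
        print("Shutting down MyBackend")


@dataclass
class MyBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return MyBackend
```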