Dataset API
Dataset
- class ray.data.Dataset(plan: ExecutionPlan, logical_plan: LogicalPlan)
 A Dataset is a distributed data collection for data loading and processing.
Datasets are distributed pipelines that produce ObjectRef[Block] outputs, where each block holds data in Arrow format, representing a shard of the overall data collection. The block also determines the unit of parallelism. For more details, see Ray Data Internals.

Datasets can be created in multiple ways: from synthetic data via range_*() APIs, from existing in-memory data via from_*() APIs (this creates a subclass of Dataset called MaterializedDataset), or from external storage systems such as local disk, S3, HDFS, etc. via the read_*() APIs. The (potentially processed) Dataset can be saved back to external storage systems via the write_*() APIs.

Examples
import ray

# Create dataset from synthetic data.
ds = ray.data.range(1000)

# Create dataset from in-memory data.
ds = ray.data.from_items(
    [{"col1": i, "col2": i * 2} for i in range(1000)]
)

# Create dataset from external storage system.
ds = ray.data.read_parquet("s3://bucket/path")

# Save dataset back to external storage system.
ds.write_csv("s3://bucket/output")
Dataset has two kinds of operations: transformations, which take in a Dataset and output a new Dataset (e.g. map_batches()); and consumption, which produces values (not a data stream) as output (e.g. iter_batches()).

Dataset transformations are lazy, with execution of the transformations being triggered by downstream consumption.
Dataset supports parallel processing at scale: transformations such as map_batches(), aggregations such as min()/max()/mean(), grouping via groupby(), and shuffling operations such as sort(), random_shuffle(), and repartition().

Examples
>>> import ray
>>> ds = ray.data.range(1000)
>>> # Transform batches (Dict[str, np.ndarray]) with map_batches().
>>> ds.map_batches(lambda batch: {"id": batch["id"] * 2})
MapBatches(<lambda>)
+- Dataset(num_rows=1000, schema={id: int64})
>>> # Compute the maximum.
>>> ds.max("id")
999
>>> # Shuffle this dataset randomly.
>>> ds.random_shuffle()
RandomShuffle
+- Dataset(num_rows=1000, schema={id: int64})
>>> # Sort it back in order.
>>> ds.sort("id")
Sort
+- Dataset(num_rows=1000, schema={id: int64})
Both unexecuted and materialized Datasets can be passed between Ray tasks and actors without incurring a copy. Dataset supports conversion to/from several more featureful dataframe libraries (e.g., Spark, Dask, Modin, Mars), and is also compatible with distributed TensorFlow / PyTorch.
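As an illustrative sketch (not part of the reference itself), the snippet below shows the lazy execution model described above: transformations only run once a consuming call such as take() or materialize() is made.

import ray

# Building the pipeline is cheap: no data is processed yet.
ds = ray.data.range(1000).map_batches(lambda batch: {"id": batch["id"] * 2})

# Consumption triggers execution of the pending transformations.
print(ds.take(5))

# materialize() executes the full plan and pins the result in object store memory.
materialized_ds = ds.materialize()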
Basic Transformations

Dataset.add_column | Add the given column to the dataset.
Dataset.drop_columns | Drop one or more columns from the dataset.
Dataset.filter | Filter out rows that don't satisfy the given predicate.
Dataset.flat_map | Apply the given function to each row and then flatten results.
Dataset.limit | Truncate the dataset to the first limit rows.
Dataset.map | Apply the given function to each row of this dataset.
Dataset.map_batches | Apply the given function to batches of data.
Dataset.random_sample | Returns a new Dataset containing a random fraction of the rows.
Dataset.rename_columns | Rename columns in the dataset.
Dataset.select_columns | Select one or more columns from the dataset.
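The following minimal sketch (illustrative only) chains several of the basic transformations above; the column names are made up for the example.

import ray

ds = ray.data.from_items([{"col1": i, "col2": i * 2} for i in range(100)])

# Each transformation returns a new, lazily evaluated Dataset.
ds = ds.add_column("col3", lambda df: df["col1"] + df["col2"])
ds = ds.filter(lambda row: row["col1"] % 2 == 0)
ds = ds.select_columns(["col1", "col3"])
ds = ds.limit(10)
print(ds.take_all())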
Consuming Data

Dataset.iter_batches | Return an iterable over batches of data.
Dataset.iter_rows | Return an iterable over the rows in this dataset.
Dataset.iter_torch_batches | Return an iterable over batches of data represented as Torch tensors.
Dataset.iterator | Return a DataIterator over this dataset.
Dataset.show | Print up to the given number of rows from the Dataset.
Dataset.take | Return up to limit rows from the Dataset.
Dataset.take_all | Return all of the rows in this Dataset.
Dataset.take_batch | Return up to batch_size rows from the Dataset in a batch.
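A short sketch of typical consumption calls (illustrative; the batch size and row counts are arbitrary):

import ray

ds = ray.data.range(100)

# Stream over batches (dicts of NumPy arrays by default) without materializing everything.
for batch in ds.iter_batches(batch_size=32):
    print(batch["id"].shape)

# Pull a bounded number of rows back to the driver process.
rows = ds.take(5)         # list of row dicts
batch = ds.take_batch(5)  # one batch as a dict of NumPy arrays
ds.show(3)                # print a few rows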
Execution

Dataset.materialize | Execute and materialize this dataset into object store memory.
Grouped and Global aggregations

Dataset.aggregate | Aggregate values using one or more functions.
Dataset.groupby | Group rows of a Dataset according to a column.
Dataset.max | Return the maximum of one or more columns.
Dataset.mean | Compute the mean of one or more columns.
Dataset.min | Return the minimum of one or more columns.
Dataset.std | Compute the standard deviation of one or more columns.
Dataset.sum | Compute the sum of one or more columns.
Dataset.unique | List the unique elements in a given column.
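A minimal sketch of global and grouped aggregations (the column names are illustrative):

import ray

ds = ray.data.from_items([{"group": i % 3, "value": i} for i in range(30)])

# Global aggregations produce scalar results.
print(ds.sum("value"), ds.mean("value"), ds.min("value"), ds.max("value"))

# groupby() returns a grouped view; its aggregations produce a new Dataset.
per_group = ds.groupby("group").sum("value")
print(per_group.take_all())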
I/O and Conversion

Dataset.to_dask | Convert this Dataset into a Dask DataFrame.
Dataset.to_mars | Convert this Dataset into a Mars DataFrame.
Dataset.to_modin | Convert this Dataset into a Modin DataFrame.
Dataset.to_pandas | Convert this Dataset to a single pandas DataFrame.
Dataset.to_spark | Convert this Dataset into a Spark DataFrame.
Dataset.to_tf | Return a TensorFlow Dataset over this Dataset.
Dataset.write_csv | Writes the Dataset to CSV files.
Dataset.write_images | Writes the Dataset to images.
Dataset.write_json | Writes the Dataset to JSON and JSONL files.
Dataset.write_mongo | Writes the Dataset to a MongoDB datasource.
Dataset.write_numpy | Writes a column of the Dataset to .npy files.
Dataset.write_parquet | Writes the Dataset to parquet files.
Dataset.write_tfrecords | Write the Dataset to TFRecord files.
Dataset.write_webdataset | Writes the dataset to WebDataset files.
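An illustrative sketch of the conversion and write APIs; the output path below is a placeholder, not a real location:

import ray

ds = ray.data.range(1000)

# Convert to a single in-memory pandas DataFrame (the data must fit on the driver).
df = ds.to_pandas()

# Write the dataset out as Parquet files (placeholder local path).
ds.write_parquet("/tmp/ray_example_output")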
Inspecting Metadata

Dataset.columns | Returns the columns of this Dataset.
Dataset.count | Count the number of rows in the dataset.
Dataset.input_files | Return the list of input files for the dataset.
Dataset.num_blocks | Return the number of blocks of this Dataset.
Dataset.schema | Return the schema of the dataset.
Dataset.size_bytes | Return the in-memory size of the dataset.
Dataset.stats | Returns a string containing execution timing information.
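A sketch of the metadata inspection calls; the S3 path is the same placeholder used in the examples above:

import ray

ds = ray.data.read_parquet("s3://bucket/path")  # placeholder path

print(ds.schema())       # column names and types
print(ds.columns())      # list of column names
print(ds.count())        # number of rows
print(ds.input_files())  # files backing the dataset
print(ds.stats())        # execution timing information (populated after execution)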
Sorting, Shuffling and Repartitioning

Dataset.random_shuffle | Randomly shuffle the rows of this Dataset.
Dataset.sort | Sort the dataset by the specified key column or key function.
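A brief, illustrative sketch of shuffling and sorting:

import ray

ds = ray.data.range(1000)

shuffled = ds.random_shuffle()                 # global random shuffle of all rows
by_id = shuffled.sort("id")                    # sort by the "id" column
by_id_desc = shuffled.sort("id", descending=True)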
Splitting and Merging datasets

Dataset.split | Materialize and split the dataset into n disjoint pieces.
Dataset.split_at_indices | Materialize and split the dataset at the given indices (like np.split).
Dataset.split_proportionately | Materialize and split the dataset using proportions.
Dataset.streaming_split | Returns n DataIterators that can be used to read disjoint subsets of the dataset in parallel.
Dataset.train_test_split | Materialize and split the dataset into train and test subsets.
Dataset.union | Concatenate Datasets across rows.
Dataset.zip | Zip the columns of this dataset with the columns of another.
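An illustrative sketch of splitting and merging; the split counts and proportions are arbitrary:

import ray

ds = ray.data.range(1000)

pieces = ds.split(4)                           # four disjoint materialized pieces
head, tail = ds.split_at_indices([800])        # split at row 800
train, test = ds.train_test_split(test_size=0.2)

combined = ds.union(ray.data.range(1000))      # concatenate rows
zipped = ds.zip(ray.data.range(1000))          # combine columns row-wise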
Schema

- class ray.data.Schema(base_schema: pyarrow.lib.Schema | PandasBlockSchema, *, data_context: DataContext | None = None)
 Dataset schema.

- base_schema
 The underlying Arrow or Pandas schema.
PublicAPI (beta): This API is in beta and may change before becoming stable.
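For illustration, calling schema() on a Dataset returns this Schema wrapper; the columns below are made up:

import ray

ds = ray.data.from_items([{"col1": 1, "col2": "a"}])
schema = ds.schema()

print(schema)              # Ray Data Schema with column names and types
print(schema.base_schema)  # the underlying Arrow (or pandas) schema object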
Developer API

Dataset.to_pandas_refs | Converts this Dataset into a distributed set of Pandas dataframes.
Dataset.to_numpy_refs | Converts this Dataset into a distributed set of NumPy ndarrays or dictionary of NumPy ndarrays.
Dataset.to_arrow_refs | Convert this Dataset into a distributed set of PyArrow tables.
Dataset.iter_internal_ref_bundles | Get an iterator over RefBundles belonging to this Dataset.
block.Block | alias of Union[pyarrow.Table, pandas.DataFrame]
block.BlockExecStats | Execution stats for this block.
block.BlockMetadata | Metadata about the block.
block.BlockAccessor | Provides accessor methods for a specific block.
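A small, illustrative sketch of developer-facing block access:

import ray

ds = ray.data.range(100)

# Developer APIs expose the distributed blocks as Ray object references.
pandas_refs = ds.to_pandas_refs()      # list of ObjectRef[pandas.DataFrame]
first_block = ray.get(pandas_refs[0])  # fetch one block to the driver
print(first_block.head())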
Deprecated API

Dataset.iter_tf_batches | Return an iterable over batches of data represented as TensorFlow tensors.
Dataset.to_torch | Return a Torch IterableDataset over this Dataset.