ray.data.Dataset.select_columns

Select one or more columns from the dataset.

Specified columns must be in the dataset schema.

Tip

If you’re reading Parquet files with ray.data.read_parquet(), you may be able to speed up the read by using projection pushdown, which loads only the columns you need; see Parquet column pruning for details.
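As a rough illustration of the tip above, the sketch below passes the wanted columns to ray.data.read_parquet() through its columns argument instead of selecting them after the read. The wrapper read_projected and its reader parameter are hypothetical names introduced here (they are not part of Ray); the reader hook only exists so the sketch can be exercised without Ray or network access.

```python
from typing import Callable, Optional, Sequence


def read_projected(path: str, columns: Sequence[str],
                   reader: Optional[Callable] = None):
    """Read only `columns` from a Parquet source.

    Hypothetical helper: `reader` defaults to ray.data.read_parquet and is
    injectable so the function can be tried out with a stand-in.
    """
    if reader is None:
        # Imported lazily so this module loads even where Ray is not installed.
        import ray
        reader = ray.data.read_parquet
    # `columns` asks the Parquet reader to load just these columns.
    return reader(path, columns=list(columns))


# Intended use with Ray installed (path taken from the example on this page):
# ds = read_projected(
#     "s3://anonymous@ray-example-data/iris.parquet",
#     ["sepal.length", "sepal.width"],
# )
```

Whether this is faster than reading everything and calling select_columns() afterwards depends on the file layout and the Ray version, so treat it as something to measure rather than a guarantee.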

Examples

>>> import ray
>>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
>>> ds.schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double
variety       string
>>> ds.select_columns(["sepal.length", "sepal.width"]).schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double

Time complexity: O(dataset size / parallelism)

Parameters:

cols – Names of the columns to select. If a name isn’t in the dataset schema, an exception is raised. Column names must also be unique.
compute – This argument is deprecated. Use the concurrency argument instead.
concurrency – The number of Ray workers to use concurrently. For a fixed-size worker pool of size n, specify concurrency=n. For an autoscaling worker pool from m to n workers, specify concurrency=(m, n).
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.
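
The constraints on cols described above (every name must be in the schema, and names must be unique) can be checked before the call. The following is a minimal sketch, not Ray's own validation: check_columns is a hypothetical helper, and raising ValueError is an assumption, since this page does not say which exception select_columns() raises.

```python
from collections import Counter
from typing import Iterable, List


def check_columns(schema_names: Iterable[str], cols: Iterable[str]) -> List[str]:
    """Return `cols` as a list if every name is in the schema and unique.

    Hypothetical pre-check; the exception type is an assumption.
    """
    cols = list(cols)
    available = set(schema_names)

    # Names that do not appear in the dataset schema.
    missing = [c for c in cols if c not in available]
    if missing:
        raise ValueError(f"columns not in schema: {missing}")

    # Names that were requested more than once.
    duplicates = [c for c, n in Counter(cols).items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")

    return cols


# Intended use with a Ray dataset `ds`, assuming ds.schema() is available:
# cols = check_columns(ds.schema().names, ["sepal.length", "sepal.width"])
# ds.select_columns(cols, concurrency=4)        # fixed-size pool of 4
```

Note that calling ds.schema() may itself trigger some execution to infer the schema, so this pre-check is mainly useful when the schema is already known.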