ray.data.Dataset.select_columns
- Dataset.select_columns(cols: str | List[str], *, compute: str | ComputeStrategy | None = None, concurrency: int | Tuple[int, int] | None = None, **ray_remote_args) → Dataset
Select one or more columns from the dataset.
Specified columns must be in the dataset schema.
Tip
If you’re reading Parquet files with ray.data.read_parquet(), you might be able to speed up the read by using projection pushdown; see Parquet column pruning for details.
Examples
>>> import ray
>>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
>>> ds.schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double
variety       string
>>> ds.select_columns(["sepal.length", "sepal.width"]).schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
Time complexity: O(dataset size / parallelism)
- Parameters:
cols – Names of the columns to select. If a name isn’t in the dataset schema, an exception is raised. Column names should also be unique.
compute – This argument is deprecated. Use the concurrency argument instead.
concurrency – The number of Ray workers to use concurrently. For a fixed-sized worker pool of size n, specify concurrency=n. For an autoscaling worker pool from m to n workers, specify concurrency=(m, n).
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.