ray.data.preprocessors.Concatenator
- class ray.data.preprocessors.Concatenator(columns: List[str], output_column_name: str = 'concat_out', dtype: numpy.dtype | None = None, raise_if_missing: bool = False)
Bases: Preprocessor

Combine numeric columns into a column of type TensorDtype. Only columns specified in columns will be concatenated.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation. The preprocessor preserves the order of the columns provided in the columns argument and uses that order when calling transform() and transform_batch().

Examples
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator(columns=["X0", "X1"])
>>> concatenator.transform(ds).to_pandas()
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(columns=["X0", "X1"], output_column_name="tensor")
>>> concatenator.transform(ds).to_pandas()
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

>>> concatenator = Concatenator(columns=["X0", "X1"], dtype=np.float32)
>>> concatenator.transform(ds)
Dataset(num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
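The behaviour described above (stack the listed columns in the given order, drop them, and store one vector of shape \((m,)\) per row) can be approximated in plain pandas and NumPy. This is only an illustrative sketch: concat_columns is a hypothetical helper, not part of Ray, and it stores ordinary NumPy arrays rather than TensorArrayElement objects.

```python
import pandas as pd


def concat_columns(df, columns, output_column_name="concat_out", dtype=None):
    """Hypothetical stand-in for the documented behaviour; not Ray's code."""
    # Stack the selected columns side by side, in the order given.
    stacked = df[columns].to_numpy(dtype=dtype)  # shape (n_rows, m)
    # Drop the source columns, then store one array of shape (m,) per row.
    out = df.drop(columns=columns)
    out[output_column_name] = list(stacked)
    return out


df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

result = concat_columns(df, ["X0", "X1"])
first_row = result["concat_out"].iloc[0]

# Reversing the column list reverses the order inside each vector.
reordered = concat_columns(df, ["X1", "X0"])["concat_out"].iloc[1]
```

In this sketch, first_row holds the values of X0 and X1 for the first row, and reordered shows that the order of the columns list determines the order of the elements in each output vector.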
- Parameters:
output_column_name – The desired name for the new column. Defaults to "concat_out".
columns – A list of columns to concatenate. The provided order of the columns will be retained during concatenation.
dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
raise_if_missing – If True, an error is raised if any of the columns in columns don’t exist. Defaults to False.
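To make the dtype parameter more concrete, the sketch below shows how NumPy itself promotes mixed integer and float columns, and how an explicit dtype overrides that. It assumes that "standard coercion rules" refers to NumPy-style type promotion, which the text above does not state explicitly; it does not call Ray.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

# Without an explicit dtype, NumPy promotes int64 and float64 to float64.
common = np.result_type(*df.dtypes.tolist())
default_array = df.to_numpy()

# An explicit dtype (here float32) overrides the promoted type.
as_f32 = df.to_numpy(dtype=np.float32)
```

Under this assumption, concatenating an integer column with a float column yields floating-point vectors unless a dtype is given.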
- Raises:
ValueError – if raise_if_missing is True and a column in columns doesn’t exist in the dataset.
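The check behind raise_if_missing can be pictured as a simple membership test on the column names. The helper below, check_columns, is hypothetical and written against a pandas DataFrame. What Ray does with missing columns when raise_if_missing is False is not described here, so the sketch merely reports them.

```python
import pandas as pd


def check_columns(df, columns, raise_if_missing=False):
    """Hypothetical helper: report or reject columns absent from df."""
    missing = [name for name in columns if name not in df.columns]
    if missing and raise_if_missing:
        raise ValueError(f"Missing columns: {missing}")
    # Assumption: with raise_if_missing=False we only return the names.
    return missing


df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
missing = check_columns(df, ["X0", "X2"])
```

Calling check_columns(df, ["X0", "X2"], raise_if_missing=True) raises ValueError in this sketch, mirroring the documented condition.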
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.