ray.data.preprocessors.OrdinalEncoder
- class ray.data.preprocessors.OrdinalEncoder(columns: List[str], *, encode_lists: bool = True, output_columns: List[str] | None = None)
 Bases: Preprocessor
Encode values within columns as ordered integer values.
OrdinalEncoder encodes categorical features as integers that range from \(0\) to \(n - 1\), where \(n\) is the number of categories.
If you transform a value that isn’t in the fitted dataset, then the value is encoded as float("nan").
Columns must contain either hashable values or lists of hashable values. Also, you can’t have both scalars and lists in the same column.
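The mapping described above can be sketched in plain Python. This is an illustration of the idea only, not Ray's implementation: the helper names `fit_ordinal` and `transform_ordinal` are hypothetical, and assigning indices in sorted order is an assumption that happens to agree with the example outputs on this page.

```python
import math
from typing import Dict, Hashable, Iterable, List


def fit_ordinal(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an index in 0..n-1.

    Sorted order is an assumption of this sketch; values must be
    mutually comparable for sorted() to work.
    """
    return {value: index for index, value in enumerate(sorted(set(values)))}


def transform_ordinal(
    values: Iterable[Hashable], mapping: Dict[Hashable, int]
) -> List[float]:
    """Encode values with a fitted mapping; unseen values become NaN."""
    return [mapping.get(value, float("nan")) for value in values]


levels = ["L4", "L5", "L3", "L4"]
level_mapping = fit_ordinal(levels)                         # three categories, so n = 3
encoded_levels = transform_ordinal(levels, level_mapping)   # indices in 0..2
unseen_encoded = transform_ordinal(["L6"], level_mapping)   # "L6" was never fitted

print(level_mapping)
print(encoded_levels)
print(math.isnan(unseen_encoded[0]))
```

Because NaN is a float, a column that contains an unseen value can no longer be a pure integer column, which is worth keeping in mind for downstream code that expects integer codes.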
Examples
Use OrdinalEncoder to encode categorical features as integers.
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import OrdinalEncoder
>>> df = pd.DataFrame({
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)
>>> encoder = OrdinalEncoder(columns=["sex", "level"])
>>> encoder.fit_transform(ds).to_pandas()
   sex  level
0    1      1
1    0      2
2    1      0
3    0      1
OrdinalEncoder can also be used in append mode by providing the names of the output_columns that should hold the encoded values.
>>> encoder = OrdinalEncoder(columns=["sex", "level"], output_columns=["sex_encoded", "level_encoded"])
>>> encoder.fit_transform(ds).to_pandas()
      sex level  sex_encoded  level_encoded
0    male    L4            1              1
1  female    L5            0              2
2    male    L3            1              0
3  female    L4            0              1
If you transform a value not present in the original dataset, then the value is encoded as float("nan").
>>> df = pd.DataFrame({"sex": ["female"], "level": ["L6"]})
>>> ds = ray.data.from_pandas(df)
>>> encoder.transform(ds).to_pandas()
   sex  level
0    0    NaN
OrdinalEncoder can also encode categories in a list.
>>> df = pd.DataFrame({
...     "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
...     "genre": [
...         ["comedy", "action", "sports"],
...         ["animation", "comedy", "action"],
...         ["documentary"],
...     ],
... })
>>> ds = ray.data.from_pandas(df)
>>> encoder = OrdinalEncoder(columns=["genre"])
>>> encoder.fit_transform(ds).to_pandas()
                            name      genre
0                 Shaolin Soccer  [2, 0, 4]
1                          Moana  [1, 2, 0]
2  The Smartest Guys in the Room        [3]
- Parameters:
 columns – The columns to separately encode.
 encode_lists – If True, encode list elements. If False, encode whole lists (i.e., replace each list with an integer). True by default.
 output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error will be raised.
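The difference between the two encode_lists settings can be illustrated with a plain-Python sketch. It does not call Ray and is not Ray's implementation: converting lists to tuples and indexing categories in sorted order are assumptions of the sketch, so the integers it assigns to whole lists may differ from what OrdinalEncoder produces.

```python
from typing import Dict, List, Tuple

genres: List[List[str]] = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]

# Analogue of encode_lists=True: the categories are the individual
# elements across all lists, and each list keeps its shape.
element_index: Dict[str, int] = {
    genre: i
    for i, genre in enumerate(sorted({g for row in genres for g in row}))
}
encoded_elements = [[element_index[g] for g in row] for row in genres]

# Analogue of encode_lists=False: each whole list is one category.
# Lists are unhashable, so this sketch uses tuples as keys (an assumption).
list_index: Dict[Tuple[str, ...], int] = {
    key: i for i, key in enumerate(sorted({tuple(row) for row in genres}))
}
encoded_lists = [list_index[tuple(row)] for row in genres]

print(encoded_elements)  # one list of integers per row
print(encoded_lists)     # one integer per row
```

With element-wise encoding, two rows that share a genre share an integer for it; with whole-list encoding, rows only match when their lists are identical, including element order.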
See also
OneHotEncoder: Another preprocessor that encodes categorical data.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.