ray.data.preprocessors.CustomKBinsDiscretizer
- class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Iterable[float] | pandas.IntervalIndex | Dict[str, Iterable[float] | pandas.IntervalIndex], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None)
Bases: _AbstractKBinsDiscretizer
Bin values into discrete intervals using custom bin edges.
Columns must contain numerical values.
Examples
Use CustomKBinsDiscretizer to bin continuous features.
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CustomKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25]
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0        0        2
1        1        3
2        1        3
3        2        3
4        2        3
5        1        3
You can also specify different bin edges per column.
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]},
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0      0.0        0
1      1.0        0
2      1.0        0
3      NaN        0
4      NaN        1
5      1.0        1
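The bin indices in the outputs above can be reproduced by hand. The following is a minimal standard-library sketch of the default edge handling (right=True, include_lowest=False), not the preprocessor's actual implementation. The helper name bin_index is hypothetical and exists only for this illustration. It shows why 6.2 and 9.7 become NaN when the edges for value_1 stop at 4.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def bin_index(value: float, edges: Sequence[float]) -> Optional[int]:
    """Return i such that edges[i] < value <= edges[i + 1], or None.

    Hypothetical helper for illustration only. None stands in for the
    NaN that appears when a value falls outside every interval.
    """
    # Values at or below the first edge, or above the last edge, match no bin.
    if value <= edges[0] or value > edges[-1]:
        return None
    # bisect_left finds the first edge >= value; the bin sits just before it.
    return bisect_left(edges, value) - 1


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

# Shared edges, as in the first example.
shared = [bin_index(v, [0, 1, 4, 10, 25]) for v in value_1]

# Per-column edges, as in the second example.
per_column_1 = [bin_index(v, [0, 1, 4]) for v in value_1]
per_column_2 = [bin_index(v, [0, 18, 35, 70]) for v in value_2]

print(shared)
print(per_column_1)
print(per_column_2)
```

Because every interval is open on the left under these defaults, a value exactly equal to the first edge is also left unbinned.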
- Parameters:
columns – The columns to discretize.
bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that a pd.IntervalIndex used for bins must be non-overlapping.
right – Indicates whether bins include the rightmost edge.
include_lowest – Indicates whether the first interval should be left-inclusive.
duplicates – Can be either 'raise' or 'drop'. If bin edges are not unique, raise ValueError or drop non-uniques.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don't include a column in dtypes or specify it as an integer dtype, the output column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column will have that pd.CategoricalDtype, with its categories mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
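The right, include_lowest, and duplicates parameters share their names with arguments of pandas.cut, so their effect can be sketched with pandas alone. This is an illustration under the assumption that the preprocessor follows the same edge semantics as pandas.cut; it does not call Ray. The last call shows one plausible way categorical labels could map onto bins, which is likewise an assumption rather than the library's documented behavior.

```python
import pandas as pd

values = pd.Series([0, 1, 2, 4])
edges = [0, 1, 4]

# right=True (the default): intervals are (0, 1] and (1, 4], so 0 is unbinned.
right_closed = pd.cut(values, bins=edges, right=True, labels=False)

# include_lowest=True additionally closes the first interval on the left.
with_lowest = pd.cut(
    values, bins=edges, right=True, include_lowest=True, labels=False
)

# right=False: intervals are [0, 1) and [1, 4), so 4 is unbinned.
left_closed = pd.cut(values, bins=edges, right=False, labels=False)

# duplicates="drop" collapses the repeated edge instead of raising ValueError.
deduplicated = pd.cut(values, bins=[0, 1, 1, 4], duplicates="drop", labels=False)

# Assumed analogue of the dtypes option: an ordered categorical whose
# categories label the bins from lowest to highest.
dtype = pd.CategoricalDtype(["low", "high"], ordered=True)
labeled = pd.cut(
    values,
    bins=edges,
    include_lowest=True,
    labels=list(dtype.categories),
    ordered=dtype.ordered,
)

print(right_closed.tolist())
print(with_lowest.tolist())
print(left_closed.tolist())
print(labeled.astype(str).tolist())
```

Integer bin indices are compact, while an ordered categorical keeps readable labels and still allows comparisons such as "low" < "high".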
See also
UniformKBinsDiscretizer
If you want to bin data into uniform-width bins.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.