skippa package

Subpackages

Submodules

skippa.app module

skippa.pipeline module

Defining a Skippa pipeline

>>> import numpy as np
>>> import pandas as pd
>>> from skippa import Skippa, columns
>>> from sklearn.linear_model import LogisticRegression
>>> X = pd.DataFrame({
...     'q': [2, 3, 4],
...     'x': ['a', 'b', 'c'],
...     'y': [1, 16, 1000],
...     'z': [0.4, None, 8.7]
... })
>>> y = np.array([0, 0, 1])
>>> pipe = (
...     Skippa()
...         .impute(columns(dtype_include='number'), strategy='median')
...         .scale(columns(dtype_include='number'), type='standard')
...         .onehot(columns(['x']))
...         .select(columns(['y', 'z']) + columns(pattern='x_*'))
...         .model(LogisticRegression())
... )
>>> pipe.fit(X=X, y=y)
>>> predictions = pipe.predict_proba(X)
class skippa.pipeline.Skippa(**kwargs)[source]

Bases: object

Skippa pipeline class

A Skippa pipeline can be extended by piping transformation commands. Only a fixed set of implemented transformations is supported. Although these transformations use existing scikit-learn transformers, each one requires a specific wrapper that implements pandas DataFrame support

append(pipe)[source]

An alias for adding: extends this pipeline with the steps of another one

Return type

Skippa

apply(*args, **kwargs)[source]

Apply a function to the dataframe.

This is a wrapper around pandas’ .apply method and uses the same syntax.

Parameters
  • *args – first arg should be the function to apply

  • **kwargs – e.g. axis to apply function on

Returns

just return itself again (so we can use piping)

Return type

Skippa
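
Example (a minimal, illustrative sketch; np.sqrt is applied elementwise, as with pandas’ .apply):

>>> import numpy as np
>>> from skippa import Skippa
>>> pipe = Skippa().apply(np.sqrt)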

as_type(*args, **kwargs)[source]

Alias for .cast

Return type

Skippa

assign(**kwargs)[source]

Create new columns based on data in existing columns

This is a wrapper around pandas’ .assign method and uses the same syntax.

Parameters

**kwargs – keyword args denoting new_column=assignment_function pairs

Returns

just return itself again (so we can use piping)

Return type

Skippa
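
Example (a sketch; the column names are illustrative):

>>> from skippa import Skippa
>>> pipe = Skippa().assign(z_doubled=lambda df: df['z'] * 2)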

astype(*args, **kwargs)[source]

Alias for .cast

Return type

Skippa

build(**kwargs)[source]

Build into a scikit-learn Pipeline

Returns

An sklearn Pipeline that supports .fit, .transform

Return type

Pipeline

cast(cols, dtype)[source]

Cast column to another data type.

Parameters
  • cols (ColumnSelector) – the columns to cast

  • dtype – the target data type

Returns

just return itself again (so we can use piping)

Return type

Skippa
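
Example (a sketch; 'q' is an illustrative column name):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().cast(columns(['q']), 'float64')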

concat(pipe)[source]

Concatenate output of this pipeline to another.

Where adding/appending extends the pipeline, concat keeps parallel pipelines and concatenates their outcomes.

Parameters

pipe (Skippa) – the parallel pipeline whose output to concatenate to this one’s

Returns

just return itself again (so we can use piping)

Return type

Skippa
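
Example (a sketch with two illustrative parallel branches, as in the pipeline example above):

>>> from skippa import Skippa, columns
>>> pipe_num = Skippa().impute(columns(dtype_include='number'), strategy='median')
>>> pipe_cat = Skippa().onehot(columns(['x']))
>>> pipe = pipe_num.concat(pipe_cat)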

encode_date(cols, **kwargs)[source]

A date cannot be used unless you encode it into features.

This encoder creates new features out of the year, month, day etc.

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional keywords like <datepart>=True/False, indicating whether to use dt.<datepart> as a new feature

Returns

just return itself again (so we can use piping)

Return type

Skippa
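
Example (a sketch; the column name and chosen dateparts are illustrative):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().encode_date(columns(['signup_date']), year=True, month=True, day=False)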

fillna(cols, value)[source]

Alias/shortcut for impute with constant value (after pandas’ .fillna).

This implementation doesn’t use pandas.DataFrame.fillna(), but sklearn’s SimpleImputer

Parameters
  • cols (ColumnSelector) – columns specification

  • value – the constant value used to fill missing entries

Returns

just return itself again (so we can use piping)

Return type

Skippa
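
Example (a sketch; 'z' is an illustrative column name):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().fillna(columns(['z']), value=0.0)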

impute(cols, **kwargs)[source]

Skippa wrapper around sklearn’s SimpleImputer

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for SimpleImputer (e.g. strategy='median')

Returns

just return itself again (so we can use piping)

Return type

Skippa

label_encode(cols, **kwargs)[source]

Wrapper around sklearn’s LabelEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for LabelEncoder

Returns

just return itself again (so we can use piping)

Return type

Skippa

static load(path)[source]

Load a previously saved skippa

N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.

Parameters

path (PathLike) – pathname, either string or pathlib.Path

Returns

an sklearn Pipeline

Return type

Pipeline

static load_pipeline(path)[source]

Load a previously saved pipeline

N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.

Parameters

path (PathLike) – pathname, either string or pathlib.Path

Returns

an extended sklearn Pipeline

Return type

SkippaPipeline
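
Example of a save/load round trip (a sketch; the file name is illustrative, and pipe and X are assumed from the pipeline example above):

>>> from skippa import Skippa
>>> pipe.save('pipeline.dill')
>>> pipe = Skippa.load_pipeline('pipeline.dill')
>>> predictions = pipe.predict_proba(X)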

model(model)[source]

Add a model estimator.

A model estimator is always the last step in the pipeline! Therefore this doesn’t return the Skippa object (self) but calls the .build method to return the pipeline.

Parameters

model (BaseEstimator) – An sklearn estimator

Returns

a built pipeline

Return type

SkippaPipeline

onehot(cols, **kwargs)[source]

Skippa wrapper around sklearn’s OneHotEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for OneHotEncoder (although ‘sparse’ will always be set to False)

Returns

just return itself again (so we can use piping)

Return type

Skippa

ordinal_encode(cols, **kwargs)[source]

Wrapper around sklearn’s OrdinalEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for OrdinalEncoder

Returns

just return itself again (so we can use piping)

Return type

Skippa

pca(cols, **kwargs)[source]

Wrapper around sklearn.decomposition.PCA

Parameters
  • cols (ColumnSelector) – columns expression

  • kwargs – any kwargs to be used by PCA’s __init__

Returns

just return itself again (so we can use piping)

Return type

Skippa
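
Example (a sketch; n_components is passed through to PCA’s __init__):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().pca(columns(dtype_include='number'), n_components=2)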

rename(*args, **kwargs)[source]

Rename certain columns.

Two ways to use this:

  • a dict which defines a mapping {existing_col: new_col}

  • a column selector and a renaming function (e.g. [‘a’, ‘b’, ‘c’], lambda c: f’new_{c}’)

It adds an XRenamer step, which wraps around pandas’ .rename method.

Returns

just return itself again (so we can use piping)

Return type

Skippa
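
Examples of both styles (sketches; the column names are illustrative):

>>> from skippa import Skippa
>>> pipe = Skippa().rename({'old_name': 'new_name'})
>>> pipe = Skippa().rename(['a', 'b', 'c'], lambda c: f'new_{c}')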

save(file_path)[source]

Save to disk using dill

Return type

None

scale(cols, type='standard', **kwargs)[source]

Skippa wrapper around sklearn’s StandardScaler / MinMaxScaler

Parameters
  • cols (ColumnSelector) – columns specification

  • type (str, optional) – One of [‘standard’, ‘minmax’]. Defaults to ‘standard’.

Raises

ValueError – if an unknown/unsupported scaler type is passed

Returns

just return itself again (so we can use piping)

Return type

Skippa

select(cols)[source]

Apply a column selection

Parameters

cols (ColumnSelector) – the columns to keep

Returns

just return itself again (so we can use piping)

Return type

Skippa

class skippa.pipeline.SkippaPipeline(steps, *, memory=None, verbose=False)[source]

Bases: Pipeline

Extension of sklearn’s Pipeline object.

While the Skippa class is for creating pipelines, it is not a pipeline itself. Only after adding a model estimator step, or by calling .build explicitly, is a SkippaPipeline created. This is basically an sklearn Pipeline with some added methods.

create_gradio_app(**kwargs)[source]

Create a Gradio app for model inspection.

Parameters

**kwargs – kwargs received by Gradio’s Interface() initialisation

Returns

Gradio Interface object -> call .launch to start the app

Return type

gr.Interface
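
Example (a sketch; assumes a fitted pipeline and that gradio is installed):

>>> app = pipe.create_gradio_app()
>>> app.launch()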

fit(X, y=None, **kwargs)[source]

Inspect input data before fitting the pipeline.

Return type

SkippaPipeline

get_data_profile()[source]

The DataProfile is used in the Gradio app.

The profile contains information on column names, their dtypes and value ranges.

Raises

NotFittedError – If pipeline has not been fitted there is no data profile yet.

Returns

Simple object containing necessary info

Return type

DataProfile

get_model()[source]

Get the model estimator part of the pipeline.

This lets you access information such as coefficients.

Returns

fitted model

Return type

BaseEstimator
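
Example (a sketch; .coef_ assumes a fitted linear model such as LogisticRegression):

>>> model = pipe.get_model()
>>> model.coef_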

get_pipeline_params(params)[source]

Translate model param grid to Pipeline param grid.

For GridSearch over a Pipeline, you need to supply a param grid in the form of { <stepname>__<paramname>: values }. Since it’s non-trivial to find the name of the model/estimator step in the Pipeline, this method auto-detects it and returns a new param grid in the right format.

Parameters

params (Dict) – param grid with parameter names containing only the model parameter

Returns

param grid with parameter names relating to both the pipeline step and the model parameter

Return type

Dict
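
Example use with GridSearchCV (a sketch; the parameter names and values are illustrative, for a LogisticRegression model step):

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = pipe.get_pipeline_params({'C': [0.1, 1.0, 10.0]})
>>> search = GridSearchCV(pipe, param_grid).fit(X, y)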

save(file_path)[source]

Save to disk using dill

Return type

None

steps: List[Any]

test(X, up_to_step=-1)[source]

Test what happens to data in a pipeline.

This allows you to execute the pipeline up to and including the last step before modeling (or any other step) and get the resulting data.

Parameters
  • X (pd.DataFrame) – input data to run through the pipeline

  • up_to_step (int, optional) – index of the step up to which the pipeline is executed. Defaults to -1 (i.e. all steps before the final model step).

Returns

the transformed data after the given step

Return type

pd.DataFrame
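
Example (a sketch; with the default up_to_step, this runs all steps before the model and returns the intermediate data):

>>> df_transformed = pipe.test(X)
>>> df_transformed.head()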

skippa.profile module

DataProfile is used for storing and retrieving metadata of data that is used in the pipeline. Typically the DataProfile is created during fitting of a pipeline. The profile is used by the Gradio app that can be created.

class skippa.profile.DataProfile(df, y=None)[source]

Bases: object

MAX_NUM_DISTINCT_VALUES = 100000

is_classification()[source]

Return type

bool

is_regression()[source]

Return type

bool

skippa.utils module

skippa.utils.get_dummy_data(nrows=100, nfloat=4, nint=2, nchar=3, ndate=1, missing=True, binary_y=True)[source]

Create dummy data.

Parameters
  • nrows (int, optional) – Number of total rows. Defaults to 100.

  • nfloat (int, optional) – Number of float columns. Defaults to 4.

  • nint (int, optional) – Number of integer columns. Defaults to 2.

  • nchar (int, optional) – Number of character/categorical columns. Defaults to 3.

  • ndate (int, optional) – Number of date columns. Defaults to 1.

  • missing (bool, optional) – If True, the generated data contains missing values. Defaults to True.

  • binary_y (bool, optional) – If True, returns 0’s & 1’s for y, otherwise float values between 0 & 100. Defaults to True.

Returns

A pandas DataFrame for features and a numpy array for labels

Return type

Tuple[pd.DataFrame, np.ndarray]
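
Example (a sketch):

>>> from skippa.utils import get_dummy_data
>>> X, y = get_dummy_data(nrows=50, missing=False)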

Module contents

Top-level package for skippa.

The pipeline module defines the main Skippa methods. The transformers subpackage contains the various transformers used in the pipeline.