skippa package

Subpackages

Submodules

skippa.app module

skippa.pipeline module

Defining a Skippa pipeline

>>> import numpy as np
>>> import pandas as pd
>>> from skippa import Skippa, columns
>>> from sklearn.linear_model import LogisticRegression
>>> X = pd.DataFrame({
...     'q': [2, 3, 4],
...     'x': ['a', 'b', 'c'],
...     'y': [1, 16, 1000],
...     'z': [0.4, None, 8.7]
... })
>>> y = np.array([0, 0, 1])
>>> pipe = (
...     Skippa()
...         .impute(columns(dtype_include='number'), strategy='median')
...         .scale(columns(dtype_include='number'), type='standard')
...         .onehot(columns(['x']))
...         .select(columns(['y', 'z']) + columns(pattern='x_*'))
...         .model(LogisticRegression())
... )
>>> pipe.fit(X=X, y=y)
>>> predictions = pipe.predict_proba(X)
class skippa.pipeline.Skippa(**kwargs)[source]

Bases: object

Skippa pipeline class

A Skippa pipeline can be extended by piping transformation commands. Only a fixed set of implemented transformations is supported. Although these transformations use existing scikit-learn transformers, each one requires a specific wrapper that implements pandas DataFrame support

append(pipe)[source]

An alias for adding: extends this pipeline with the steps of another one

Return type

Skippa

apply(*args, **kwargs)[source]

Apply a function to the dataframe.

This is a wrapper around pandas’ .apply method and uses the same syntax.

Parameters
  • *args – first arg should be the function to apply

  • **kwargs – e.g. axis to apply function on

Returns

just return itself again (so we can use piping)

Return type

Skippa
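
Example (a minimal, illustrative sketch; np.sqrt is applied elementwise, as with pandas’ .apply):

>>> import numpy as np
>>> from skippa import Skippa
>>> pipe = Skippa().apply(np.sqrt)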

as_type(*args, **kwargs)[source]

Alias for .cast

Return type

Skippa

assign(**kwargs)[source]

Create new columns based on data in existing columns

This is a wrapper around pandas’ .assign method and uses the same syntax.

Parameters

**kwargs – keyword args denoting new_column=assignment_function pairs

Returns

just return itself again (so we can use piping)

Return type

Skippa
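
Example (a sketch; the column names are illustrative):

>>> from skippa import Skippa
>>> pipe = Skippa().assign(z_doubled=lambda df: df['z'] * 2)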

astype(*args, **kwargs)[source]

Alias for .cast

Return type

Skippa

build(**kwargs)[source]

Build into a scikit-learn Pipeline

Returns

An sklearn Pipeline that supports .fit, .transform

Return type

Pipeline

cast(cols, dtype)[source]

Cast column to another data type.

Parameters
  • cols (ColumnSelector) – the columns to cast

  • dtype – the target data type

Returns

just return itself again (so we can use piping)

Return type

Skippa
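
Example (a sketch; 'q' is an illustrative column name):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().cast(columns(['q']), 'float64')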

concat(pipe)[source]

Concatenate output of this pipeline to another.

Where adding/appending extends the pipeline, concat keeps parallel pipelines and concatenates their outcomes.

Parameters

pipe (Skippa) – the parallel pipeline whose output to concatenate to this one’s

Returns

just return itself again (so we can use piping)

Return type

Skippa
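
Example (a sketch with two illustrative parallel branches, as in the pipeline example above):

>>> from skippa import Skippa, columns
>>> pipe_num = Skippa().impute(columns(dtype_include='number'), strategy='median')
>>> pipe_cat = Skippa().onehot(columns(['x']))
>>> pipe = pipe_num.concat(pipe_cat)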

encode_date(cols, **kwargs)[source]

A date cannot be used unless you encode it into features.

This encoder creates new features out of the year, month, day etc.

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional keywords like <datepart>=True/False, indicating whether to use dt.<datepart> as a new feature

Returns

just return itself again (so we can use piping)

Return type

Skippa
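
Example (a sketch; the column name and chosen dateparts are illustrative):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().encode_date(columns(['signup_date']), year=True, month=True, day=False)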

fillna(cols, value)[source]

Alias/shortcut for impute with constant value (after pandas’ .fillna).

This implementation doesn’t use pandas.DataFrame.fillna(), but sklearn’s SimpleImputer

Parameters
  • cols (ColumnSelector) – columns specification

  • value – the constant value used to fill missing entries

Returns

just return itself again (so we can use piping)

Return type

Skippa
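
Example (a sketch; 'z' is an illustrative column name):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().fillna(columns(['z']), value=0.0)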

impute(cols, **kwargs)[source]

Skippa wrapper around sklearn’s SimpleImputer

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for SimpleImputer (e.g. strategy='median')

Returns

just return itself again (so we can use piping)

Return type

Skippa

label_encode(cols, **kwargs)[source]

Wrapper around sklearn’s LabelEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for LabelEncoder

Returns

just return itself again (so we can use piping)

Return type

Skippa

static load(path)[source]

Load a previously saved skippa

N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.

Parameters

path (PathLike) – pathname, either string or pathlib.Path

Returns

an sklearn Pipeline

Return type

Pipeline

static load_pipeline(path)[source]

Load a previously saved pipeline

N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.

Parameters

path (PathLike) – pathname, either string or pathlib.Path

Returns

an extended sklearn Pipeline

Return type

SkippaPipeline
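
Example of a save/load round trip (a sketch; the file name is illustrative, and pipe and X are assumed from the pipeline example above):

>>> from skippa import Skippa
>>> pipe.save('pipeline.dill')
>>> pipe = Skippa.load_pipeline('pipeline.dill')
>>> predictions = pipe.predict_proba(X)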

model(model)[source]

Add a model estimator.

A model estimator is always the last step in the pipeline! Therefore this doesn’t return the Skippa object (self) but calls the .build method to return the pipeline.

Parameters

model (BaseEstimator) – An sklearn estimator

Returns

a built pipeline

Return type

SkippaPipeline

onehot(cols, **kwargs)[source]

Skippa wrapper around sklearn’s OneHotEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for OneHotEncoder (although ‘sparse’ will always be set to False)

Returns

just return itself again (so we can use piping)

Return type

Skippa

ordinal_encode(cols, **kwargs)[source]

Wrapper around sklearn’s OrdinalEncoder

Parameters
  • cols (ColumnSelector) – columns specification

  • **kwargs – optional kwargs for OrdinalEncoder

Returns

just return itself again (so we can use piping)

Return type

Skippa

pca(cols, **kwargs)[source]

Wrapper around sklearn.decomposition.PCA

Parameters
  • cols (ColumnSelector) – columns expression

  • kwargs – any kwargs to be used by PCA’s __init__

Returns

just return itself again (so we can use piping)

Return type

Skippa
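
Example (a sketch; n_components is passed through to PCA’s __init__):

>>> from skippa import Skippa, columns
>>> pipe = Skippa().pca(columns(dtype_include='number'), n_components=2)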

rename(*args, **kwargs)[source]

Rename certain columns.

Two ways to use this:

  • a dict which defines a mapping {existing_col: new_col}

  • a column selector and a renaming function (e.g. [‘a’, ‘b’, ‘c’], lambda c: f’new_{c}’)

It adds an XRenamer step, which wraps around pandas’ .rename method.

Returns

just return itself again (so we can use piping)

Return type

Skippa
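
Examples of both styles (sketches; the column names are illustrative):

>>> from skippa import Skippa
>>> pipe = Skippa().rename({'old_name': 'new_name'})
>>> pipe = Skippa().rename(['a', 'b', 'c'], lambda c: f'new_{c}')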

save(file_path)[source]

Save to disk using dill

Return type

None

scale(cols, type='standard', **kwargs)[source]

Skippa wrapper around sklearn’s StandardScaler / MinMaxScaler

Parameters
  • cols (ColumnSelector) – columns specification

  • type (str, optional) – One of [‘standard’, ‘minmax’]. Defaults to ‘standard’.

Raises

ValueError – if an unknown/unsupported scaler type is passed

Returns

just return itself again (so we can use piping)

Return type

Skippa

select(cols)[source]

Apply a column selection

Parameters

cols (ColumnSelector) – the columns to keep

Returns

just return itself again (so we can use piping)

Return type

Skippa

class skippa.pipeline.SkippaPipeline(steps, *, memory=None, verbose=False)[source]

Bases: Pipeline

Extension of sklearn’s Pipeline object.

While the Skippa class is for creating pipelines, it is not a pipeline itself. Only after adding a model estimator step, or by calling .build explicitly, is a SkippaPipeline created. This is basically an sklearn Pipeline with some added methods.

create_gradio_app(**kwargs)[source]

Create a Gradio app for model inspection.

Parameters

**kwargs – kwargs received by Gradio’s Interface() initialisation

Returns

Gradio Interface object -> call .launch to start the app

Return type

gr.Interface
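
Example (a sketch; assumes a fitted pipeline and that gradio is installed):

>>> app = pipe.create_gradio_app()
>>> app.launch()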

fit(X, y=None, **kwargs)[source]

Inspect input data before fitting the pipeline.

Return type

SkippaPipeline

get_data_profile()[source]

The DataProfile is used in the Gradio app.

The profile contains information on column names, their dtypes and value ranges.

Raises

NotFittedError – If pipeline has not been fitted there is no data profile yet.

Returns

Simple object containing necessary info

Return type

DataProfile

get_model()[source]

Get the model estimator part of the pipeline.

This lets you access information such as coefficients.

Returns

fitted model

Return type

BaseEstimator
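
Example (a sketch; .coef_ assumes a fitted linear model such as LogisticRegression):

>>> model = pipe.get_model()
>>> model.coef_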

get_pipeline_params(params)[source]

Translate model param grid to Pipeline param grid.

For GridSearch over a Pipeline, you need to supply a param grid in the form of { <stepname>__<paramname>: values }. Since it’s non-trivial to find the name of the model/estimator step in the Pipeline, this method auto-detects it and returns a new param grid in the right format.

Parameters

params (Dict) – param grid with parameter names containing only the model parameter

Returns

param grid with parameter names relating to both the pipeline step and the model parameter

Return type

Dict
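
Example use with GridSearchCV (a sketch; the parameter names and values are illustrative, for a LogisticRegression model step):

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = pipe.get_pipeline_params({'C': [0.1, 1.0, 10.0]})
>>> search = GridSearchCV(pipe, param_grid).fit(X, y)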

save(file_path)[source]

Save to disk using dill

Return type

None

steps: List[Any]

test(X, up_to_step=-1)[source]

Test what happens to data in a pipeline.

This allows you to execute the pipeline up to and including the last step before modeling (or any other step) and get the resulting data.

Parameters
  • X (pd.DataFrame) – input data to run through the pipeline

  • up_to_step (int, optional) – index of the step up to which the pipeline is executed. Defaults to -1 (i.e. all steps before the final model step).

Returns

the transformed data after the given step

Return type

pd.DataFrame
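
Example (a sketch; with the default up_to_step, this runs all steps before the model and returns the intermediate data):

>>> df_transformed = pipe.test(X)
>>> df_transformed.head()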

skippa.profile module

DataProfile is used for storing and retrieving metadata of data that is used in the pipeline. Typically the DataProfile is created during fitting of a pipeline. The profile is used by the Gradio app that can be created.

class skippa.profile.DataProfile(df, y=None)[source]

Bases: object

MAX_NUM_DISTINCT_VALUES = 100000

is_classification()[source]

Return type

bool

is_regression()[source]

Return type

bool

skippa.utils module

skippa.utils.get_dummy_data(nrows=100, nfloat=4, nint=2, nchar=3, ndate=1, missing=True, binary_y=True)[source]

Create dummy data.

Parameters
  • nrows (int, optional) – Number of total rows. Defaults to 100.

  • nfloat (int, optional) – Number of float columns. Defaults to 4.

  • nint (int, optional) – Number of integer columns. Defaults to 2.

  • nchar (int, optional) – Number of character/categorical columns. Defaults to 3.

  • ndate (int, optional) – Number of date columns. Defaults to 1.

  • missing (bool, optional) – If True, the generated data contains missing values. Defaults to True.

  • binary_y (bool, optional) – If True, returns 0’s & 1’s for y, otherwise float values between 0 & 100. Defaults to True.

Returns

A pandas DataFrame for features and a numpy array for labels

Return type

Tuple[pd.DataFrame, np.ndarray]
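
Example (a sketch):

>>> from skippa.utils import get_dummy_data
>>> X, y = get_dummy_data(nrows=50, missing=False)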

Module contents

Top-level package for skippa.

The pipeline module defines the main Skippa methods. The transformers subpackage contains the various transformers used in the pipeline.