skippa package
Subpackages
- skippa.transformers package
- Submodules
- skippa.transformers.base module
- skippa.transformers.custom module
- skippa.transformers.sklearn module
- Module contents
Submodules
skippa.app module
skippa.pipeline module
Defining a Skippa pipeline
>>> import numpy as np
>>> import pandas as pd
>>> from skippa import Skippa, columns
>>> from sklearn.linear_model import LogisticRegression
>>> X = pd.DataFrame({
...     'q': [2, 3, 4],
...     'x': ['a', 'b', 'c'],
...     'y': [1, 16, 1000],
...     'z': [0.4, None, 8.7]
... })
>>> y = np.array([0, 0, 1])
>>> pipe = (
...     Skippa()
...     .impute(columns(dtype_include='number'), strategy='median')
...     .scale(columns(dtype_include='number'), type='standard')
...     .onehot(columns(['x']))
...     .select(columns(['y', 'z']) + columns(pattern='x_*'))
...     .model(LogisticRegression())
... )
>>> pipe.fit(X=X, y=y)
>>> predictions = pipe.predict_proba(X)
- class skippa.pipeline.Skippa(**kwargs)[source]
Bases:
object
Skippa pipeline class
A Skippa pipeline can be extended by piping transformation commands. Only a limited set of implemented transformations is supported: although these transformations use existing scikit-learn transformers, each one requires a specific wrapper that adds pandas DataFrame support.
- apply(*args, **kwargs)[source]
Apply a function to the dataframe.
This is a wrapper around pandas’ .apply method and uses the same syntax.
- Parameters
*args – first arg should be the function to apply
**kwargs – e.g. axis to apply function on
- Returns
just return itself again (so we can use piping)
- Return type
- assign(**kwargs)[source]
Create new columns based on data in existing columns
This is a wrapper around pandas’ .assign method and uses the same syntax.
- Parameters
**kwargs – keyword args denoting new_column=assignment_function pairs
- Returns
just return itself again (so we can use piping)
- Return type
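Because assign uses the same syntax as pandas’ .assign, the shape of the keyword arguments can be illustrated with plain pandas. The following is a minimal sketch that does not use Skippa itself; the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"y": [1, 16, 1000], "z": [0.4, 2.0, 8.7]})

# Each keyword is a new_column=assignment_function pair. The function
# receives the whole DataFrame and returns the values for the new column.
result = df.assign(
    y_times_z=lambda d: d["y"] * d["z"],
    y_is_large=lambda d: d["y"] > 10,
)

print(result.columns.tolist())
```

pandas returns a new DataFrame here and leaves the original untouched.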
- build(**kwargs)[source]
Build into a scikit-learn Pipeline
- Returns
An sklearn Pipeline that supports .fit, .transform
- Return type
Pipeline
- cast(cols, dtype)[source]
Cast column to another data type.
- Parameters
cols (ColumnSelector) – columns specification
dtype – the data type to cast the columns to
- Returns
just return itself again (so we can use piping)
- Return type
- concat(pipe)[source]
Concatenate output of this pipeline to another.
Where adding/appending extends the pipeline, concat keeps parallel pipelines and concatenates their outcomes.
- encode_date(cols, **kwargs)[source]
A date cannot be used unless you encode it into features.
This encoder creates new features out of the year, month, day etc.
- Parameters
cols (ColumnSelector) – columns specification
**kwargs – optional keywords like <datepart>=True/False, indicating whether to use dt.<datepart> as a new feature
- Returns
just return itself again (so we can use piping)
- Return type
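To make the idea of date-part features concrete, here is a rough sketch in plain pandas (not Skippa's own implementation). The chosen date parts and the `<column>_<datepart>` naming scheme are assumptions made for this example only.

```python
import pandas as pd

df = pd.DataFrame({"signup": pd.to_datetime(["2021-01-15", "2022-12-31"])})

# Hypothetical selection of date parts; each becomes a numeric feature
# read from the corresponding dt.<datepart> accessor.
dateparts = ["year", "month", "day"]
features = pd.DataFrame(
    {f"signup_{part}": getattr(df["signup"].dt, part) for part in dateparts}
)

print(features)
```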
- fillna(cols, value)[source]
Alias/shortcut for impute with constant value (after pandas’ .fillna).
This implementation doesn’t use pandas.DataFrame.fillna(), but sklearn’s SimpleImputer
- Parameters
cols (ColumnSelector) – columns specification
value – the constant value used to replace missing values
- Returns
just return itself again (so we can use piping)
- Return type
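The constant-value imputation that fillna relies on can be sketched with sklearn’s SimpleImputer directly. This example uses plain scikit-learn rather than Skippa, and the fill value of 0.0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"z": [0.4, np.nan, 8.7]})

# strategy="constant" replaces every missing value with fill_value,
# which has the same effect as df.fillna(0.0) on this column.
imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled = np.asarray(imputer.fit_transform(df[["z"]]))

print(filled.ravel().tolist())
```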
- impute(cols, **kwargs)[source]
Skippa wrapper around sklearn’s SimpleImputer
- Parameters
cols (ColumnSelector) – columns specification
**kwargs – optional kwargs for SimpleImputer (e.g. strategy)
- Returns
just return itself again (so we can use piping)
- Return type
- label_encode(cols, **kwargs)[source]
Wrapper around sklearn’s LabelEncoder
- Parameters
cols (ColumnSelector) – columns specification
**kwargs – optional kwargs for LabelEncoder
- Returns
just return itself again (so we can use piping)
- Return type
- static load(path)[source]
Load a previously saved skippa
N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.
- Parameters
path (PathLike) – pathname, either string or pathlib.Path
- Returns
an sklearn Pipeline
- Return type
Pipeline
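The note about lambda functions can be checked with the standard library alone. The sketch below only shows the pickle limitation that motivates using dill; it does not call Skippa or dill.

```python
import pickle

# A lambda such as one you might pass to .apply or .assign.
square = lambda x: x * x

try:
    pickle.dumps(square)
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by reference to an importable name,
    # and a lambda has no name that can be looked up again on load.
    pickled_ok = False

print(pickled_ok)
```

A pipeline that contains such a lambda therefore needs a serializer that stores the function body itself, which is what dill does.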
- static load_pipeline(path)[source]
Load a previously saved pipeline
N.B. dill is used for (de)serialization, because joblib/pickle doesn’t support things like lambda functions.
- Parameters
path (PathLike) – pathname, either string or pathlib.Path
- Returns
an extended sklearn Pipeline
- Return type
- model(model)[source]
Add a model estimator.
A model estimator is always the last step in the pipeline! Therefore this doesn’t return the Skippa object (self) but calls the .build method to return the pipeline.
- Parameters
model (BaseEstimator) – An sklearn estimator
- Returns
a built pipeline
- Return type
- onehot(cols, **kwargs)[source]
Skippa wrapper around sklearn’s OneHotEncoder
- Parameters
cols (ColumnSelector) – columns specification
**kwargs – optional kwargs for OneHotEncoder (although ‘sparse’ will always be set to False)
- Returns
just return itself again (so we can use piping)
- Return type
- ordinal_encode(cols, **kwargs)[source]
Wrapper around sklearn’s OrdinalEncoder
- Parameters
cols (ColumnSelector) – columns specification
**kwargs – optional kwargs for OrdinalEncoder
- Returns
just return itself again (so we can use piping)
- Return type
- pca(cols, **kwargs)[source]
Wrapper around sklearn.decomposition.PCA
- Parameters
cols (ColumnSelector) – columns expression
kwargs – any kwargs to be used by PCA’s __init__
- Returns
just return itself again (so we can use piping)
- Return type
- rename(*args, **kwargs)[source]
Rename certain columns.
Two ways to use this:
- a dict which defines a mapping {existing_col: new_col}
- a column selector and a renaming function (e.g. ['a', 'b', 'c'], lambda c: f'new_{c}')
It adds an XRenamer step, which wraps around pandas.rename
- Returns
just return itself again (so we can use piping)
- Return type
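Both forms correspond to things pandas’ own rename accepts. The sketch below uses plain pandas to show each one; note that in pandas the function form applies to every column, whereas the Skippa variant takes a column selector alongside the function.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# 1. A dict mapping {existing_col: new_col}; unlisted columns are kept.
by_mapping = df.rename(columns={"a": "alpha"})

# 2. A renaming function, called once per column name.
by_function = df.rename(columns=lambda c: f"new_{c}")

print(by_mapping.columns.tolist())
print(by_function.columns.tolist())
```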
- scale(cols, type='standard', **kwargs)[source]
Skippa wrapper around sklearn’s StandardScaler / MinMaxScaler
- Parameters
cols (ColumnSelector) – columns specification
type (str, optional) – One of [‘standard’, ‘minmax’]. Defaults to ‘standard’.
- Raises
ValueError – if an unknown/unsupported scaler type is passed
- Returns
just return itself again (so we can use piping)
- Return type
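For reference, the arithmetic behind the two scaler types can be written out with the standard library. This is only an illustration of what StandardScaler and MinMaxScaler compute with their default settings, not of how Skippa calls them.

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0]

# 'standard': subtract the mean and divide by the population standard
# deviation, giving a column with mean 0 and unit variance.
mu, sigma = mean(values), pstdev(values)
standard = [(v - mu) / sigma for v in values]

# 'minmax': map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

print(standard)
print(minmax)
```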
- select(cols)[source]
Apply a column selection
- Parameters
cols (ColumnSelector) – columns specification
- Returns
just return itself again (so we can use piping)
- Return type
- class skippa.pipeline.SkippaPipeline(steps, *, memory=None, verbose=False)[source]
Bases:
Pipeline
Extension of sklearn’s Pipeline object.
While the Skippa class is for creating pipelines, it is not a pipeline itself. Only after adding a model estimator step, or by calling .build explicitly, is a SkippaPipeline created. This is basically an sklearn Pipeline with some added methods.
- create_gradio_app(**kwargs)[source]
Create a Gradio app for model inspection.
- Parameters
**kwargs – kwargs received by Gradio’s Interface() initialisation
- Returns
Gradio Interface object -> call .launch to start the app
- Return type
gr.Interface
- get_data_profile()[source]
The DataProfile is used in the Gradio app.
The profile contains information on column names, their dtypes and value ranges.
- Raises
NotFittedError – If pipeline has not been fitted there is no data profile yet.
- Returns
Simple object containing necessary info
- Return type
- get_model()[source]
Get the model estimator part of the pipeline.
This gives you access to fitted attributes such as coefficients.
- Returns
fitted model
- Return type
BaseEstimator
- get_pipeline_params(params)[source]
Translate model param grid to Pipeline param grid.
For GridSearch over a Pipeline, you need to supply a param grid in the form {<stepname>__<paramname>: values}. Since it’s non-trivial to find the name of the model/estimator step in the Pipeline, this method auto-detects it and returns a new param grid in the right format.
- Parameters
params (Dict) – param grid with parameter names containing only the model parameter
- Returns
param grid with parameter names relating to both the pipeline step and the model parameter
- Return type
Dict
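The translation itself amounts to prefixing each parameter name with the step name. Below is a minimal sketch of that idea; `prefix_param_grid` and the step name `"model"` are hypothetical and stand in for the detection that get_pipeline_params does for you.

```python
from typing import Any, Dict


def prefix_param_grid(params: Dict[str, Any], step_name: str) -> Dict[str, Any]:
    """Prefix every model parameter name with '<step_name>__'."""
    return {f"{step_name}__{name}": values for name, values in params.items()}


# "model" is an assumed step name, used here only for illustration.
model_grid = {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
pipeline_grid = prefix_param_grid(model_grid, step_name="model")

print(pipeline_grid)
```

The resulting dict is in the form that scikit-learn's grid search expects when the estimator is a Pipeline.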
- steps: List[Any]
- test(X, up_to_step=-1)[source]
Test what happens to data in a pipeline.
This allows you to execute the pipeline up to and including the last step before modeling (or up to any other step) and inspect the resulting data.
- Parameters
X – data to run through the pipeline
up_to_step (int, optional) – the step up to which the pipeline is executed. Defaults to -1.
- Returns
the data as transformed by the executed steps
- Return type
pd.DataFrame
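A comparable inspection is possible on a plain sklearn Pipeline by slicing off the estimator. The sketch below shows that technique with scikit-learn only; whether .test is implemented this way is an assumption, and the step names are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X, y)

# Slicing a Pipeline returns a sub-pipeline that shares the fitted steps;
# [:-1] drops the estimator, so .transform yields the data as the model
# would receive it.
X_before_model = pipe[:-1].transform(X)

print(X_before_model.ravel())
```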
skippa.profile module
DataProfile is used for storing and retrieving metadata of data that is used in the pipeline. Typically the DataProfile is created during fitting of a pipeline. The profile is used by the Gradio app that can be created.
skippa.utils module
- skippa.utils.get_dummy_data(nrows=100, nfloat=4, nint=2, nchar=3, ndate=1, missing=True, binary_y=True)[source]
Create dummy data.
- Parameters
nrows (int, optional) – Number of total rows. Defaults to 100.
nfloat (int, optional) – Number of float columns. Defaults to 4.
nint (int, optional) – Number of integer columns. Defaults to 2.
nchar (int, optional) – Number of character/categorical columns. Defaults to 3.
ndate (int, optional) – Number of date columns. Defaults to 1.
binary_y (bool, optional) – If True, returns 0’s & 1’s for y, otherwise float values between 0 & 100
- Returns
A pandas DataFrame for features and a numpy array for labels
- Return type
Tuple[pd.DataFrame, np.ndarray]
Module contents
Top-level package for skippa.
The pipeline module defines the main Skippa methods. The transformers subpackage contains various transformers used in the pipeline.