Published by jpmml on March 9, 2022

JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Overview

Functionality:
- Three times more supported Python packages, transformers and estimators than all the competitors combined!
- Thorough collection, analysis and encoding of feature information:
  - Names.
  - Data and operational types.
  - Valid, invalid and missing value spaces.
  - Descriptive statistics.
- Pipeline extensions:
  - Pruning.
  - Decision engineering (prediction post-processing).
  - Model verification.
- Conversion options.
Extensibility:
- Rich Java APIs for developing custom converters.
- Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
- Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM and JPMML-XGBoost.
Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.

Supported packages

Scikit-Learn

Examples: main.py

Clustering:
- cluster.KMeans
- cluster.MiniBatchKMeans
Composite estimators:
- compose.ColumnTransformer
- compose.TransformedTargetRegressor
Matrix decomposition:
- decomposition.PCA
- decomposition.IncrementalPCA
- decomposition.TruncatedSVD
Discriminant analysis:
- discriminant_analysis.LinearDiscriminantAnalysis
Dummies:
- dummy.DummyClassifier
- dummy.DummyRegressor
Ensemble methods:
- ensemble.AdaBoostRegressor
- ensemble.BaggingClassifier
- ensemble.BaggingRegressor
- ensemble.ExtraTreesClassifier
- ensemble.ExtraTreesRegressor
- ensemble.GradientBoostingClassifier
- ensemble.GradientBoostingRegressor
- ensemble.HistGradientBoostingClassifier
- ensemble.HistGradientBoostingRegressor
- ensemble.IsolationForest
- ensemble.RandomForestClassifier
- ensemble.RandomForestRegressor
- ensemble.StackingClassifier
- ensemble.StackingRegressor
- ensemble.VotingClassifier
- ensemble.VotingRegressor
Feature extraction:
- feature_extraction.DictVectorizer
- feature_extraction.text.CountVectorizer
- feature_extraction.text.TfidfVectorizer
Feature selection:
- feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
- feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
- feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
- feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
- feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
Impute:
- impute.MissingIndicator
- impute.SimpleImputer
Isotonic regression:
- isotonic.IsotonicRegression
Generalized linear models:
- linear_model.ARDRegression
- linear_model.BayesianRidge
- linear_model.ElasticNet
- linear_model.ElasticNetCV
- linear_model.GammaRegressor
- linear_model.HuberRegressor
- linear_model.Lars
- linear_model.LarsCV
- linear_model.Lasso
- linear_model.LassoCV
- linear_model.LassoLars
- linear_model.LassoLarsCV
- linear_model.LinearRegression
- linear_model.LogisticRegression
- linear_model.LogisticRegressionCV
- linear_model.OrthogonalMatchingPursuit
- linear_model.OrthogonalMatchingPursuitCV
- linear_model.PoissonRegressor
- linear_model.Ridge
- linear_model.RidgeCV
- linear_model.RidgeClassifier
- linear_model.RidgeClassifierCV
- linear_model.SGDClassifier
- linear_model.SGDRegressor
- linear_model.TheilSenRegressor
Model selection:
- model_selection.GridSearchCV
- model_selection.RandomizedSearchCV
Multiclass classification:
- multiclass.OneVsRestClassifier
Naive Bayes:
- naive_bayes.GaussianNB
Nearest neighbors:
- neighbors.KNeighborsClassifier
- neighbors.KNeighborsRegressor
Pipelines:
- pipeline.FeatureUnion
- pipeline.Pipeline
Neural network models:
- neural_network.MLPClassifier
- neural_network.MLPRegressor
Preprocessing and normalization:
- preprocessing.Binarizer
- preprocessing.FunctionTransformer
- preprocessing.Imputer
- preprocessing.KBinsDiscretizer
- preprocessing.LabelBinarizer
- preprocessing.LabelEncoder
- preprocessing.MaxAbsScaler
- preprocessing.MinMaxScaler
- preprocessing.OneHotEncoder
- preprocessing.OrdinalEncoder
- preprocessing.PolynomialFeatures
- preprocessing.PowerTransformer
- preprocessing.RobustScaler
- preprocessing.StandardScaler
Support vector machines:
- svm.LinearSVC
- svm.LinearSVR
- svm.OneClassSVM
- svm.SVC
- svm.NuSVC
- svm.SVR
- svm.NuSVR
Decision trees:
- tree.DecisionTreeClassifier
- tree.DecisionTreeRegressor
- tree.ExtraTreeClassifier
- tree.ExtraTreeRegressor

Category Encoders

Examples: extensions/category_encoders.py

H2O.ai

Examples: main-h2o.py

Imbalanced-Learn

Examples: extensions/imblearn.py

Under-sampling methods:
Over-sampling methods:
Combination of over- and under-sampling methods:
- imblearn.combine.SMOTEENN
- imblearn.combine.SMOTETomek
Ensemble methods:
- imblearn.ensemble.BalancedBaggingClassifier
- imblearn.ensemble.BalancedRandomForestClassifier
Pipeline:
- imblearn.pipeline.Pipeline

LightGBM

Examples: main-lightgbm.py

Mlxtend

Examples: N/A

- mlxtend.preprocessing.DenseTransformer

Scikit-Lego

Examples: extensions/sklego.py

- sklego.meta.EstimatorTransformer
  - Predict functions apply, decision_function and predict.
- sklego.preprocessing.IdentityTransformer

SkLearn2PMML

Examples: main.py and extensions/sklearn2pmml.py

Helpers:
- sklearn2pmml.EstimatorProxy
- sklearn2pmml.SelectorProxy
Feature specification and decoration:
- sklearn2pmml.decoration.Alias
- sklearn2pmml.decoration.CategoricalDomain
- sklearn2pmml.decoration.ContinuousDomain
- sklearn2pmml.decoration.ContinuousDomainEraser
- sklearn2pmml.decoration.DateDomain
- sklearn2pmml.decoration.DateTimeDomain
- sklearn2pmml.decoration.DiscreteDomainEraser
- sklearn2pmml.decoration.MultiDomain
- sklearn2pmml.decoration.OrdinalDomain
Ensemble methods:
- sklearn2pmml.ensemble.GBDTLMRegressor
  - The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
  - The LM side: A Scikit-Learn linear regressor (eg. ElasticNet, LinearRegression, SGDRegressor).
- sklearn2pmml.ensemble.GBDTLRClassifier
  - The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
  - The LR side: A Scikit-Learn binary linear classifier (eg. LinearSVC, LogisticRegression, SGDClassifier).
- sklearn2pmml.ensemble.SelectFirstClassifier
- sklearn2pmml.ensemble.SelectFirstRegressor
Feature selection:
- sklearn2pmml.feature_selection.SelectUnique
Neural networks:
- sklearn2pmml.neural_network.MLPTransformer
Pipeline:
- sklearn2pmml.pipeline.PMMLPipeline
Postprocessing:
- sklearn2pmml.postprocessing.BusinessDecisionTransformer
Preprocessing:
- sklearn2pmml.preprocessing.Aggregator
- sklearn2pmml.preprocessing.CastTransformer
- sklearn2pmml.preprocessing.ConcatTransformer
- sklearn2pmml.preprocessing.CutTransformer
- sklearn2pmml.preprocessing.DaysSinceYearTransformer
- sklearn2pmml.preprocessing.ExpressionTransformer
  - Ternary conditional expression <expression_true> if <condition> else <expression_false>.
  - Array indexing expressions X[<column index>] and X[<column name>].
  - String concatenation expressions.
  - String slicing expressions <str>[<start>:<stop>].
  - Arithmetic operators +, -, *, / and %.
  - Identity comparison operators is None and is not None.
  - Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
  - Logical operators and, or and not.
  - Numpy function numpy.where.
  - Numpy universal functions.
  - Pandas functions pandas.isnull and pandas.notnull.
  - Scipy functions scipy.special.expit and scipy.special.logit.
  - String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
  - String length function len(<str>).
- sklearn2pmml.preprocessing.FilterLookupTransformer
- sklearn2pmml.preprocessing.LookupTransformer
- sklearn2pmml.preprocessing.MatchesTransformer
- sklearn2pmml.preprocessing.MultiLookupTransformer
- sklearn2pmml.preprocessing.PMMLLabelBinarizer
- sklearn2pmml.preprocessing.PMMLLabelEncoder
- sklearn2pmml.preprocessing.PowerFunctionTransformer
- sklearn2pmml.preprocessing.ReplaceTransformer
- sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
- sklearn2pmml.preprocessing.SecondsSinceYearTransformer
- sklearn2pmml.preprocessing.StringNormalizer
- sklearn2pmml.preprocessing.SubstringTransformer
- sklearn2pmml.preprocessing.WordCountTransformer
- sklearn2pmml.preprocessing.h2o.H2OFrameCreator
- sklearn2pmml.preprocessing.scipy.BSplineTransformer
- sklearn2pmml.util.Reshaper
Rule sets:
- sklearn2pmml.ruleset.RuleSetClassifier
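
As the ExpressionTransformer entries above suggest, its expressions are ordinary Python expressions over a row named X. The following plain-Python sketch does not use sklearn2pmml at all; it only evaluates a few expressions of the listed kinds against one hypothetical row (a dict keyed by column name) to illustrate the syntax. The helper evaluate and the sample row are assumptions made for illustration.

```python
def evaluate(expr, row):
    """Evaluate a Python expression string with the row bound to the name X."""
    # Builtins are disabled, except for len, which the list above mentions explicitly.
    return eval(expr, {"__builtins__": {}, "len": len}, {"X": row})

# Hypothetical row; the column names are illustrative only.
row = {"Sepal.Length": 5.1, "Sepal.Width": None, "Species": " Setosa "}

# Ternary conditional combined with an identity comparison.
print(evaluate("X['Sepal.Width'] if X['Sepal.Width'] is not None else -1", row))  # -1
# Arithmetic and comparison operators.
print(evaluate("X['Sepal.Length'] * 2 >= 10", row))  # True
# String functions and slicing.
print(evaluate("X['Species'].strip().lower()[0:3]", row))  # set
# String length function.
print(evaluate("len(X['Species'].strip())", row))  # 6
# Membership comparison against a list.
print(evaluate("X['Species'].strip() in ['Setosa', 'Virginica']", row))  # True
```

Whether a particular expression can actually be converted to PMML is decided by the converter, so this sketch is useful only for checking the Python-side result of an expression.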

Sklearn-Pandas

Examples: main.py

- sklearn_pandas.CategoricalImputer
- sklearn_pandas.DataFrameMapper

TPOT

Examples: extensions/tpot.py

- tpot.builtins.stacking_estimator.StackingEstimator

XGBoost

Examples: main-xgboost.py

Prerequisites

The Python side of operations

Python 2.7, 3.4 or newer.
scikit-learn 0.16.0 or newer.
sklearn-pandas 0.0.10 or newer.
sklearn2pmml 0.14.0 or newer.

Validating Python installation:

import sklearn, sklearn_pandas, sklearn2pmml

try:
    import joblib  # standalone package
except ImportError:
    from sklearn.externals import joblib  # bundled copy, removed in scikit-learn 0.23

print(sklearn.__version__)
print(joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
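
The minimum versions listed above can also be checked programmatically. The following is a minimal sketch, not part of JPMML-SkLearn: the helpers version_tuple, at_least and check_minimums are made up for illustration, and the version parsing is deliberately naive (it keeps only the leading numeric components of a version string).

```python
import importlib
import re

# Minimum versions copied from the prerequisites list above.
MINIMUMS = {
    "sklearn": "0.16.0",
    "sklearn_pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}

def version_tuple(version):
    """Turn '0.24.2' into (0, 24, 2); stop at the first non-numeric component."""
    parts = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def at_least(installed, minimum):
    """Compare two version strings after padding the shorter tuple with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

def check_minimums(minimums=MINIMUMS):
    """Return a {module name: status string} report for every listed module."""
    report = {}
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
            continue
        installed = getattr(module, "__version__", None)
        if installed is None:
            report[name] = "unknown version"
        elif at_least(installed, minimum):
            report[name] = "ok (" + installed + ")"
        else:
            report[name] = "too old (" + installed + " < " + minimum + ")"
    return report

if __name__ == "__main__":
    for name, status in check_minimums().items():
        print(name, status)
```

Pre-release suffixes such as "rc1" or "dev0" are ignored by this comparison, so treat the report as a quick sanity check rather than a strict validation.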

The JPMML-SkLearn side of operations

Java 1.8 or newer.

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

1. Use Python to train a model.
2. Serialize the model in pickle data format to a file in a local filesystem.
3. Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information in a pickle data format-compatible manner:

classifier.pmml_feature_importances_ = classifier.feature_importances_

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))

Storing the fitted PMMLPipeline object in pickle data format:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; older releases can import it from there instead

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
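
The conversion step can also be launched from a Python script. Below is a minimal sketch using the standard subprocess module. The function names build_convert_command and convert are hypothetical helpers, not part of JPMML-SkLearn; the JAR path and the --pkl-input and --pmml-output options are the ones shown in the command above. The sketch assumes that a java executable is available on the PATH and that it is run from the project root directory.

```python
import subprocess

# Path of the executable uber-JAR produced by the Maven build (see Installation).
DEFAULT_JAR = "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"

def build_convert_command(pkl_path, pmml_path, jar_path=DEFAULT_JAR):
    """Assemble the argument list for the command-line converter application."""
    return [
        "java", "-jar", jar_path,
        "--pkl-input", pkl_path,
        "--pmml-output", pmml_path,
    ]

def convert(pkl_path, pmml_path, jar_path=DEFAULT_JAR):
    """Run the converter; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_convert_command(pkl_path, pmml_path, jar_path))
```

Calling convert("pipeline.pkl.z", "pipeline.pmml") would then run the same command as above. Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.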

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help

Documentation

Up-to-date:

Slightly outdated:

Converting Scikit-Learn to PMML

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io
