JPMML-SkLearn 
Java library and command-line application for converting Scikit-Learn pipelines to PMML.
Features
Overview
- Functionality:
  - Three times more supported Python packages, transformers and estimators than all the competitors combined!
  - Thorough collection, analysis and encoding of feature information:
    - Names.
    - Data and operational types.
    - Valid, invalid and missing value spaces.
    - Descriptive statistics.
  - Pipeline extensions:
    - Pruning.
    - Decision engineering (prediction post-processing).
    - Model verification.
    - Conversion options.
- Extensibility:
  - Rich Java APIs for developing custom converters.
  - Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
  - Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM and JPMML-XGBoost.
- Production quality:
  - Complete test coverage.
  - Fully compliant with the JPMML-Evaluator library.
Supported packages
Scikit-Learn
Examples: main.py
- Clustering:
- Composite estimators:
- Matrix decomposition:
- Discriminant analysis:
- Dummies:
- Ensemble methods:
  - ensemble.AdaBoostRegressor
  - ensemble.BaggingClassifier
  - ensemble.BaggingRegressor
  - ensemble.ExtraTreesClassifier
  - ensemble.ExtraTreesRegressor
  - ensemble.GradientBoostingClassifier
  - ensemble.GradientBoostingRegressor
  - ensemble.HistGradientBoostingClassifier
  - ensemble.HistGradientBoostingRegressor
  - ensemble.IsolationForest
  - ensemble.RandomForestClassifier
  - ensemble.RandomForestRegressor
  - ensemble.StackingClassifier
  - ensemble.StackingRegressor
  - ensemble.VotingClassifier
  - ensemble.VotingRegressor
- Feature extraction:
- Feature selection:
  - feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
  - feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
  - feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
  - feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
- Isotonic regression:
- Generalized linear models:
  - linear_model.ARDRegression
  - linear_model.BayesianRidge
  - linear_model.ElasticNet
  - linear_model.ElasticNetCV
  - linear_model.GammaRegressor
  - linear_model.HuberRegressor
  - linear_model.Lars
  - linear_model.LarsCV
  - linear_model.Lasso
  - linear_model.LassoCV
  - linear_model.LassoLars
  - linear_model.LassoLarsCV
  - linear_model.LinearRegression
  - linear_model.LogisticRegression
  - linear_model.LogisticRegressionCV
  - linear_model.OrthogonalMatchingPursuit
  - linear_model.OrthogonalMatchingPursuitCV
  - linear_model.PoissonRegressor
  - linear_model.Ridge
  - linear_model.RidgeCV
  - linear_model.RidgeClassifier
  - linear_model.RidgeClassifierCV
  - linear_model.SGDClassifier
  - linear_model.SGDRegressor
  - linear_model.TheilSenRegressor
- Model selection:
- Multiclass classification:
- Naive Bayes:
- Nearest neighbors:
- Pipelines:
- Neural network models:
- Preprocessing and normalization:
  - preprocessing.Binarizer
  - preprocessing.FunctionTransformer
  - preprocessing.Imputer
  - preprocessing.KBinsDiscretizer
  - preprocessing.LabelBinarizer
  - preprocessing.LabelEncoder
  - preprocessing.MaxAbsScaler
  - preprocessing.MinMaxScaler
  - preprocessing.OneHotEncoder
  - preprocessing.OrdinalEncoder
  - preprocessing.PolynomialFeatures
  - preprocessing.PowerTransformer
  - preprocessing.RobustScaler
  - preprocessing.StandardScaler
- Support vector machines:
- Decision trees:
Category Encoders
Examples: extensions/category_encoders.py
H2O.ai
Examples: main-h2o.py
- h2o.estimators.gbm.H2OGradientBoostingEstimator
- h2o.estimators.glm.H2OGeneralizedLinearEstimator
- h2o.estimators.isolation_forest.H2OIsolationForestEstimator
- h2o.estimators.random_forest.H2ORandomForestEstimator
- h2o.estimators.stackedensemble.H2OStackedEnsembleEstimator
- h2o.estimators.xgboost.H2OXGBoostEstimator
Imbalanced-Learn
Examples: extensions/imblearn.py
- Under-sampling methods:
  - imblearn.under_sampling.AllKNN
  - imblearn.under_sampling.ClusterCentroids
  - imblearn.under_sampling.CondensedNearestNeighbour
  - imblearn.under_sampling.EditedNearestNeighbours
  - imblearn.under_sampling.InstanceHardnessThreshold
  - imblearn.under_sampling.NearMiss
  - imblearn.under_sampling.NeighbourhoodCleaningRule
  - imblearn.under_sampling.OneSidedSelection
  - imblearn.under_sampling.RandomUnderSampler
  - imblearn.under_sampling.RepeatedEditedNearestNeighbours
  - imblearn.under_sampling.TomekLinks
- Over-sampling methods:
- Combination of over- and under-sampling methods:
- Ensemble methods:
- Pipeline:
LightGBM
Examples: main-lightgbm.py
Scikit-Lego
Examples: extensions/sklego.py
- sklego.meta.EstimatorTransformer
  - Predict functions apply, decision_function, predict.
- sklego.preprocessing.IdentityTransformer
SkLearn2PMML
Examples: main.py and extensions/sklearn2pmml.py
- Helpers:
  - sklearn2pmml.EstimatorProxy
  - sklearn2pmml.SelectorProxy
- Feature specification and decoration:
  - sklearn2pmml.decoration.Alias
  - sklearn2pmml.decoration.CategoricalDomain
  - sklearn2pmml.decoration.ContinuousDomain
  - sklearn2pmml.decoration.ContinuousDomainEraser
  - sklearn2pmml.decoration.DateDomain
  - sklearn2pmml.decoration.DateTimeDomain
  - sklearn2pmml.decoration.DiscreteDomainEraser
  - sklearn2pmml.decoration.MultiDomain
  - sklearn2pmml.decoration.OrdinalDomain
- Ensemble methods:
  - sklearn2pmml.ensemble.GBDTLMRegressor
    - The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
    - The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
  - sklearn2pmml.ensemble.GBDTLRClassifier
    - The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
    - The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
  - sklearn2pmml.ensemble.SelectFirstClassifier
  - sklearn2pmml.ensemble.SelectFirstRegressor
- Feature selection:
sklearn2pmml.feature_selection.SelectUnique
- Neural networks:
sklearn2pmml.neural_network.MLPTransformer
- Pipeline:
sklearn2pmml.pipeline.PMMLPipeline
- Postprocessing:
sklearn2pmml.postprocessing.BusinessDecisionTransformer
- Preprocessing:
  - sklearn2pmml.preprocessing.Aggregator
  - sklearn2pmml.preprocessing.CastTransformer
  - sklearn2pmml.preprocessing.ConcatTransformer
  - sklearn2pmml.preprocessing.CutTransformer
  - sklearn2pmml.preprocessing.DaysSinceYearTransformer
  - sklearn2pmml.preprocessing.ExpressionTransformer
    - Ternary conditional expression <expression_true> if <condition> else <expression_false>.
    - Array indexing expressions X[<column index>] and X[<column name>].
    - String concatenation expressions.
    - String slicing expressions <str>[<start>:<stop>].
    - Arithmetic operators +, -, *, / and %.
    - Identity comparison operators is None and is not None.
    - Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
    - Logical operators and, or and not.
    - Numpy function numpy.where.
    - Numpy universal functions.
    - Pandas functions pandas.isnull and pandas.notnull.
    - Scipy functions scipy.special.expit and scipy.special.logit.
    - String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
    - String length function len(<str>).
  - sklearn2pmml.preprocessing.FilterLookupTransformer
  - sklearn2pmml.preprocessing.LookupTransformer
  - sklearn2pmml.preprocessing.MatchesTransformer
  - sklearn2pmml.preprocessing.MultiLookupTransformer
  - sklearn2pmml.preprocessing.PMMLLabelBinarizer
  - sklearn2pmml.preprocessing.PMMLLabelEncoder
  - sklearn2pmml.preprocessing.PowerFunctionTransformer
  - sklearn2pmml.preprocessing.ReplaceTransformer
  - sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
  - sklearn2pmml.preprocessing.SecondsSinceYearTransformer
  - sklearn2pmml.preprocessing.StringNormalizer
  - sklearn2pmml.preprocessing.SubstringTransformer
  - sklearn2pmml.preprocessing.WordCountTransformer
  - sklearn2pmml.preprocessing.h2o.H2OFrameCreator
  - sklearn2pmml.preprocessing.scipy.BSplineTransformer
  - sklearn2pmml.util.Reshaper
- Rule sets:
sklearn2pmml.ruleset.RuleSetClassifier
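The expression forms listed for sklearn2pmml.preprocessing.ExpressionTransformer are ordinary Python syntax, so their meaning can be previewed without any library installed. The sketch below is a plain-Python illustration only: evaluate_row is a hypothetical helper written for this example, the sample rows are made up, and nothing here reflects how sklearn2pmml itself parses or translates expressions.

```python
# Plain-Python preview of a few expression forms from the list above.
# Assumption: evaluate_row is a hypothetical helper for this sketch; it is
# not part of sklearn2pmml and does not show how the library works internally.

def evaluate_row(expr, row):
    """Evaluate an expression string with X bound to a single data row."""
    allowed_builtins = {"len": len}
    return eval(expr, {"__builtins__": allowed_builtins}, {"X": row})

# One row addressed by column name (dict), one addressed by column index (list).
named_row = {"Species": " Iris-Setosa ", "Petal.Width": None}
indexed_row = [7, 2]

# Ternary conditional combined with an identity comparison.
filled = evaluate_row('X["Petal.Width"] if X["Petal.Width"] is not None else -1', named_row)

# String functions followed by string slicing.
prefix = evaluate_row('X["Species"].strip().lower()[0:4]', named_row)

# Arithmetic, comparison and logical operators on indexed columns.
remainder = evaluate_row("X[0] % X[1]", indexed_row)
in_range = evaluate_row("X[0] >= 5 and X[1] in [1, 2, 3]", indexed_row)

# String length function.
length = evaluate_row('len(X["Species"].strip())', named_row)

print(filled, prefix, remainder, in_range, length)
```

In a real pipeline the same expression strings would be handed to the transformer rather than to eval; the helper exists only to make the syntax concrete.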
XGBoost
Examples: main-xgboost.py
Prerequisites
The Python side of operations
- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.
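The minimum versions above can also be checked programmatically. The following is a minimal sketch using only the standard library: the helper names (version_tuple, meets_minimum, check_installed) are hypothetical, the installed versions in the example are made up, and the parser only looks at leading numeric components, so it is not a full implementation of Python version ordering.

```python
import re

# Minimum versions copied from the prerequisites list above.
MINIMUM_VERSIONS = {
    "scikit-learn": "0.16.0",
    "sklearn-pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}

def version_tuple(version, width=3):
    """Parse the leading numeric components of a version string, zero-padded to `width`."""
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts + [0] * (width - len(parts)))

def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)

def check_installed(installed_versions):
    """Map each package name to True/False, given a dict of installed version strings."""
    return {
        name: meets_minimum(installed_versions[name], minimum)
        for name, minimum in MINIMUM_VERSIONS.items()
    }

# Example with made-up version numbers. A plain string comparison would
# wrongly rank "0.0.9" above "0.0.10"; comparing integer tuples does not.
report = check_installed({
    "scikit-learn": "0.23.2",
    "sklearn-pandas": "0.0.9",
    "sklearn2pmml": "0.14.0",
})
print(report)
```

In practice the installed version strings would come from each package's __version__ attribute, as printed by the validation snippet.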
Validating Python installation:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import the standalone joblib package instead
import joblib, sklearn, sklearn_pandas, sklearn2pmml
print(sklearn.__version__)
print(joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
The JPMML-SkLearn side of operations
- Java 1.8 or newer.
Installation
Enter the project root directory and build using Apache Maven:
mvn clean install
The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.
Usage
A typical workflow can be summarized as follows:
- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.
The Python side of operations
Loading data to a pandas.DataFrame object:
import pandas
df = pandas.read_csv("Iris.csv")
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
column_preprocessor = DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy
table_preprocessor = Pipeline([
("pca", PCA(n_components = 3)),
("selector", SelectorProxy(SelectKBest(k = 2)))
])
Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
Third, creating an Estimator object:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(min_samples_leaf = 5)
Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
("columns", column_preprocessor),
("table", table_preprocessor),
("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)
Recording feature importance information in a pickle data format-compatible manner:
classifier.pmml_feature_importances_ = classifier.feature_importances_
Embedding model verification data:
pipeline.verify(iris_X.sample(n = 15))
Storing the fitted PMMLPipeline object in pickle data format:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import the standalone joblib package instead
import joblib
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.
The JPMML-SkLearn side of operations
Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
Getting help:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help
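When the conversion is driven from a Python script rather than typed by hand, the same command line can be assembled as an argument list. The sketch below is one possible way to do that: build_convert_command and DEFAULT_JAR are hypothetical names introduced here, the JAR path is the one produced by the build step above, and only the --pkl-input and --pmml-output options shown in this README are used.

```python
import shlex

# Path of the executable uber-JAR produced by the Maven build; adjust to your checkout.
DEFAULT_JAR = "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"

def build_convert_command(pkl_input, pmml_output, jar=DEFAULT_JAR):
    """Assemble the converter invocation as an argument list (hypothetical helper)."""
    return ["java", "-jar", jar, "--pkl-input", pkl_input, "--pmml-output", pmml_output]

command = build_convert_command("pipeline.pkl.z", "pipeline.pmml")

# Print a shell-safe rendering of the command for logging or copy-pasting.
print(" ".join(shlex.quote(argument) for argument in command))

# To actually run the conversion (requires Java and the built JAR):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.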
Documentation
Up-to-date:
- Benchmarking Scikit-Learn against JPMML-Evaluator in Java and Python environments
- Extending Scikit-Learn with outlier detector transformer type
- Analyzing Scikit-Learn feature importances via PMML
- Training Scikit-Learn based TF(-IDF) plus XGBoost pipelines
- Converting Scikit-Learn based TF(-IDF) pipelines to PMML documents
- Converting Scikit-Learn based Imbalanced-Learn (imblearn) pipelines to PMML documents
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Converting logistic regression models to PMML documents
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn hyperparameter-tuned pipelines to PMML documents
- Extending Scikit-Learn with GBDT plus LR ensemble (GBDT+LR) model type
- Converting Scikit-Learn based TPOT automated machine learning (AutoML) pipelines to PMML documents
- Converting Scikit-Learn based LightGBM pipelines to PMML documents
- Extending Scikit-Learn with business rules (BR) model type
Slightly outdated:
License
JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.
Additional information
JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io