sk-dist is a Python package for machine learning built on top of
scikit-learn and is distributed under the Apache 2.0 software license.
The sk-dist module can be thought of as "distributed scikit-learn": its
core functionality is to extend scikit-learn's built-in joblib
parallelization of meta-estimator training to Spark. A popular use case
is the parallelization of grid search.
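A minimal sketch of the grid search use case follows. It assumes the DistGridSearchCV class is importable from skdist.distribute.search and accepts a sparkContext through an sc keyword, as in the project's published example; check both against the installed version. The names param_grid, count_fits and run_search, and the grid values, are illustrative only. The helper count_fits shows how many independent fits an exhaustive search produces, which is the work that can be spread across a cluster.

```python
from itertools import product

# Hypothetical search space for a support vector classifier.
param_grid = {
    "C": [0.1, 1.0, 10.0, 50.0],
    "gamma": ["scale", "auto", 0.01],
    "kernel": ["rbf", "poly"],
}


def count_fits(grid, cv):
    """Return the number of independent (candidate, fold) fits."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates * cv


def run_search(X, y, sc=None, cv=5):
    """Fit a distributed grid search; `sc` is a sparkContext (assumed API)."""
    # Imported lazily so this module loads without scikit-learn or sk-dist.
    from sklearn.svm import SVC
    from skdist.distribute.search import DistGridSearchCV

    search = DistGridSearchCV(
        SVC(), param_grid, sc=sc, cv=cv, scoring="f1_weighted"
    )
    search.fit(X, y)
    return search


n_fits = count_fits(param_grid, cv=5)  # 24 candidates x 5 folds
```

With a running SparkSession named spark, the call would look like run_search(X, y, sc=spark.sparkContext), after which the fitted object can be inspected or pickled like a scikit-learn search object.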
Check out the blog post for more information on the motivation and use
cases of sk-dist.
sk-dist parallelizes the training of scikit-learn meta-estimators with
PySpark. This allows distributed training of these estimators without
any constraint on the physical resources of any one machine. In all
cases, Spark artifacts are automatically stripped from the fitted
estimator. These estimators can then be pickled and un-pickled for
prediction tasks, operating identically at predict time to their
scikit-learn counterparts. Supported tasks are:

sk-dist provides a prediction module which builds vectorized UDFs for
PySpark DataFrames using fitted scikit-learn estimators. This
distributes the predict and predict_proba methods of scikit-learn
estimators, enabling large-scale prediction with scikit-learn.

sk-dist provides a flexible feature encoding utility called Encoderizer
which encodes mixed-type feature spaces using either default behavior or
user-defined settings. It is particularly aimed at text features, but it
also handles numeric and dictionary-type feature spaces.

sk-dist requires:

Versions of numpy, scipy and joblib that are compatible with any
supported version of scikit-learn should be sufficient for sk-dist.
sk-dist is not supported with Python 2.

Most sk-dist functionality requires a Spark installation as well as
PySpark. Some functionality can run without Spark, so Spark-related
dependencies are not required. The connection between sk-dist and Spark
relies solely on a sparkContext passed as an argument to various sk-dist
classes upon instantiation.
A variety of Spark configurations and setups will work. It is left up to
the user to configure their own Spark setup. The testing suite runs
Spark 2.4 and Spark 3.0, though any Spark 2.0+ versions are expected to
work.
An additional Spark-related dependency is pyarrow, which is used only
for skdist.predict functions. These use vectorized pandas UDFs, which
require pyarrow>=0.8.0 and were tested with pyarrow==0.16.0.
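To picture what a vectorized prediction UDF does, the sketch below wraps a fitted estimator's predict method in a plain PySpark pandas UDF. It is not sk-dist's implementation: predict_batch and make_predict_udf are hypothetical helper names, and the sketch assumes pyspark (2.4 or 3.x), pandas and pyarrow are installed wherever the UDF executes.

```python
def predict_batch(model, columns, method="predict"):
    """Score one batch given column-wise inputs; return a list of outputs."""
    # Turn column-wise data (one sequence per feature) into row tuples.
    rows = list(zip(*columns))
    return list(getattr(model, method)(rows))


def make_predict_udf(model, return_type="double"):
    """Build a scalar pandas UDF around a fitted estimator (sketch only)."""
    # Imported lazily so this module loads without pandas or PySpark.
    import pandas as pd
    from pyspark.sql.functions import PandasUDFType, pandas_udf

    def _score(*cols):
        # Each argument is a pandas Series holding one feature column.
        return pd.Series(predict_batch(model, cols))

    # The model is captured in the closure and shipped to the executors.
    return pandas_udf(
        _score, returnType=return_type, functionType=PandasUDFType.SCALAR
    )
```

Given a DataFrame df and a list of feature column names feature_cols, usage would be along the lines of df.withColumn("prediction", make_predict_udf(model)(*feature_cols)). Distributing predict_proba in the same way would need an array return type rather than "double".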
Depending on the Spark version, it may be necessary to set
spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the Spark
configuration.
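Spark 3.0 renamed this setting to spark.sql.execution.arrow.pyspark.enabled and kept the older key as a deprecated alias. A small helper can pick the key from the running version; arrow_conf_key and enable_arrow are hypothetical names used only for this sketch, and spark is assumed to be an existing SparkSession.

```python
def arrow_conf_key(spark_version):
    """Return the Arrow configuration key for a Spark version string."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def enable_arrow(spark):
    """Enable Arrow-based columnar transfers on an existing SparkSession."""
    spark.conf.set(arrow_conf_key(spark.version), "true")
```

Calling enable_arrow(spark) once after creating the session is enough; the setting applies to the session, not to individual DataFrames.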
The easiest way to install sk-dist is with pip:
pip install --upgrade sk-dist
You can also download the source code:
git clone https://github.com/Ibotta/sk-dist.git
With pytest installed, you can run tests locally:
pytest sk-dist
The package contains numerous examples on how to use sk-dist in
practice. Examples of note are:
sk-dist has been tested with a number of popular gradient boosting
packages that conform to the scikit-learn API. This includes xgboost and
catboost. These will need to be installed in addition to sk-dist on all
nodes of the Spark cluster via a node bootstrap script. Version
compatibility is left up to the user.
Support for lightgbm is not guaranteed, as it requires additional
installations on all nodes of the Spark cluster. This may work given
proper installation but has not been tested with sk-dist.
The project was started at Ibotta Inc. on the machine learning team and open sourced in 2019.
It is currently maintained by the machine learning team at Ibotta.
Special thanks to those who contributed to sk-dist while it was
initially in development at Ibotta:
Thanks to James Foley for logo artwork.