Following initial work on skranger, I came across a fork of the ranger project called grf. The grf fork also provides fast C++ implementations of random forest predictors, with an accompanying R package.
grf supports what is known as honest estimation, in which different subsets of the training data are used for determining the splits of each tree node and then populating the leaf nodes of the trees. Training honest forests helps reduce bias in estimation. The grf docs provide a good explanation of how honesty is implemented.
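To make the idea concrete, here is a minimal sketch of honesty for a single depth-one tree, written in plain Python. It is not how grf implements it: the helper names `best_split` and `fit_honest_stump` are made up for illustration, the data is synthetic, and the even 50/50 split between the two subsamples is an assumption. The point is only that one half of the data chooses the split, and the other half supplies the leaf values.

```python
import random
from statistics import mean


def best_split(xs, ys):
    """Return the threshold on a single feature that minimizes squared error."""
    best_threshold, best_sse = None, float("inf")
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold


def fit_honest_stump(xs, ys, seed=0):
    """Fit a depth-1 tree: one half picks the split, the other fills the leaves."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    half = len(indices) // 2
    split_idx, leaf_idx = indices[:half], indices[half:]

    # Splitting subsample: used only to choose the threshold.
    threshold = best_split([xs[i] for i in split_idx], [ys[i] for i in split_idx])

    # Estimation subsample: used only to compute the leaf values.
    left = [ys[i] for i in leaf_idx if xs[i] <= threshold]
    right = [ys[i] for i in leaf_idx if xs[i] > threshold]
    fallback = mean(ys[i] for i in leaf_idx)  # used if a leaf receives no rows
    left_value = mean(left) if left else fallback
    right_value = mean(right) if right else fallback
    return threshold, left_value, right_value


# Synthetic data: a noisy step function with a jump at x = 0.5.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [(1.0 if x > 0.5 else 0.0) + data_rng.gauss(0, 0.1) for x in xs]

threshold, left_value, right_value = fit_honest_stump(xs, ys)
print(threshold, left_value, right_value)
```

Because the leaf values come from rows that played no part in choosing the threshold, they are not tuned to the same noise that shaped the split, which is the source of the bias reduction.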
Since grf is a fork of ranger, it was relatively easy to build bindings by reusing some code from skranger. In fact, much of the approach taken for skgrf is similar to that of skranger. Like skranger, these bindings provide an interface compatible with scikit-learn.
While ranger provides predictors for regression, classification, and survival estimation, grf provides additional estimators for local linear regression, instrumental regression, and causal regression.
Here is a simple example using the standard regressor:
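# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skgrf.ensemble import GRFRegressor

# Load the data and hold out a test set
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the forest and predict on the held-out rows
rfr = GRFRegressor()
rfr.fit(X_train, y_train)
predictions = rfr.predict(X_test)
print(predictions)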
# [31.81349144 32.2734354 16.51560285 11.90284392 39.69744341 21.30367911
# 19.52732937 15.82126562 26.49528961 11.27220097 16.02447197 20.01224404
# ...
# 20.70674263 17.09041289 12.89671205 20.79787926 21.18317924 25.45553279
# 20.82455595]
skgrf is available on PyPI and can be installed using pip:
pip install skgrf
The grf software, being a derivative of ranger, is licensed under GPLv3, so skgrf is also licensed under GPLv3.