Quantile Regression Forests in skranger

2020-10-28

(Photo by Johannes Plenio on Unsplash)

Quantile regression is now supported in the latest version (0.3.0) of skranger. This feature was already available in the ranger R package, but didn't make its way into the Python package until just recently.

How it works

The implementation comes from Meinshausen’s 2006 paper on the topic, titled Quantile Regression Forests.

In regression forests, each leaf node of each tree records the average target value of the observations that drop down to it. For quantile regression, each leaf node instead records all target values. This allows quantiles to be computed for new observations by evaluating the quantile at the terminal node of each tree and averaging the values across trees.
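
As a minimal sketch of that idea (illustrative numpy only, not skranger's actual internals), assume leaf_values holds, for a single new observation, the training targets stored in its terminal node in each tree:

import numpy as np

def forest_quantile(leaf_values, q):
    """Estimate the q-th conditional quantile for one observation.

    leaf_values: list of 1-D arrays, one per tree, containing all of the
    training targets recorded in the leaf the observation falls into.
    """
    # evaluate the quantile within each tree's terminal node ...
    per_tree = [np.quantile(values, q) for values in leaf_values]
    # ... then average the per-tree quantiles across the forest
    return np.mean(per_tree)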

Storing every target value this way requires a lot of memory and computation. To avoid this, we randomly sample one observed target value for each terminal node. We can do this because, in the limit of the number of trees, the underlying weights calculated across training samples correspond to the probability that a particular sample is selected for each terminal node. The approximation therefore becomes more accurate as the number of trees grows, while being significantly faster to compute and having a much smaller memory footprint.
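
A sketch of the approximation (again illustrative only, with a hypothetical sample_leaf helper): each leaf retains a single randomly drawn target at fit time, and the quantile is then taken over the one stored value per tree:

import numpy as np

rng = np.random.default_rng(0)

def sample_leaf(leaf_targets):
    # at fit time, retain one randomly sampled target per terminal node
    return rng.choice(leaf_targets)

def approx_forest_quantile(sampled_values, q):
    # sampled_values: the single stored target from each tree's terminal
    # node for this observation; one value per tree in the forest
    return np.quantile(sampled_values, q)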

The implementation in the ranger R package is based on the code in the quantregForest R package, where the approximate method is explained. skranger uses the same implementation, ported to Python with numpy.

How to use it

The RangerForestRegressor predictor wraps ranger's ForestRegression class, and supports quantile regression through the predict_quantiles method.

As far as I know, this method is unique to skranger. The closest implementation of quantile regression I could find was the GradientBoostingRegressor in sklearn, which requires altering the quantile value with set_params() and refitting between calls to predict().
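
For comparison, the sklearn workflow looks roughly like this (alpha is a training-time parameter of the quantile loss, so each quantile gets its own fit):

from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# alpha sets the target quantile for the quantile (pinball) loss
gbr = GradientBoostingRegressor(loss="quantile", alpha=0.1)
gbr.fit(X_train, y_train)
lower = gbr.predict(X_test)

gbr.set_params(alpha=0.9)
gbr.fit(X_train, y_train)  # refit with the new quantile
upper = gbr.predict(X_test)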

Here is an example of quantile regression with RangerForestRegressor using the Boston housing dataset. Quantile regression must be enabled explicitly at instantiation with the kwarg quantiles=True:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skranger.ensemble import RangerForestRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

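# fit a standard regression forest; predict() returns averaged tree predictions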
rfr = RangerForestRegressor()
rfr.fit(X_train, y_train)

predictions = rfr.predict(X_test)
print(predictions)
# [18.39205325 21.41698333 14.29509221 35.34981667 27.64378333 20.98569135
#  21.15996673 14.0288093   9.44657947 29.99185    19.3774     11.88189465
#  ...
#  11.08502822 36.80993636 18.29633154 12.90448354 20.94311667 11.45154934
#  41.44466667]

# enable quantile regression on instantiation
rfr = RangerForestRegressor(quantiles=True)
rfr.fit(X_train, y_train)

quantile_lower = rfr.predict_quantiles(X_test, quantiles=[0.1])
print(quantile_lower)
# [12.9 17.   8.  28.  22.  10.9  7.   8.   5.  20.8 16.9  7.   8.  18.
#  22.  19.  29.  21.  19.  19.  22.  10.9 20.  16.  14.  20.   9.8 22.9
#  ...
#  16.  17.  12.  20.  13.  26.  19.  21.9  7.  14.9 13.   8.  17.9  7.9
#  29. ]
quantile_upper = rfr.predict_quantiles(X_test, quantiles=[0.9])
print(quantile_upper)
# [23.  27.  21.  44.  32.1 50.  50.  18.2 12.  43.  22.  17.  17.  24.
#  31.1 25.  37.  28.  23.  24.  28.  18.  28.  23.  23.  26.  17.1 43.
#  ...
#  22.  24.  20.  28.  18.  44.2 24.  33.4 15.1 50.  21.  17.  25.  13.
#  50. ]
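
The two quantile predictions can then be paired into rough 80% prediction intervals (a usage sketch; array shapes assumed from the printed output above):

import numpy as np

# pair the 0.1 and 0.9 quantiles per observation
intervals = np.column_stack([quantile_lower, quantile_upper])

# empirical coverage of the held-out targets
coverage = np.mean((y_test >= intervals[:, 0]) & (y_test <= intervals[:, 1]))
print(coverage)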

Further reading

quantile regression
ranger
skranger
python