
Quantile regression is now supported in the latest version (0.3.0) of `skranger`. This feature was available in the R package, but didn't make its way into the Python package until just recently.

The implementation comes from Meinshausen’s 2006 paper on the topic, titled *Quantile Regression Forests*.

In regression forests, each leaf node of each tree records the average target value of the observations that drop down to it. For quantile regression, each leaf node records *all* target values. This allows computation of quantiles from new observations by evaluating the quantile at the terminal node of each tree and averaging the values.
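As a toy sketch of that idea (not `skranger`'s actual code, and with made-up leaf values), the per-tree quantile-and-average step looks like this:

```python
import numpy as np

# Hypothetical data: leaf_values[i] holds all training targets stored in the
# terminal node that a new observation reaches in tree i
leaf_values = [
    np.array([10.0, 12.0, 15.0, 20.0]),
    np.array([11.0, 14.0, 18.0]),
    np.array([9.0, 13.0, 16.0, 21.0, 25.0]),
]

def predict_quantile(leaf_values, q):
    # evaluate the quantile in each tree's terminal node, then average
    return np.mean([np.quantile(vals, q) for vals in leaf_values])

median = predict_quantile(leaf_values, 0.5)
# per-tree medians are 13.5, 14.0, 16.0, so the prediction is their mean, 14.5
```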

Storing every target value in every leaf requires a lot of memory and computation. To avoid this, the implementation instead randomly samples one observed target value for each terminal node. This works because, in the limit of the number of trees, the underlying weights calculated across training samples correspond to the probability that a particular sample is selected for each terminal node. The approximation therefore becomes more accurate as the number of trees increases, while being *significantly* faster and having a much smaller memory footprint.
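Continuing the toy sketch above (again with hypothetical leaf values, not `skranger`'s actual code), the approximation replaces each leaf's full list of targets with a single randomly sampled value, and quantiles are then computed over those samples pooled across trees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: all training targets in the terminal node a new
# observation reaches, one array per tree
leaf_values = [
    np.array([10.0, 12.0, 15.0, 20.0]),
    np.array([11.0, 14.0, 18.0]),
    np.array([9.0, 13.0, 16.0, 21.0, 25.0]),
]

# Approximation: keep only one randomly sampled target per terminal node,
# so each tree stores a single scalar instead of the full list
sampled = np.array([rng.choice(vals) for vals in leaf_values])

# quantiles for the new observation come from the pooled samples
q10, q90 = np.quantile(sampled, [0.1, 0.9])
```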

The implementation in the ranger R package is based on the code in the quantregForest R package. The approximate method is explained here. `skranger`'s code uses the same implementation in Python using NumPy.

The `RangerForestRegressor` predictor uses `ranger`'s ForestRegression class. It also supports quantile regression using the `predict_quantiles` method.

As far as I know, this method is unique to `skranger`. The closest implementation of quantile regression I could find was the `GradientBoostingRegressor` in `sklearn`, which requires using the `set_params()` method to alter the quantile value between calls to `predict()`.
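For comparison, a sketch of that `sklearn` workflow might look like the following (using a synthetic dataset; note that since each model is fit for a single `alpha`, changing the quantile via `set_params()` also means refitting):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# one model configuration per quantile: alpha selects the target quantile
gbr = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0)
gbr.fit(X, y)
upper = gbr.predict(X[:5])

# to predict a different quantile, alter alpha and fit again
gbr.set_params(alpha=0.1)
gbr.fit(X, y)
lower = gbr.predict(X[:5])
```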

Here is an example of quantile regression with `RangerForestRegressor` using the Boston housing dataset. Quantile regression must be enabled explicitly at instantiation time with the kwarg `quantiles=True`:

```python
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skranger.ensemble import RangerForestRegressor
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
rfr = RangerForestRegressor()
rfr.fit(X_train, y_train)
predictions = rfr.predict(X_test)
print(predictions)
# [18.39205325 21.41698333 14.29509221 35.34981667 27.64378333 20.98569135
# 21.15996673 14.0288093 9.44657947 29.99185 19.3774 11.88189465
# ...
# 11.08502822 36.80993636 18.29633154 12.90448354 20.94311667 11.45154934
# 41.44466667]
# enable quantile regression on instantiation
rfr = RangerForestRegressor(quantiles=True)
rfr.fit(X_train, y_train)
quantile_lower = rfr.predict_quantiles(X_test, quantiles=[0.1])
print(quantile_lower)
# [12.9 17. 8. 28. 22. 10.9 7. 8. 5. 20.8 16.9 7. 8. 18.
# 22. 19. 29. 21. 19. 19. 22. 10.9 20. 16. 14. 20. 9.8 22.9
# ...
# 16. 17. 12. 20. 13. 26. 19. 21.9 7. 14.9 13. 8. 17.9 7.9
# 29. ]
quantile_upper = rfr.predict_quantiles(X_test, quantiles=[0.9])
print(quantile_upper)
# [23. 27. 21. 44. 32.1 50. 50. 18.2 12. 43. 22. 17. 17. 24.
# 31.1 25. 37. 28. 23. 24. 28. 18. 28. 23. 23. 26. 17.1 43.
# ...
# 22. 24. 20. 28. 18. 44.2 24. 33.4 15.1 50. 21. 17. 25. 13.
# 50. ]
```