library(ggplot2)
library(dplyr)
library(tidyr)
library(randomForestSRC)
if (requireNamespace("ggRandomForests", quietly = TRUE)) {
library(ggRandomForests)
} else if (requireNamespace("pkgload", quietly = TRUE)) {
pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE)
} else {
stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")
}
theme_set(theme_bw())Random Forest Regression with ggRandomForests
2026-04-28
Source:vignettes/ggRandomForests-regression.qmd
Work in progress
This vignette is under active development. Code examples and narrative may change before the next release.
Introduction
Random forests (Breiman 2001) are a non-parametric ensemble method that requires no distributional or functional assumptions on how covariates relate to the response. The method builds a large collection of de-correlated decision trees via bootstrap aggregation (bagging) and random feature selection, then averages their predictions to produce a stable, low-variance estimator. The randomForestSRC package (Ishwaran and Kogalur 2024) provides a unified implementation for survival, regression, and classification forests.
ggRandomForests extracts tidy data objects from rfsrc fits and renders them with ggplot2 (Wickham 2016), making it straightforward to explore how a forest is constructed, which variables matter, and how the response depends on individual predictors.
This vignette demonstrates a complete random forest regression workflow on the Boston Housing data set (Harrison and Rubinfeld 1978; Belsley et al. 1980):
- Data exploration — EDA scatter panels, variable descriptions
- Growing the forest — fitting an RF, checking OOB error convergence
-
Variable selection — VIMP and minimal depth via
max.subtree() -
Dependence plots — variable dependence and partial dependence via
gg_variable()andgg_partial_rfsrc() - Variable interactions — conditioning plots and interactive 3-D partial dependence surfaces with plotly
Data: Boston Housing Values
The Boston Housing data (Harrison and Rubinfeld 1978; Belsley et al. 1980) is a standard benchmark for regression. It contains data for 506 census tracts of Boston from the 1970 census. The objective is to predict the median value of owner-occupied homes (medv, in $1000s) from 13 predictors covering crime, zoning, industry, environmental quality, and housing characteristics. We use the copy from the MASS package (Venables and Ripley 2002).
data(Boston, package = "MASS")
Boston$chas <- as.logical(Boston$chas) # nolint: object_name_linter
st_labs <- c(
crim = "Crime rate by town",
zn = "Residential land zoned > 25k sq ft (%)",
indus = "Non-retail business acres (%)",
chas = "Borders Charles River",
nox = "Nitrogen oxides (10 ppm)",
rm = "Rooms per dwelling",
age = "Units built before 1940 (%)",
dis = "Distance to employment centers",
rad = "Highway accessibility index",
tax = "Property tax rate per $10,000",
ptratio = "Pupil-teacher ratio",
black = "Proportion of Black residents",
lstat = "Lower status population (%)",
medv = "Median home value ($1000s)"
)Exploratory data analysis
We plot each predictor against the response (medv), coloring by the sole categorical variable (chas, whether the tract borders the Charles River). A loess smooth highlights the marginal trend.
dta <- Boston |>
pivot_longer(c(-medv, -chas), names_to = "variable", values_to = "value")
ggplot(dta, aes(x = medv, y = value, color = chas)) +
geom_point(alpha = 0.4) +
geom_smooth(aes(x = medv, y = value), color = "grey30",
inherit.aes = FALSE, se = FALSE) +
labs(y = "", x = st_labs["medv"]) +
scale_color_brewer(palette = "Set2") +
facet_wrap(~variable, scales = "free_y", ncol = 3)
Even from this simple view, the strong relationship between medv and both lstat (lower status %) and rm (rooms per dwelling) is apparent. We expect the random forest to confirm these as the most important predictors.
Growing a Random Forest
We grow a regression forest using all 13 predictors. The rfsrc() function detects the regression family from the continuous response.
rfsrc_Boston <- rfsrc(medv ~ ., data = Boston, # nolint: object_name_linter
importance = TRUE, err.block = 5)
rfsrc_Boston#> Sample size: 506
#> Number of trees: 500
#> Forest terminal node size: 5
#> Average no. of terminal nodes: 67.006
#> No. of variables tried at each split: 5
#> Total no. of variables: 13
#> Resampling used to grow trees: swor
#> Resample size used to grow trees: 320
#> Analysis: RF-R
#> Family: regr
#> Splitting rule: mse *random*
#> Number of random split points: 10
#> (OOB) R squared: 0.86919169
#> (OOB) Requested performance error: 11.06464628
The forest grew 500 trees, splitting on 5 randomly selected candidate variables at each node, and stopping at a minimum terminal node size of 5.
OOB error convergence
gg_e <- gg_error(rfsrc_Boston)
gg_e <- gg_e |> filter(!is.na(error))
class(gg_e) <- c("gg_error", class(gg_e))
plot(gg_e)
The error stabilizes well before 500 trees, indicating the forest is large enough for reliable predictions.
OOB predictions
plot(gg_rfsrc(rfsrc_Boston), alpha = 0.5) +
coord_cartesian(ylim = c(5, 49))
Each point is a single tract’s OOB prediction. The distribution is a sanity check — we are more interested in the why behind these predictions.
Variable Selection
Variable importance (VIMP)
VIMP measures the increase in prediction error when a variable is randomly permuted (Breiman 2001). Large positive values mean the variable is essential; negative values suggest noise is more informative.

lstat and rm dominate, with a clear gap to the remaining predictors. All VIMP values are positive, indicating every predictor contributes at least marginally.
Minimal depth
Minimal depth (Ishwaran et al. 2010) ranks variables by how close to the root node they first split, on average. Variables that partition large portions of the population early are considered most important.
md_Boston <- max.subtree(rfsrc_Boston) # nolint: object_name_linterThe threshold is 3.01, selecting 6 variables: crim, nox, rm, dis, ptratio, lstat.
Both VIMP and minimal depth agree on the dominance of lstat and rm. We use the minimal depth top variables for the remainder of the analysis.
xvar <- md_Boston$topvarsVariable Dependence
Variable dependence plots
Variable dependence shows each tract’s OOB predicted medv plotted against a predictor, with a loess smooth indicating the trend.
gg_v <- gg_variable(rfsrc_Boston)
plot(gg_v, xvar = xvar, panel = TRUE, alpha = 0.5) +
labs(y = st_labs["medv"], x = "")
The panels confirm what EDA suggested: medv decreases sharply with lstat and increases with rm, both in strongly non-linear ways. The remaining variables show weaker but still discernible trends.

Most tracts do not border the Charles River, and the predicted value distributions largely overlap — consistent with chas ranking last in both VIMP and minimal depth.
Partial dependence
Partial dependence integrates out the effects of all other covariates, giving a risk-adjusted view of each predictor’s marginal effect (Friedman 2001):
We use gg_partial_rfsrc(), which calls randomForestSRC::partial.rfsrc() directly and returns a gg_partial_rfsrc object. The quickest path to a plot is plot(pd), which handles continuous and categorical variables automatically. For a custom layout we can also access pd$continuous directly.
pd <- gg_partial_rfsrc(rfsrc_Boston, xvar.names = xvar)
# Quick S3 plot — works out of the box for the standard regression case
plot(pd)
For a publication-ready layout with custom axis labels, access the underlying data frame directly:
ggplot(pd$continuous, aes(x = x, y = yhat)) +
geom_line(color = "steelblue", linewidth = 1) +
facet_wrap(~name, scales = "free_x") +
labs(y = st_labs["medv"], x = "") +
theme_bw()
lstat shows a strongly concave relationship, while rm has a flat region below ~6 rooms before increasing sharply. These non-linear shapes are difficult to capture with simple parametric transforms but are naturally handled by the random forest.
Variable Interactions and Conditioning Plots
Conditioning on a categorical variable
The simplest coplot conditions on a categorical variable. Here we examine medv vs. lstat, split by Charles River status:
gg_v$chas_label <- ifelse(gg_v$chas, "Borders Charles River",
"Does not border")
plot(gg_v, xvar = "lstat", alpha = 0.5) +
labs(y = st_labs["medv"], x = st_labs["lstat"]) +
theme(legend.position = "none") +
facet_wrap(~chas_label)
The decreasing trend holds in both groups, with slightly higher values along the Charles River at every lstat level.
Conditioning on a continuous variable
To investigate the lstat–rm interaction, we bin rm into six quantile groups using quantile_pts() and facet:
rm_pts <- quantile_pts(rfsrc_Boston$xvar$rm, groups = 6, intervals = TRUE)
gg_v$rm_grp <- cut(rfsrc_Boston$xvar$rm, breaks = rm_pts)
levels(gg_v$rm_grp) <- paste("rm in", levels(gg_v$rm_grp))
plot(gg_v, xvar = "lstat", alpha = 0.5) +
labs(y = st_labs["medv"], x = st_labs["lstat"]) +
theme(legend.position = "none") +
scale_color_brewer(palette = "Set3") +
facet_wrap(~rm_grp)
Median values decrease with lstat within every rm group, but the intercept shifts upward with more rooms. Smaller homes in low-lstat (high-status) neighborhoods still command high prices.
The complement view — medv vs. rm, conditional on lstat groups — completes the picture:
lstat_pts <- quantile_pts(rfsrc_Boston$xvar$lstat, groups = 6,
intervals = TRUE)
gg_v$lstat_grp <- cut(rfsrc_Boston$xvar$lstat, breaks = lstat_pts)
levels(gg_v$lstat_grp) <- paste("lstat in", levels(gg_v$lstat_grp))
plot(gg_v, xvar = "rm", alpha = 0.5) +
labs(y = st_labs["medv"], x = st_labs["rm"]) +
theme(legend.position = "none") +
scale_color_brewer(palette = "Set3") +
facet_wrap(~lstat_grp)
The rm effect is strongest in low-lstat tracts (bottom-left panels) and nearly flat in high-lstat tracts, confirming a meaningful interaction.
Interactive Partial Dependence Surface
To visualize the joint partial dependence of medv on lstat and rm, we compute partial dependence on a grid: 25 values of rm, each evaluated at 25 points along lstat.
rm_grid <- quantile_pts(rfsrc_Boston$xvar$rm, groups = 25)
surface_list <- lapply(rm_grid, function(rm_val) {
newx <- rfsrc_Boston$xvar
newx$rm <- rm_val
pd_rm <- gg_partial_rfsrc(rfsrc_Boston, xvar.names = "lstat", newx = newx)
df <- pd_rm$continuous
df$rm <- rm_val
df
})
surface_df <- bind_rows(surface_list)
if (requireNamespace("plotly", quietly = TRUE)) {
library(plotly)
surface_wide <- surface_df |>
select(lstat = x, rm, medv = yhat) |>
arrange(rm, lstat)
lstat_vals <- sort(unique(surface_wide$lstat))
rm_vals <- sort(unique(surface_wide$rm))
z_matrix <- matrix(surface_wide$medv,
nrow = length(rm_vals),
ncol = length(lstat_vals),
byrow = TRUE)
plot_ly(x = lstat_vals, y = rm_vals, z = z_matrix) |>
add_surface(colorscale = "Viridis", showscale = TRUE) |>
layout(
scene = list(
xaxis = list(title = "Lower Status (%)"),
yaxis = list(title = "Rooms per Dwelling"),
zaxis = list(title = "Median Value ($1000s)")
)
)
} else {
message("Install the plotly package for interactive 3D surfaces.")
ggplot(surface_df, aes(x = x, y = rm, fill = yhat)) +
geom_tile() +
scale_fill_viridis_c(name = "Median Value") +
labs(x = "Lower Status (%)", y = "Rooms per Dwelling") +
theme_bw()
}Interactive partial dependence surface: median home value as a function of lstat and rm.
The surface confirms the strong interaction: home values are highest when lstat is low and rm is high (upper-left corner), dropping steeply along both axes. The non-planar curvature — particularly the sharp step near rm = 7 — demonstrates the kind of complex, non-linear structure that random forests capture naturally.
Conclusion
This vignette demonstrated a complete random forest regression analysis using randomForestSRC and ggRandomForests:
-
OOB error via
gg_error()showed the forest stabilized quickly. -
VIMP via
gg_vimp()and minimal depth viamax.subtree()both identifiedlstatandrmas the dominant predictors, with a clear gap to the remaining variables. -
Variable dependence via
gg_variable()revealed strongly non-linear predictor–response relationships that match the raw data EDA. -
Partial dependence via
gg_partial_rfsrc()provided risk-adjusted confirmation, showing concavelstatand threshold-likermeffects. -
Conditioning plots and interactive surfaces exposed the
lstat–rminteraction, where the room-size effect is strongest in high-status neighborhoods.
The ggRandomForests design separates data extraction from plotting: every gg_*() function returns a tidy data frame that can be plotted with the package’s plot() methods or fed directly into custom ggplot2 workflows.