27  Variable and partial dependence

27.1 When to use it

Variable importance (the previous chapter) ranks predictors but says nothing about how a predictor moves the prediction. Once VIMP tells you bilirubin matters, the next question is the shape of that effect: does predicted survival fall steadily as bilirubin climbs, drop off a cliff past some threshold, or level out at the high end? Dependence plots answer that. Reach for them whenever you have a forest that predicts well and you need to describe what a specific variable is doing, for a results figure, a clinical discussion, or your own understanding of the model.

Two plots fill the gap, and the difference between them is the whole point of this chapter:

  • Variable dependence shows the forest’s predicted value for every observation against that observation’s actual predictor value. It is the raw, marginal picture, one point per patient, scatter and all.
  • Partial dependence averages over every other predictor to isolate the effect of one variable. It is a smoother, model-level summary: the trend that remains once the contribution of the other variables has been averaged out.

Both come from ggRandomForests (Ehrlinger 2026): gg_variable() for the first, gg_partial_rfsrc() for the second. Each returns a data object whose plot() method hands back a bare ggplot that you finish with a house theme.

27.2 The data it needs

Like VIMP, dependence plots read from a fitted rfsrc object, not a raw data frame. We reuse the pbc survival forest from the previous chapter, fit on the primary biliary cirrhosis cohort: one row per patient, a follow-up time, an event indicator, and a panel of liver labs.

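library(randomForestSRC)
library(ggRandomForests)
data(pbc, package = "randomForestSRC")
# Survival forest on all predictors; importance = TRUE also computes VIMP
rf <- rfsrc(Surv(days, status) ~ ., data = pbc, ntree = 100,
            importance = TRUE)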

For a survival forest there is one extra ingredient: a follow-up time at which to read the prediction. Survival probability is not a single number but a curve over time, so dependence is always evaluated at a chosen horizon. The median follow-up is a natural default.

t_eval <- median(pbc$days)
t_eval
[1] 1730
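
Reading a survival curve at a horizon amounts to a step-function lookup. The sketch below illustrates the idea on a hypothetical three-patient survival matrix (surv) and time grid (time_grid), neither taken from the fitted forest. read_at() is an illustrative helper, not part of randomForestSRC or ggRandomForests, and the packages may match the requested time to the grid differently (for example, by nearest event time).

```r
# Hypothetical survival curves: rows are patients, columns are grid times
time_grid <- c(100, 500, 1000, 2000, 3000)
surv <- rbind(
  p1 = c(0.99, 0.95, 0.90, 0.70, 0.50),
  p2 = c(0.97, 0.85, 0.60, 0.30, 0.10),
  p3 = c(1.00, 0.98, 0.96, 0.92, 0.85)
)

# A survival curve is a right-continuous step function, so its value at t
# is the value at the last grid time that is <= t
read_at <- function(surv, time_grid, t) {
  idx <- findInterval(t, time_grid)
  if (idx == 0) return(rep(1, nrow(surv)))  # before the first grid time
  surv[, idx]
}

s_1730 <- read_at(surv, time_grid, 1730)
s_1730
```

Each patient contributes one number at the chosen horizon, which is what ends up on the y-axis of a dependence plot.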

27.3 Build it

gg_variable() returns one predicted survival probability per observation. For a survival forest the plot() method needs both a time to evaluate at and an xvar selection naming which predictors to draw; without a time point it has nothing to put on the y-axis. Setting panel = TRUE lays the chosen predictors out side by side. Each point is one patient and the blue curve is a loess smooth through the cloud.

gg_v <- gg_variable(rf)
plot(gg_v, xvar = c("bili", "age", "albumin"), time = t_eval,
     panel = TRUE) +
  theme_hv_manuscript()
Figure 27.1: Variable dependence for bilirubin, age, and albumin: predicted survival at the median follow-up against each patient’s predictor value, with a loess smooth through the cloud

Reading the smooths: predicted survival at 1730 days falls steeply as serum bilirubin rises, declines gently with age, and tracks upward with higher albumin. These are the marginal trends actually present in the data, scatter included, and that scatter is information: it warns you how much patient-to-patient spread underlies each trend.
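
The two things to read off a variable dependence panel, the smooth and the spread around it, can be sketched in base R. This is a minimal illustration on simulated data: lab and pred are hypothetical stand-ins for a predictor and the forest's predicted survival, and stats::loess() is used here on the assumption that it approximates the smoother drawn in the figure.

```r
set.seed(1)

# Hypothetical cloud: 200 patients, a lab value, and a predicted survival
# probability that falls as the lab value rises (plogis keeps it in (0, 1))
lab   <- runif(200, min = 0.5, max = 10)
pred  <- plogis(2 - 0.5 * lab + rnorm(200, sd = 0.4))
cloud <- data.frame(lab, pred)

# Loess smooth through the cloud, evaluated on a grid inside the data range
sm    <- loess(pred ~ lab, data = cloud)
grid  <- data.frame(lab = seq(1, 9, length.out = 9))
trend <- predict(sm, newdata = grid)

# Patient-to-patient spread around the smooth
spread <- sd(residuals(sm))

round(trend, 2)
spread
```

The trend is the headline; the spread says how far an individual patient's prediction can sit from it.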

27.4 Read it

The two plots answer different questions, and reading them well means keeping the difference straight.

  • Variable dependence shows the raw trend, confounding and all. The bilirubin panel in Figure 27.1 mixes bilirubin’s own effect with the fact that high-bilirubin patients also tend to differ in age, albumin, and the rest. The scatter tells you how tight the relationship is: a wide cloud means patients with the same bilirubin still receive very different predictions.
  • Partial dependence shows the adjusted trend. The partial-dependence plot (Figure 27.2) holds the other predictors at their observed distribution and varies bilirubin alone, so the curve is bilirubin’s marginal effect with the other variables averaged out. When the two plots disagree, the gap largely reflects confounding.
  • The shape is the message. Look for whether the curve is monotone (steady in one direction), thresholded (flat then steep, or steep then flat), or genuinely flat (the variable has little marginal pull at this time point even if VIMP ranked it). Report the time you read it at, because a different horizon can give a different shape.
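
The three shapes can be told apart with a rough rule of thumb. The helper below is an illustrative heuristic of our own, not a function from any package: describe_shape(), its thresholds (flat_tol, step_share), and the three example curves are all assumptions chosen for the sketch.

```r
# Label a curve's shape from its values y on an increasing predictor grid.
describe_shape <- function(y, flat_tol = 0.02, step_share = 0.8) {
  total <- diff(range(y))
  if (total < flat_tol) return("flat")
  d <- diff(y)
  if (!(all(d <= 0) || all(d >= 0))) return("non-monotone")
  # Thresholded: the steepest third of the intervals carries most of the change
  k     <- ceiling(length(d) / 3)
  share <- sum(sort(abs(d), decreasing = TRUE)[seq_len(k)]) / sum(abs(d))
  if (share >= step_share) "thresholded" else "monotone"
}

# Hypothetical predicted-survival curves on a 10-point grid
steady <- seq(0.9, 0.3, length.out = 10)
cliff  <- c(0.90, 0.90, 0.89, 0.89, 0.60, 0.35, 0.34, 0.34, 0.33, 0.33)
level  <- 0.75 + c(0, 0.005, 0, -0.005, 0, 0.005, 0, 0, -0.005, 0)

sapply(list(steady = steady, cliff = cliff, level = level), describe_shape)
```

A label like this is a prompt for looking at the figure, not a substitute for it; the cutoffs would need tuning for any real curve.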

To see partial dependence, gg_partial_rfsrc() drives randomForestSRC’s partial-prediction machinery directly. For a grid of bili values it predicts every patient as if they had that value, then averages, giving the marginal effect of bili with everything else held at its observed distribution. This is the slow step, so we restrict it to a single variable at the median time and a modest evaluation grid (n_eval) to keep the render quick.

gg_p <- gg_partial_rfsrc(rf, xvar.names = "bili",
                         partial.time = t_eval, n_eval = 25)
plot(gg_p) + geom_point() + theme_hv_manuscript()
Figure 27.2: Partial dependence for bilirubin at the median follow-up: the adjusted dose-response curve with the other predictors averaged out

The partial-dependence curve is the adjusted version of the bilirubin panel in Figure 27.1: the same downward trend, but with age, albumin, and the rest averaged out rather than mixed in. The patient-level scatter has collapsed into a single, interpretable dose-response line, a compact summary of how bilirubin moves predicted survival in the model.
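
The averaging step, and the gap it opens between the raw and adjusted trends, can be sketched by hand on a toy model. This is a minimal sketch under stated assumptions: a linear model (lm) stands in for the forest, and x1, x2, and y are simulated rather than drawn from pbc. It is not how gg_partial_rfsrc() is implemented.

```r
set.seed(42)

# Simulated data: x2 is correlated with x1, and both drive the outcome
n   <- 500
x1  <- rnorm(n)
x2  <- 0.8 * x1 + rnorm(n, sd = 0.6)
y   <- 1 * x1 + 2 * x2 + rnorm(n, sd = 0.5)
toy <- data.frame(y, x1, x2)
fit <- lm(y ~ x1 + x2, data = toy)

# Variable dependence: each row's prediction against its own observed x1
yhat      <- predict(fit, newdata = toy)
raw_slope <- unname(coef(lm(yhat ~ toy$x1))[2])

# Partial dependence: force every row to each grid value of x1, leave x2
# as observed, predict, then average over rows
grid <- seq(min(toy$x1), max(toy$x1), length.out = 25)
pd <- sapply(grid, function(v) {
  forced    <- toy
  forced$x1 <- v
  mean(predict(fit, newdata = forced))
})
pd_slope <- unname(coef(lm(pd ~ grid))[2])

c(raw = raw_slope, partial = pd_slope)
```

The raw slope is steeper because it absorbs the effect of x2 travelling along with x1; the partial slope recovers the model's own coefficient for x1. The difference between the two is the confounding the chapter describes.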

27.5 Variations

27.5.1 A different follow-up time

Because survival dependence is read at a horizon, the natural variation is to read it at another one. Swap t_eval for an early landmark and the same variable can show a flatter or steeper effect, since the predictors that drive early events are not always the ones that drive late ones. Whenever you change the time, change the caption too.
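
One reason the shape moves with the horizon can be seen in a toy calculation. The sketch assumes a hypothetical exponential survival model with made-up constants (h0, beta); it illustrates only how a fixed hazard relationship looks different on the survival scale at two time points, not a time-varying effect.

```r
# Hypothetical model: S(t | x) = exp(-h0 * exp(beta * x) * t),
# where the hazard rises with a lab value x
surv_at <- function(x, t, h0 = 1e-4, beta = 0.3) {
  exp(-h0 * exp(beta * x) * t)
}

x_grid <- seq(0, 10, by = 2)
early  <- surv_at(x_grid, t = 365)    # one-year landmark
late   <- surv_at(x_grid, t = 1730)   # the chapter's median follow-up

# Total drop in predicted survival across the grid at each horizon
drop_early <- early[1] - early[length(early)]
drop_late  <- late[1] - late[length(late)]

round(rbind(early, late), 3)
c(drop_early = drop_early, drop_late = drop_late)
```

The same predictor produces a shallower curve at one year than at the median follow-up, which is why two figures read at different horizons should not be compared without saying so.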

27.5.2 Lowering the cost of partial dependence

Partial dependence does not refit the forest; it recomputes predictions for every observation at each grid value, so its cost scales with n_eval, the number of observations, the number of variables, and ntree. If a figure is too slow, lower n_eval first (a coarser grid still shows the shape), then restrict to the one or two variables you actually need to discuss, and only then reduce ntree.
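
A back-of-the-envelope count makes the ordering concrete. The cost model below is an assumption for illustration (every grid value sends every observation through every tree), not a profiled benchmark, and n_obs is a hypothetical cohort size.

```r
# Rough count of tree evaluations for a partial-dependence run
pd_cost <- function(n_eval, n_obs, n_vars, ntree) {
  n_eval * n_obs * n_vars * ntree
}

n_obs  <- 300  # hypothetical number of patients
base   <- pd_cost(n_eval = 25, n_obs = n_obs, n_vars = 3, ntree = 100)
coarse <- pd_cost(n_eval = 10, n_obs = n_obs, n_vars = 3, ntree = 100)
single <- pd_cost(n_eval = 10, n_obs = n_obs, n_vars = 1, ntree = 100)

# Cost relative to the baseline run
c(coarse = coarse / base, single = single / base)
```

Under this model, coarsening the grid and dropping to one variable cuts the work substantially while leaving the forest itself, and therefore the predictions, unchanged.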

27.6 Pitfalls

  • Survival dependence needs a time and an xvar. On a survival forest, gg_variable() has nothing to plot until you tell plot() which time point to evaluate and (with panel = TRUE) which predictors to draw. Omit the time and the call fails to produce a figure, because predicted survival is undefined without a horizon.
  • Pick the time deliberately and report it. Changing time or partial.time can change the shape of the curve. Choose a clinically meaningful horizon, state it, and do not quietly compare a figure read at the median with one read at five years.
  • Order of operations. Use VIMP to choose which variables to profile here. There is no point spending a slow partial-dependence run on a variable the forest already told you it ignores.
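
The last point can be turned into a small selection step. The VIMP values below are hypothetical numbers chosen for illustration, and the cutoff floor_vimp is an arbitrary assumption; in practice the values would come from the fitted forest.

```r
# Hypothetical VIMP values, named by predictor
vimp <- c(bili = 0.071, age = 0.024, albumin = 0.019,
          sex = -0.001, trig = 0.0004)

# Keep predictors whose VIMP clears a small positive floor,
# ordered from most to least important
floor_vimp <- 0.001
to_profile <- names(sort(vimp[vimp > floor_vimp], decreasing = TRUE))
to_profile
```

Only the variables that survive this filter are worth the cost of a partial-dependence run.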