Tidy data from a varPro isolation-forest fit

Pulls per-observation anomaly scores out of a isopro fit so you can plot them, sort them, or write them to disk without having to know the internal shape of the fit.

Usage

gg_isopro(object, ..., newdata = NULL)

Arguments

object: An isopro fit returned by isopro.
...: Currently unused. Present before newdata so that newdata is only matched by name, preserving backward compatibility with callers of the PR #94 signature gg_isopro(object, ...).
newdata: Optional data.frame of new observations to score against the fit. Must be passed by name. When NULL (default) the extractor returns the in-sample tidy frame from the fit's stored $case.depth and $howbad. When supplied, each row is scored via predict.isopro and the same tidy shape is returned for the test data.

Value

A data.frame of class c("gg_isopro", "data.frame"), one row per observation. Columns:

obs: Integer; observation index 1..n, in the same order as the rows of the data passed to isopro.
case.depth: Numeric; mean isolation depth across the forest. Lower means the observation was isolated quickly, so more anomalous.
howbad: Numeric in [0, 1]; the case.depth values pushed through their own empirical CDF and flipped so higher means more anomalous. This is the score the plot method draws by default.

A provenance attribute records source = "varPro::isopro", the observation count n, and the number of trees ntree.

What isopro is doing

An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown on very small subsamples of the data and asked to split until each observation lands in its own terminal node. The intuition is geometric: a typical observation sits in the dense middle of the feature cloud and takes many splits to isolate, while an unusual observation sits out near an edge and gets cut off after only a few. So the depth at which an observation is isolated is a proxy for how typical it is: shallow depth means anomalous, deep depth means ordinary. Average a single observation's depth across many trees and the noise washes out, leaving a stable per-observation rank.

isopro supports three flavours of isolation forest, which differ in how the splits are chosen:

"rnd": The original Liu/Ting/Zhou method: each tree node picks a variable at random and a split point uniformly at random in the variable's range. Fast, no model, surprisingly effective.
"unsupv": Unsupervised splitting from randomForestSRC: splits are chosen to separate the data along the directions of highest variance. More structured than "rnd"; sometimes more accurate, especially when the anomalies follow a coherent direction.
"auto": An auto-encoder formulation that grows a multivariate forest predicting each feature from the others. Most expressive, slowest, best suited to low-dimensional data.

No method is universally best. The varPro authors recommend trying at least two and comparing the score distributions; the plot method here colors per-method curves automatically when you stack the results.

What's in the output

The fit gives back two parallel per-observation vectors: case.depth is the raw mean isolation depth (units of "splits", lower = more anomalous) and howbad is the same information on a [0, 1] scale, the empirical CDF of case.depth. On the fit those two run the same way: varPro::isopro()'s howbad is lower = more anomalous, like the depth it came from.

We flip it. gg_isopro() returns 1 - object$howbad, so the column you get is higher = more anomalous and reads the way a score should. That is our transformation, not the fit's, and it means gg$howbad is 1 - fit$howbad rather than a copy of it – worth knowing if you compare the two side by side. Both columns are kept so you can plot in either space and have the raw depth on hand for diagnostics; howbad is the canonical score and is what the plot method uses by default.

What you use this for

This is screening, not inference. Reach for it when you want to:

flag observations that may be data-entry errors, out-of-range measurements, or distinct subpopulations before fitting a primary model;
check whether a held-out cohort sits inside the training distribution before scoring with a model trained elsewhere;
give the analyst a ranked list of "look at these first" cases for a manual review;
score a held-out cohort or a fresh batch of incoming data against a fitted model and compare the test scores to the training distribution.

The score is a rank, not a probability of being an outlier: two observations with howbad = 0.92 are both unusual, not "92\ likely to be anomalous". Pick a cutoff by looking at where the elbow rises; plot.gg_isopro can annotate either a score (threshold) or a top-percent (top_n_pct) for you.

Scoring new data

Pass a data.frame as newdata and the extractor calls predict.isopro twice: once with quantiles = FALSE to get the raw mean case depth per row, and once with quantiles = TRUE to get the per-row quantile of that depth against the training-data depth distribution.

varPro's predict.isopro returns quantiles where smaller is more anomalous, which is the opposite polarity of the wrapper's howbad (where higher is more anomalous). The wrapper exposes both conventions so nothing is hidden:

case.depth carries varPro's native polarity, lower = more anomalous. This is the unmodified output of predict(object, newdata, quantiles = FALSE). Use it to cross-reference against raw varPro output.
howbad is the flipped, wrapper-convention version. The relationship is howbad = 1 - predict(object, newdata, quantiles = TRUE).

To overlay training and test scores in one plot, bind the two extractor calls with a method label column (the same column plot.gg_isopro uses to color rnd / unsupv / auto comparisons):


gg_train <- gg_isopro(fit)
gg_test  <- gg_isopro(fit, newdata = test_df)
gg_both  <- rbind(cbind(gg_train, method = "train"),
                  cbind(gg_test,  method = "test"))
class(gg_both) <- c("gg_isopro", "data.frame")
plot(gg_both)

Comparing methods

To compare methods ("rnd", "unsupv", "auto"), call gg_isopro on each fit and dplyr::bind_rows() the results with a method label column. The plot method auto-detects method and colors the curves.

References

Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest. Eighth IEEE International Conference on Data Mining, 413-422.

Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent Variable Selection via the Rule-Based Variable Priority Framework. R package version 3.x.

Examples

# \donttest{
if (requireNamespace("varPro", quietly = TRUE)) {
  set.seed(1)
  fit <- varPro::isopro(data = iris[, 1:4], method = "rnd",
                        sampsize = 32, ntree = 50)
  gg <- gg_isopro(fit)
  plot(gg)
}

# }