Prediction for boosted multivariate trees for longitudinal data.
predict.boostmtree.Rd
Obtain predicted values. Also returns test-set performance if the test data contains y-outcomes.
Usage
# S3 method for class 'boostmtree'
predict(object,
        x,
        tm,
        id,
        y,
        M,
        eps = 1e-5,
        useCVflag = FALSE,
        ...)
Arguments
- object
A boosting object of class (boostmtree, grow).
- x
Data frame (or matrix) containing test set x-values. Rows must be duplicated to match the number of time points for an individual. If missing, the training x-values are used; in that case tm, id and y are not required and no performance values are returned.
- tm
Time values for each test set individual, with one entry for each row of x. Optional, but if missing, the set of unique time values from the training data is used for each individual and no test-set performance values are returned.
- id
Unique subject identifier, with one entry for each row of x. Optional, but if missing, each individual is assumed to have a full time profile specified by the unique time values from the training data.
- y
Test set y-values, with one entry for each row of x.
- M
Fixed value for the boosting step number. Leave this empty to use the optimized value obtained by minimizing test-set error.
- eps
Tolerance value used for determining the optimal M. For experts only.
- useCVflag
Should the predicted value be based on the estimate derived from the out-of-bag (OOB) sample?
- ...
Further arguments passed to or from other methods.
Details
The predicted time profile and performance values are obtained for test data from the boosted object grown on the training data.
R-side parallel processing is implemented by replacing the R function lapply with mclapply from the parallel package. You can set the number of cores accessed by mclapply by issuing the command options(mc.cores = x), where x is the number of cores. For example, issuing the following options command uses all available cores:
options(mc.cores = detectCores())
However, this can create high RAM usage, especially when using the function partialPlot, which calls the predict function.
Note that all performance values (for example prediction error) are standardized by the overall y-standard deviation. Thus, reported RMSE (root-mean-squared-error) is actually standardized RMSE. Values are reported at the optimal stopping time.
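To make the standardization concrete, here is an illustrative sketch of how a standardized RMSE of this kind is computed. This is not the package's internal code, and the vectors y.obs and y.pred are made-up values for the example only:

```r
## Illustrative sketch (hypothetical data): standardized RMSE is the raw
## RMSE divided by the overall standard deviation of the observed y-values.
y.obs  <- c(1.2, 0.8, 2.5, 3.1, 1.9)   # observed responses, pooled over subjects/times
y.pred <- c(1.0, 1.1, 2.2, 2.9, 2.0)   # corresponding predicted values
ysd    <- sd(y.obs)                    # overall y-standard deviation
rmse.std <- sqrt(mean((y.obs - y.pred)^2)) / ysd
rmse.std
```

A standardized RMSE is unit-free, so values can be compared across responses measured on different scales.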
Value
An object of class (boostmtree, predict), which is a list with the following components:
- boost.obj
The original boosting object.
- x
The test x-values, but with only one row per individual (i.e., duplicated rows are removed).
- time
List with each component containing the time points for a given test individual.
- id
Sorted subject identifier.
- y
List containing the test y-values.
- Y
y-values in list format, where a nominal or ordinal response is converted into a binary response.
- family
Family of y.
- ymean
Overall mean of y-values for all individuals. If family = "Binary", "Nominal" or "Ordinal", ymean = 0.
- ysd
Overall standard deviation of y-values for all individuals. If family = "Binary", "Nominal" or "Ordinal", ysd = 1.
- xvar.names
X-variable names.
- K
Number of terminal nodes.
- n
Total number of subjects.
- ni
Number of repeated measures for each subject.
- n.Q
Number of class labels for the non-continuous response.
- Q_set
Class labels for the non-continuous response.
- y.unq
Unique y-values for the non-continuous response.
- nu
Boosting regularization parameter.
- D
Design matrix for each subject.
- df.D
Number of columns of D.
- time.unq
Vector of the unique time points.
- baselearner
List of length M containing the base learners.
- gamma
List of length M, with each component containing the boosted tree fitted values.
- membership
List of length M, with each component containing the terminal node membership for a given boosting iteration.
- mu
Estimated mean profile at the optimized M.
- Prob_class
For family = "Ordinal", this provides individual probabilities rather than cumulative probabilities.
- muhat
Mean profile extrapolated to all unique time points, evaluated at the optimized M.
- Prob_hat_class
Prob_class extrapolated to all unique time points, evaluated at the optimized M.
- err.rate
Test set standardized l1-error and RMSE.
- rmse
Test set standardized RMSE at the optimized M.
- Mopt
The optimized M.
References
Pande A., Li L., Rajeswaran J., Ehrlinger J., Kogalur U.B., Blackstone E.H., Ishwaran H. (2017). Boosted multivariate trees for longitudinal data, Machine Learning, 106(2): 277–305.
Examples
if (FALSE) { # \dontrun{
##------------------------------------------------------------
## Synthetic example (Response is continuous)
##
## High correlation, quadratic time with quadratic interaction
## largish number of noisy variables
##
## Illustrates how modified gradient improves performance
## also compares performance to ideal and well specified linear models
##----------------------------------------------------------------------------
## simulate the data
## simulation 2: main effects (x1, x3, x4), quad-time-interaction (x2)
dtaO <- simLong(n = 100, ntest = 100, model = 2, family = "Continuous", q = 25)
## save the data as both a list and data frame
dtaL <- dtaO$dtaL
dta <- dtaO$dta
## get the training data
trn <- dtaO$trn
## save formulas for linear model comparisons
f.true <- dtaO$f.true
f.linr <- "y~g( x1+x2+x3+x4+x1*time+x2*time+x3*time+x4*time )"
## modified tree gradient (default)
o.1 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350)
p.1 <- predict(o.1, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## non-modified tree gradient (nmtg)
o.2 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350, mod.grad = FALSE)
p.2 <- predict(o.2, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## set rho = 0
o.3 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350, rho = 0)
p.3 <- predict(o.3, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## rmse values compared to generalized least squares (GLS)
## for true model and well specified linear models (LM)
cat("true LM            :", boostmtree:::gls.rmse(f.true, dta, trn), "\n")
cat("well specified LM  :", boostmtree:::gls.rmse(f.linr, dta, trn), "\n")
cat("boostmtree         :", p.1$rmse, "\n")
cat("boostmtree (nmtg)  :", p.2$rmse, "\n")
cat("boostmtree (rho=0) :", p.3$rmse, "\n")
## predicted value plots
plot(p.1)
plot(p.2)
plot(p.3)
##------------------------------------------------------------
## Synthetic example (Response is binary)
##
## High correlation, quadratic time with quadratic interaction
## largish number of noisy variables
##----------------------------------------------------------------------------
## simulate the data
## simulation 2: main effects (x1, x3, x4), quad-time-interaction (x2)
dtaO <- simLong(n = 100, ntest = 100, model = 2, family = "Binary", q = 25)
## save the data as both a list and data frame
dtaL <- dtaO$dtaL
dta <- dtaO$dta
## get the training data
trn <- dtaO$trn
## save formulas for linear model comparisons
f.true <- dtaO$f.true
f.linr <- "y~g( x1+x2+x3+x4+x1*time+x2*time+x3*time+x4*time )"
## modified tree gradient (default)
o.1 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Binary", M = 350)
p.1 <- predict(o.1, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
} # }