Prediction for boosted multivariate trees for longitudinal data.
predict.boostmtree
Obtain predicted values. Also returns test-set performance if the test data contains y-outcomes.
Usage
# S3 method for class 'boostmtree'
predict(object,
x,
tm,
id,
y,
M,
eps = 1e-5,
useCVflag = FALSE,
...)
Arguments
- object
A boosting object of class
(boostmtree, grow).
- x
Data frame (or matrix) containing test set x-values. Rows must be duplicated to match the number of time points for an individual. If missing, the training x values are used and
tm, id and y are not required, and no performance values are returned.
- tm
Time values for each test set individual with one entry for each row of
x. Optional, but if missing, the set of unique time values from the training data is used for each individual and no test-set performance values are returned.
- id
Unique subject identifier, one entry for each row in
x. Optional, but if missing, each individual is assumed to have a full time-profile specified by the unique time values from the training data.
- y
Test set y-values, with one entry for each row in
x.
- M
Fixed value for the boosting step number. Leave this empty to determine the optimized value obtained by minimizing test-set error.
- eps
Tolerance value used for determining the optimal
M. For experts only.
- useCVflag
Should the predicted value be based on the estimate derived from the out-of-bag (OOB) sample?
- ...
Further arguments passed to or from other methods.
Details
The predicted time profile and performance values are obtained for test data from the boosted object grown on the training data.
R-side parallel processing is implemented by replacing the R function
lapply with mclapply found in the parallel
package. You can set the number of cores accessed by
mclapply by issuing the command options(mc.cores =
x), where x is the number of cores. For example, the
following command uses all available cores:
options(mc.cores=detectCores())
However, this can create high RAM usage, especially when using the
function partialPlot, which calls the predict
function.
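As a more conservative variant of the command above, one can leave a core free to ease the RAM pressure just mentioned. A minimal sketch; the one-core margin is a common heuristic, not a package requirement:

```r
## Cap mc.cores at (available cores - 1), but never below 1.
## detectCores() is from the base 'parallel' package, so no
## extra installation is needed.
library(parallel)
n.cores <- max(1, detectCores() - 1)
options(mc.cores = n.cores)
getOption("mc.cores")
```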
Note that all performance values (for example prediction error) are standardized by the overall y-standard deviation. Thus, reported RMSE (root-mean-squared-error) is actually standardized RMSE. Values are reported at the optimal stopping time.
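The standardization described above can be sketched in plain R; the vectors y and yhat below are illustrative stand-ins, not objects returned by the package:

```r
## Standardized RMSE: raw RMSE divided by the overall y-standard deviation.
set.seed(1)
y    <- rnorm(50, mean = 2, sd = 3)      # made-up observed outcomes
yhat <- y + rnorm(50, sd = 0.5)          # made-up predictions
rmse       <- sqrt(mean((y - yhat)^2))   # raw RMSE
rmse.stand <- rmse / sd(y)               # standardized RMSE, as reported
rmse.stand
```

A standardized RMSE well below 1 indicates predictions substantially more accurate than simply predicting the overall mean.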
Value
An object of class (boostmtree, predict), which is a list with the
following components:
- boost.obj
The original boosting object.
- x
The test x-values, but with only one row per individual (i.e. duplicated rows are removed).
- time
List with each component containing the time points for a given test individual.
- id
Sorted subject identifier.
- y
List containing the test y-values.
- Y
y-values in list format, where a nominal or ordinal response is converted into a binary response.
- family
Family of
y.
- ymean
Overall mean of y-values for all individuals. If
family = "Binary", "Nominal" or "Ordinal", then ymean = 0.
- ysd
Overall standard deviation of y-values for all individuals. If
family = "Binary", "Nominal" or "Ordinal", then ysd = 1.
- xvar.names
X-variable names.
- K
Number of terminal nodes.
- n
Total number of subjects.
- ni
Number of repeated measures for each subject.
- n.Q
Number of class labels for non-continuous response.
- Q_set
Class labels for the non-continuous response.
- y.unq
Unique y-values for the non-continuous response.
- nu
Boosting regularization parameter.
- D
Design matrix for each subject.
- df.D
Number of columns of
D.
- time.unq
Vector of the unique time points.
- baselearner
List of length M containing the base learners.
- gamma
List of length M, with each component containing the boosted tree fitted values.
- membership
List of length M, with each component containing the terminal node membership for a given boosting iteration.
- mu
Estimated mean profile at the optimized
M.
- Prob_class
For family == "Ordinal", this provides individual class probabilities rather than cumulative probabilities.
- muhat
Mean profile extrapolated to all unique time points, evaluated at the optimized
M.
- Prob_hat_class
Prob_class extrapolated to all unique time points, evaluated at the optimized
M.
- err.rate
Test set standardized l1-error and RMSE.
- rmse
Test set standardized RMSE at the optimized
M.
- Mopt
The optimized
M.
References
Pande A., Li L., Rajeswaran J., Ehrlinger J., Kogalur U.B., Blackstone E.H., Ishwaran H. (2017). Boosted multivariate trees for longitudinal data, Machine Learning, 106(2): 277–305.
Examples
if (FALSE) { # \dontrun{
##------------------------------------------------------------
## Synthetic example (Response is continuous)
##
## High correlation, quadratic time with quadratic interaction
## largish number of noisy variables
##
## Illustrates how modified gradient improves performance
## also compares performance to ideal and well specified linear models
##----------------------------------------------------------------------------
## simulate the data
## simulation 2: main effects (x1, x3, x4), quad-time-interaction (x2)
dtaO <- simLong(n = 100, ntest = 100, model = 2, family = "Continuous", q = 25)
## save the data as both a list and data frame
dtaL <- dtaO$dtaL
dta <- dtaO$dta
## get the training data
trn <- dtaO$trn
## save formulas for linear model comparisons
f.true <- dtaO$f.true
f.linr <- "y~g( x1+x2+x3+x4+x1*time+x2*time+x3*time+x4*time )"
## modified tree gradient (default)
o.1 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350)
p.1 <- predict(o.1, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## non-modified tree gradient (nmtg)
o.2 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350, mod.grad = FALSE)
p.2 <- predict(o.2, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## set rho = 0
o.3 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Continuous", M = 350, rho = 0)
p.3 <- predict(o.3, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
## rmse values compared to generalized least squares (GLS)
## for true model and well specified linear models (LM)
cat("true LM :", boostmtree:::gls.rmse(f.true,dta,trn),"\n")
cat("well specified LM :", boostmtree:::gls.rmse(f.linr,dta,trn),"\n")
cat("boostmtree :", p.1$rmse,"\n")
cat("boostmtree (nmtg):", p.2$rmse,"\n")
cat("boostmtree (rho=0):", p.3$rmse,"\n")
## predicted value plots
plot(p.1)
plot(p.2)
plot(p.3)
##------------------------------------------------------------
## Synthetic example (Response is binary)
##
## High correlation, quadratic time with quadratic interaction
## largish number of noisy variables
##----------------------------------------------------------------------------
## simulate the data
## simulation 2: main effects (x1, x3, x4), quad-time-interaction (x2)
dtaO <- simLong(n = 100, ntest = 100, model = 2, family = "Binary", q = 25)
## save the data as both a list and data frame
dtaL <- dtaO$dtaL
dta <- dtaO$dta
## get the training data
trn <- dtaO$trn
## save formulas for linear model comparisons
f.true <- dtaO$f.true
f.linr <- "y~g( x1+x2+x3+x4+x1*time+x2*time+x3*time+x4*time )"
## modified tree gradient (default)
o.1 <- boostmtree(dtaL$features[trn, ], dtaL$time[trn], dtaL$id[trn], dtaL$y[trn],
                  family = "Binary", M = 350)
p.1 <- predict(o.1, dtaL$features[-trn, ], dtaL$time[-trn], dtaL$id[-trn], dtaL$y[-trn])
} # }