Generate Synthetic Cardiac Surgery Survival Data
Source:R/generate_survival_data.R
generate_survival_data.RdCreates a simulated dataset that mimics a cardiac surgery cohort, suitable
for testing survival analysis workflows in rf.survival.study.qmd.
Variable labels are attached via labelled::var_label() so they are
compatible with haven and hvtiRutilities::label_map().
Arguments
- n
Integer. Number of patients to simulate. Default
500.- seed
Integer. Random seed passed to
set.seedfor reproducibility. Default1024.
Value
A data.frame with n rows and 24 columns:
- ccfid
Character. Patient identifier (
"PT00001", …).- origin_year
Integer. Calendar year corresponding to
iv_opyrs = 0(randomly sampled between 1998 and 2018).- iv_opyrs
Numeric. Observation interval length in years measured from
origin_year; always greater than or equal toiv_dead.- iv_dead
Numeric. Observed follow-up time in years (min of true survival time and administrative censoring time).
- dead
Integer. Event indicator:
1= death,0= censored.- reop
Integer. Reoperation indicator:
1= yes,0= no.- iv_reop
Numeric. Time to reoperation (years);
NAif no reoperation.- age
Numeric. Age at surgery in years (range 1–85).
- sex
Factor.
"Male"/"Female"(55 / 45 %).- bmi
Numeric. Body mass index in kg/m² (range 15–50).
- hgb_bs
Numeric. Baseline hemoglobin in g/dL (range 6–18).
- wbc_bs
Numeric. Baseline WBC count in K/µL (range 1.5–20).
- plate_bs
Numeric. Baseline platelet count in K/µL (range 50–500).
- gfr_bs
Numeric. Baseline eGFR in mL/min/1.73 m² (range 10–120).
- lvefvs_b
Numeric. Baseline LV ejection fraction in % (range 15–75).
- lvmass_b
Numeric. Baseline LV mass in g (range 60–400).
- lvmsi_b
Numeric. Baseline LV mass index in g/m² (range 40–220); derived as
lvmass_b / (bmi * 0.5).- stvoli_b
Numeric. Baseline systolic stroke-volume index in mL/m² (range 20–110).
- stvold_b
Numeric. Baseline diastolic stroke-volume index in mL/m² (range 40–180).
- bypass_time
Numeric. Cardiopulmonary bypass time in minutes (range 20–240).
- xclamp_time
Numeric. Aortic cross-clamp time in minutes; sampled as 50–75 % of
bypass_time.- nyha_class
Ordered factor. NYHA functional class I–IV.
- diabetes
Factor. Diabetes mellitus:
"No"/"Yes"(75 / 25 %).- hypertension
Factor. Hypertension:
"No"/"Yes"(55 / 45 %).
Details
Survival times are drawn from a Weibull distribution (shape = 1.5, ~8-year median follow-up) with a linear predictor built from LVEF, age, baseline hemoglobin, NYHA class, and eGFR. Administrative censoring is applied uniformly between 1 and 15 years. Reoperation probability is modelled via a logistic function of bypass time and NYHA class.
The returned data frame includes the five columns excluded from
model_data in the template (ccfid, reop,
iv_reop, origin_year, iv_opyrs), plus the survival
outcome (iv_dead, dead) and 17 clinical predictors spanning
demographics, pre-operative labs,
cardiac function, and surgical variables.
The global RNG state is saved and restored on exit, so calling this function does not affect subsequent random number generation in the session.
Examples
dta <- generate_survival_data(n = 200, seed = 42)
str(dta)
#> 'data.frame': 200 obs. of 24 variables:
#> $ ccfid : chr "PT00001" "PT00002" "PT00003" "PT00004" ...
#> ..- attr(*, "label")= chr "Patient ID"
#> $ origin_year : int 2018 2006 1999 2014 2009 2011 1999 2000 2017 2015 ...
#> ..- attr(*, "label")= chr "Calendar year for iv_opyrs = 0"
#> $ iv_opyrs : num 3.77 3.99 8.64 13.54 2.28 ...
#> ..- attr(*, "label")= chr "Observation interval (years) since origin_year"
#> $ iv_dead : num 3.77 2.76 8.64 9.16 2.28 3.72 0.77 5.65 1.09 5.97 ...
#> ..- attr(*, "label")= chr "Follow-up time to death (years)"
#> $ dead : int 0 1 0 1 0 0 1 0 0 1 ...
#> ..- attr(*, "label")= chr "Death indicator (1=dead, 0=censored)"
#> $ reop : int 0 1 0 0 0 1 1 0 0 0 ...
#> ..- attr(*, "label")= chr "Reoperation (1=yes, 0=no)"
#> $ iv_reop : num NA 1.81 NA NA NA 1.29 0.72 NA NA NA ...
#> ..- attr(*, "label")= chr "Follow-up time to reoperation (years)"
#> $ age : num 65.6 36.5 50.4 54.5 51.1 43.4 67.7 43.6 75.3 44.1 ...
#> ..- attr(*, "label")= chr "Age at surgery (years)"
#> $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
#> ..- attr(*, "label")= chr "Sex"
#> $ bmi : num 27 30.8 27.2 30.7 26.3 26.7 29.4 32 20.8 26.8 ...
#> ..- attr(*, "label")= chr "Body mass index (kg/m2)"
#> $ hgb_bs : num 14.9 14.6 13 13.2 11.7 12.6 11.1 11.3 10.8 14.5 ...
#> ..- attr(*, "label")= chr "Baseline hemoglobin (g/dL)"
#> $ wbc_bs : num 8.24 8.48 5 6.69 4.98 5.91 4.48 4.71 9.07 6.82 ...
#> ..- attr(*, "label")= chr "Baseline WBC count (K/uL)"
#> $ plate_bs : num 281 204 226 192 362 50 225 227 234 119 ...
#> ..- attr(*, "label")= chr "Baseline platelet count (K/uL)"
#> $ gfr_bs : num 62 54.9 64.3 72.8 87 83.3 72.9 57.9 97.5 93.3 ...
#> ..- attr(*, "label")= chr "Baseline eGFR (mL/min/1.73m2)"
#> $ lvefvs_b : num 43.1 48.4 65.9 60.1 53.6 53.9 62.5 52.8 55.7 38.5 ...
#> ..- attr(*, "label")= chr "Baseline LV ejection fraction (%)"
#> $ lvmass_b : num 147 172 126 226 136 ...
#> ..- attr(*, "label")= chr "Baseline LV mass (g)"
#> $ lvmsi_b : num 40 40 40 40 40 40 40 40 40 40 ...
#> ..- attr(*, "label")= chr "Baseline LV mass index (g/m2)"
#> $ stvoli_b : num 76.7 53.7 31.4 74.4 56.3 69.5 84.1 71.2 49.7 53 ...
#> ..- attr(*, "label")= chr "Baseline SV index - systolic (mL/m2)"
#> $ stvold_b : num 97.4 131.6 119.8 130.4 61.6 ...
#> ..- attr(*, "label")= chr "Baseline SV index - diastolic (mL/m2)"
#> $ bypass_time : num 72 118 88 62 71 73 96 106 77 80 ...
#> ..- attr(*, "label")= chr "Cardiopulmonary bypass time (min)"
#> $ xclamp_time : num 50 76 64 34 40 47 62 69 41 56 ...
#> ..- attr(*, "label")= chr "Aortic cross-clamp time (min)"
#> $ nyha_class : Ord.factor w/ 4 levels "I"<"II"<"III"<..: 4 2 2 3 3 3 2 2 3 3 ...
#> ..- attr(*, "label")= chr "NYHA functional class"
#> $ diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 2 1 1 ...
#> ..- attr(*, "label")= chr "Diabetes mellitus"
#> $ hypertension: Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 1 2 1 ...
#> ..- attr(*, "label")= chr "Hypertension"
table(dta$dead) # event rate
#>
#> 0 1
#> 92 108
summary(dta$iv_dead) # follow-up distribution
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.250 2.440 4.100 4.888 6.612 13.980
# Drop admin columns to match model_data in the Quarto template
model_data <- dta[, !names(dta) %in% c("ccfid", "reop", "iv_reop",
"origin_year", "iv_opyrs")]