4 Data governance and reproducibility

Every other chapter in this book ends in a figure. This one comes first in practice, because reproducible analysis starts before the first plot. A documented dataset, typed and labelled variables, and a manifest pinning the exact extract are the difference between a figure you can defend in review and one you cannot. The hvtiRutilities (Ehrlinger 2026) package supplies the primitives for that work: synthetic data for examples, a data dictionary, label and type maps, dataset manifests, and version diffs.

4.1 When to use it

Reach for this workflow at the start of any analysis that will end up in front of a reviewer, a co-author, or your future self. The moment to run it is the moment a data extract lands: before you fit a model, before you draw a curve, while you still remember which warehouse pull this was and what date it covered. The payoff comes later, when someone asks “which version of the data is this figure built on?” and you can answer with a hash instead of a shrug, or when a re-delivery arrives and you need to know in seconds whether anything moved.

None of this requires touching protected health information. Everything below runs on synthetic data, and the same helpers run unchanged against your real extract once the pipeline is wired up. Think of it as the paperwork that makes the science reproducible, kept light enough that you will actually do it.

4.2 The data it needs

The workflow needs a clinical data frame, one row per patient, with whatever columns the study carries. For the examples we generate a synthetic survival cohort so the chapter runs end to end without any real patient data. generate_survival_data() is deterministic: fix the seed and every table and figure downstream is reproducible.

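library(hvtiRutilities)  # governance helpers used throughout this chapter
library(gt)              # table rendering for the dictionary and label map
dat <- generate_survival_data(n = 200, seed = 42)
str(dat)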
'data.frame':   200 obs. of  24 variables:
 $ ccfid       : chr  "PT00001" "PT00002" "PT00003" "PT00004" ...
  ..- attr(*, "label")= chr "Patient ID"
 $ origin_year : int  2018 2006 1999 2014 2009 2011 1999 2000 2017 2015 ...
  ..- attr(*, "label")= chr "Calendar year for iv_opyrs = 0"
 $ iv_opyrs    : num  3.77 3.99 8.64 13.54 2.28 ...
  ..- attr(*, "label")= chr "Observation interval (years) since origin_year"
 $ iv_dead     : num  3.77 2.76 8.64 9.16 2.28 3.72 0.77 5.65 1.09 5.97 ...
  ..- attr(*, "label")= chr "Follow-up time to death (years)"
 $ dead        : int  0 1 0 1 0 0 1 0 0 1 ...
  ..- attr(*, "label")= chr "Death indicator (1=dead, 0=censored)"
 $ reop        : int  0 1 0 0 0 1 1 0 0 0 ...
  ..- attr(*, "label")= chr "Reoperation (1=yes, 0=no)"
 $ iv_reop     : num  NA 1.81 NA NA NA 1.29 0.72 NA NA NA ...
  ..- attr(*, "label")= chr "Follow-up time to reoperation (years)"
 $ age         : num  65.6 36.5 50.4 54.5 51.1 43.4 67.7 43.6 75.3 44.1 ...
  ..- attr(*, "label")= chr "Age at surgery (years)"
 $ sex         : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
  ..- attr(*, "label")= chr "Sex"
 $ bmi         : num  27 30.8 27.2 30.7 26.3 26.7 29.4 32 20.8 26.8 ...
  ..- attr(*, "label")= chr "Body mass index (kg/m2)"
 $ hgb_bs      : num  14.9 14.6 13 13.2 11.7 12.6 11.1 11.3 10.8 14.5 ...
  ..- attr(*, "label")= chr "Baseline hemoglobin (g/dL)"
 $ wbc_bs      : num  8.24 8.48 5 6.69 4.98 5.91 4.48 4.71 9.07 6.82 ...
  ..- attr(*, "label")= chr "Baseline WBC count (K/uL)"
 $ plate_bs    : num  281 204 226 192 362 50 225 227 234 119 ...
  ..- attr(*, "label")= chr "Baseline platelet count (K/uL)"
 $ gfr_bs      : num  62 54.9 64.3 72.8 87 83.3 72.9 57.9 97.5 93.3 ...
  ..- attr(*, "label")= chr "Baseline eGFR (mL/min/1.73m2)"
 $ lvefvs_b    : num  43.1 48.4 65.9 60.1 53.6 53.9 62.5 52.8 55.7 38.5 ...
  ..- attr(*, "label")= chr "Baseline LV ejection fraction (%)"
 $ lvmass_b    : num  147 172 126 226 136 ...
  ..- attr(*, "label")= chr "Baseline LV mass (g)"
 $ lvmsi_b     : num  40 40 40 40 40 40 40 40 40 40 ...
  ..- attr(*, "label")= chr "Baseline LV mass index (g/m2)"
 $ stvoli_b    : num  76.7 53.7 31.4 74.4 56.3 69.5 84.1 71.2 49.7 53 ...
  ..- attr(*, "label")= chr "Baseline SV index - systolic (mL/m2)"
 $ stvold_b    : num  97.4 131.6 119.8 130.4 61.6 ...
  ..- attr(*, "label")= chr "Baseline SV index - diastolic (mL/m2)"
 $ bypass_time : num  72 118 88 62 71 73 96 106 77 80 ...
  ..- attr(*, "label")= chr "Cardiopulmonary bypass time (min)"
 $ xclamp_time : num  50 76 64 34 40 47 62 69 41 56 ...
  ..- attr(*, "label")= chr "Aortic cross-clamp time (min)"
 $ nyha_class  : Ord.factor w/ 4 levels "I"<"II"<"III"<..: 4 2 2 3 3 3 2 2 3 3 ...
  ..- attr(*, "label")= chr "NYHA functional class"
 $ diabetes    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 2 1 1 ...
  ..- attr(*, "label")= chr "Diabetes mellitus"
 $ hypertension: Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 1 2 1 ...
  ..- attr(*, "label")= chr "Hypertension"

Notice that each column carries a label attribute. That label is not decoration. The governance helpers below read it and carry it forward into dictionaries, tables, and figure axes, so a variable described once stays described everywhere.
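
The two properties this section leans on, a fixed seed and per-column label attributes, can be sketched in base R. The generator toy_cohort() below is a hypothetical stand-in written for this illustration and is not part of hvtiRutilities; its columns and distributions are made up. Only set.seed(), attr(), and vapply() are doing real work here.

```r
# Hypothetical stand-in for a seeded generator; NOT part of hvtiRutilities.
toy_cohort <- function(n, seed) {
  set.seed(seed)  # fixing the seed makes every draw below repeatable
  d <- data.frame(
    age  = round(rnorm(n, mean = 55, sd = 12), 1),
    dead = rbinom(n, size = 1, prob = 0.3)
  )
  # Labels travel with the column as an ordinary attribute.
  attr(d$age,  "label") <- "Age at surgery (years)"
  attr(d$dead, "label") <- "Death indicator (1=dead, 0=censored)"
  d
}

a <- toy_cohort(n = 50, seed = 42)
b <- toy_cohort(n = 50, seed = 42)
same_seed_same_data <- identical(a, b)

# Collect every column's label into a named character vector.
labels <- vapply(a, function(x) attr(x, "label"), character(1))
print(same_seed_same_data)
print(labels)
```

One caveat worth knowing: plain base R subsetting such as a$age[1:10] drops custom attributes, the label included, which is one more reason to read labels through a helper rather than assume they survive every transformation.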

4.3 Build it

The governance workflow is a short sequence of steps. Each one produces an artifact you can save next to the analysis: a dictionary, a label map, a typing check, a manifest, and a comparison report. We walk them in order.

4.3.1 Step 1: a data dictionary

A data dictionary is the human-readable index of an extract. data_dictionary() returns a per-column summary as a data frame, one row per variable, with the variable name, its label, its class, the count of distinct values, the percent missing, and a compact distribution summary. It renders straight through gt() into a table you can drop into a supplement.

dd <- data_dictionary(dat)
gt(dd) |>
  tab_header(title = "Data dictionary")
Data dictionary
variable | label | class | n_unique | pct_missing | summary
ccfid | Patient ID | character | 200 | 0.0 | 200 levels: PT00001, PT00002, PT00003, PT00004, PT00005, ...
origin_year | Calendar year for iv_opyrs = 0 | integer | 21 | 0.0 | 1998 / 2008 / 2018
iv_opyrs | Observation interval (years) since origin_year | numeric | 183 | 0.0 | 1.06 / 7.87 / 14.99
iv_dead | Follow-up time to death (years) | numeric | 184 | 0.0 | 0.25 / 4.1 / 13.98
dead | Death indicator (1=dead, 0=censored) | integer | 2 | 0.0 | 0 / 1 / 1
reop | Reoperation (1=yes, 0=no) | integer | 2 | 0.0 | 0 / 0 / 1
iv_reop | Follow-up time to reoperation (years) | numeric | 31 | 83.5 | 0.04 / 1.29 / 9.92
age | Age at surgery (years) | numeric | 165 | 0.0 | 1 / 44.75 / 85
sex | Sex | factor | 2 | 0.0 | 2 levels: Female, Male
bmi | Body mass index (kg/m2) | numeric | 123 | 0.0 | 15 / 26.65 / 41.8
hgb_bs | Baseline hemoglobin (g/dL) | numeric | 66 | 0.0 | 7.6 / 13 / 18
wbc_bs | Baseline WBC count (K/uL) | numeric | 172 | 0.0 | 1.5 / 7.35 / 15.53
plate_bs | Baseline platelet count (K/uL) | numeric | 135 | 0.0 | 50 / 225 / 447
gfr_bs | Baseline eGFR (mL/min/1.73m2) | numeric | 174 | 0.0 | 25.9 / 75.75 / 120
lvefvs_b | Baseline LV ejection fraction (%) | numeric | 149 | 0.0 | 29.1 / 53.75 / 75
lvmass_b | Baseline LV mass (g) | numeric | 191 | 0.0 | 60 / 184.4 / 377.2
lvmsi_b | Baseline LV mass index (g/m2) | numeric | 1 | 0.0 | 40 / 40 / 40
stvoli_b | Baseline SV index - systolic (mL/m2) | numeric | 168 | 0.0 | 20.5 / 53.7 / 90.3
stvold_b | Baseline SV index - diastolic (mL/m2) | numeric | 171 | 0.0 | 40 / 93.35 / 153.6
bypass_time | Cardiopulmonary bypass time (min) | numeric | 100 | 0.0 | 20 / 88.5 / 194
xclamp_time | Aortic cross-clamp time (min) | numeric | 75 | 0.0 | 13 / 57 / 118
nyha_class | NYHA functional class | ordered | 4 | 0.0 | 4 levels: I, II, III, IV
diabetes | Diabetes mellitus | factor | 2 | 0.0 | 2 levels: No, Yes
hypertension | Hypertension | factor | 2 | 0.0 | 2 levels: No, Yes
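
To make the columns of that table concrete, here is a minimal base R sketch of how a per-column summary of this shape could be computed. It is an illustration only, not the implementation of data_dictionary(): the helper name sketch_dictionary() and the toy data frame are assumptions, the distribution summary column is left out, and counting only non-missing values as distinct is our choice rather than documented package behaviour.

```r
# Hypothetical helper; NOT the hvtiRutilities implementation.
sketch_dictionary <- function(df) {
  data.frame(
    variable = names(df),
    # Fall back to NA when a column has no "label" attribute.
    label = vapply(df, function(x) {
      lab <- attr(x, "label")
      if (is.null(lab)) NA_character_ else as.character(lab)
    }, character(1)),
    class = vapply(df, function(x) class(x)[1], character(1)),
    # Distinct non-missing values (an assumption about how to count).
    n_unique = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    pct_missing = vapply(df, function(x) round(100 * mean(is.na(x)), 1), numeric(1)),
    row.names = NULL
  )
}

toy <- data.frame(
  id      = c("PT1", "PT2", "PT3", "PT4"),
  iv_reop = c(NA, 1.81, NA, NA),
  dead    = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)
attr(toy$iv_reop, "label") <- "Follow-up time to reoperation (years)"

dd_toy <- sketch_dictionary(toy)
print(dd_toy)
```

Because the result is an ordinary data frame, it can be filtered, saved next to the analysis, or rendered as a table like any other.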

4.3.2 Step 2: variable labels in one place

label_map() returns the current variable-to-label mapping as a two-column (key, label) data frame. Holding labels in one place is what keeps an axis title in a figure and a column header in a table reading the same way. Change a label here and it propagates everywhere the map is read, so you never hand-edit “Ejection Fraction” in five places and miss the sixth.

lm <- label_map(dat)
gt(head(lm, 10))
key | label
ccfid | Patient ID
origin_year | Calendar year for iv_opyrs = 0
iv_opyrs | Observation interval (years) since origin_year
iv_dead | Follow-up time to death (years)
dead | Death indicator (1=dead, 0=censored)
reop | Reoperation (1=yes, 0=no)
iv_reop | Follow-up time to reoperation (years)
age | Age at surgery (years)
sex | Sex
bmi | Body mass index (kg/m2)
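
One way to use such a map is as a lookup vector, so that plot axes and table headers pull their text from the same place. The sketch below hand-builds a three-row map with the same key and label columns shown above rather than calling label_map(); the helper label_for() is hypothetical.

```r
# Hand-built stand-in for the (key, label) data frame shown above.
lm_toy <- data.frame(
  key   = c("age", "bmi", "dead"),
  label = c("Age at surgery (years)",
            "Body mass index (kg/m2)",
            "Death indicator (1=dead, 0=censored)"),
  stringsAsFactors = FALSE
)

# Named vector: look a label up by variable name.
lookup <- setNames(lm_toy$label, lm_toy$key)

# Hypothetical helper: fall back to the raw name when no label exists.
label_for <- function(vars, lookup) {
  out <- unname(lookup[vars])
  ifelse(is.na(out), vars, out)
}

axis_title <- label_for("age", lookup)
headers    <- label_for(c("bmi", "dead", "unlabelled_col"), lookup)
print(axis_title)
print(headers)
```

The same string can then be handed to whatever draws the output, for example as the xlab argument of a base plot() call, so the figure and the table cannot drift apart.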

4.3.3 Step 3: infer and check R data types

Raw extracts arrive with everything as character or numeric, and the wrong type quietly breaks an analysis: an ID read as a number gets averaged, a 0/1 flag read as a count gets summed. r_data_types() classifies each column by role and returns the dataset with columns coerced to the inferred type, so character IDs become factors and 0/1 indicators become logical. Read that sentence twice: it returns the coerced data, not a report about it. To see what changed, compare the source class against the inferred class yourself.

typed <- r_data_types(dat)
data.frame(
  variable = names(dat),
  source   = vapply(dat,   \(x) class(x)[1], character(1)),
  inferred = vapply(typed, \(x) class(x)[1], character(1)),
  row.names = NULL
) |>
  subset(source != inferred) |>
  gt() |>
  tab_header(title = "Columns retyped by r_data_types()")
Columns retyped by r_data_types()
variable | source | inferred
ccfid | character | factor
dead | integer | logical
reop | integer | logical
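
Because the function hands back coerced data, it is worth confirming that the coercion lost nothing before you adopt the typed object. The sketch below does that in base R on a toy frame. The manual factor() and as.logical() calls are stand-ins for whatever r_data_types() does internally, which this chapter does not show.

```r
# Toy source data with the same kinds of columns that were retyped above.
src <- data.frame(
  ccfid = c("PT00001", "PT00002", "PT00003"),
  dead  = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Manual coercions standing in for the package's inference (an assumption).
typed_toy <- src
typed_toy$ccfid <- factor(src$ccfid)
typed_toy$dead  <- as.logical(src$dead)

# Round-trip each retyped column back to its source type and compare.
id_ok   <- identical(as.character(typed_toy$ccfid), src$ccfid)
flag_ok <- identical(as.integer(typed_toy$dead), src$dead)

# A flag column is only safe to make logical if it holds nothing but 0/1/NA.
only_01 <- all(src$dead %in% c(0L, 1L, NA))

print(c(id_ok = id_ok, flag_ok = flag_ok, only_01 = only_01))
```

If any of the three checks comes back FALSE, the retyping changed information rather than just representation, and that column deserves a look before it goes near a model.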

4.3.4 Step 4: pin the extract with a manifest

A dictionary documents what the data is; a manifest documents which data it is. A manifest records provenance for every extract: the source system, the extract date, the row count, and a SHA-256 hash of the file. update_manifest() appends an entry, hashing the file on disk, and verify_manifest() re-hashes the files in the data directory and checks them against the manifest to confirm nothing has drifted since you wrote it down. The chunk below is not evaluated because it writes files to disk; run it interactively, then commit the manifest (never the data) to the repo.

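saveRDS(dat, "cohort.rds")
# Temporary manifest path for the demo; in a project, use a path inside the repo.
mp <- tempfile(fileext = ".yaml")

update_manifest(
  file          = "cohort.rds",
  manifest_path = mp,
  extract_date  = Sys.Date(),
  n_rows        = nrow(dat),
  source        = "CORR DWH"
)

# Re-hash the files in data_dir and compare them with the manifest entries.
verify_manifest(manifest_path = mp, data_dir = ".", stop_on_error = FALSE)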

A representative manifest entry looks like this. The hash is the load-bearing field: it is what lets verify_manifest() catch a file that changed under you.

datasets:
- file: cohort.rds
  extract_date: '2026-05-28'
  n_rows: 200
  sha256: 8b3e0188eae294e61a16f0e09e3989e8b5249601209bde987fb5153c9ab1bd76
  source: CORR DWH
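
The check that a manifest enables can be sketched in base R: hash the file, record the digest, and later re-hash and compare. This illustrates the idea and is not how verify_manifest() is implemented. The helper hash_file() is hypothetical. It uses tools::sha256sum() where the installed R provides it (we believe it arrived in R 4.5.0, which is why the code checks for it instead of assuming) and otherwise falls back to tools::md5sum(), which is MD5 rather than SHA-256.

```r
# Hypothetical helper: SHA-256 when this R provides it, MD5 otherwise.
hash_file <- function(path) {
  tools_ns <- asNamespace("tools")
  if (exists("sha256sum", envir = tools_ns, inherits = FALSE)) {
    unname(get("sha256sum", envir = tools_ns)(path))
  } else {
    unname(tools::md5sum(path))  # fallback: MD5, not SHA-256
  }
}

# Write a small synthetic file to a temporary location.
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:3, dead = c(0L, 1L, 0L)), path)

# "Manifest" entry: here just a list held in memory.
entry <- list(file = basename(path), n_rows = 3L, hash = hash_file(path))

unchanged <- identical(hash_file(path), entry$hash)

# Simulate a silent re-delivery by overwriting the file with different rows.
saveRDS(data.frame(id = 1:4, dead = c(0L, 1L, 0L, 1L)), path)
drifted <- !identical(hash_file(path), entry$hash)

print(c(unchanged = unchanged, drifted = drifted))
unlink(path)
```

Note that the digest pins the file, not the data frame inside it: the same rows saved in a different format produce a different hash, so record the hash of the file you actually read.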

4.3.5 Step 5: compare versions on re-delivery

Data warehouses re-deliver extracts, and a re-delivery is rarely a no-op: rows get added, a column is renamed, a coding changes. compare_datasets() diffs the old and new versions and reports row-count and structural changes through its print method. It is the fast first check that a fresh pull is what you asked for before you rerun anything on it.

old <- generate_survival_data(n = 150, seed = 42)
new <- generate_survival_data(n = 200, seed = 99)
compare_datasets(old, new)
Dataset Comparison
  Rows: 150 -> 200
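
The structural part of such a diff (columns added, dropped, or retyped) is simple to sketch in base R. The helper sketch_compare() below is hypothetical and far smaller than compare_datasets(); it is here only to show what “structural change” means in code, on two toy data frames invented for the purpose.

```r
# Hypothetical helper; NOT the hvtiRutilities implementation.
sketch_compare <- function(old, new) {
  first_class <- function(df) vapply(df, function(x) class(x)[1], character(1))
  shared  <- intersect(names(old), names(new))
  old_cls <- first_class(old)[shared]
  new_cls <- first_class(new)[shared]
  list(
    rows    = c(old = nrow(old), new = nrow(new)),
    added   = setdiff(names(new), names(old)),
    removed = setdiff(names(old), names(new)),
    retyped = shared[old_cls != new_cls]
  )
}

old_toy <- data.frame(id = c("a", "b"), dead = c(0L, 1L), bmi = c(27.0, 30.8),
                      stringsAsFactors = FALSE)
new_toy <- data.frame(id = c("a", "b", "c"),
                      dead = c(FALSE, TRUE, FALSE),   # type changed
                      bmi_bs = c(27.0, 30.8, 27.2),   # column renamed
                      stringsAsFactors = FALSE)

diff_toy <- sketch_compare(old_toy, new_toy)
str(diff_toy)
```

A rename shows up here as one removal plus one addition, which is a useful reminder that a name-based diff cannot tell a renamed column from a dropped one; that judgement stays with you.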

4.4 Read it

Two of these artifacts are reports you read, not just files you store.

A data dictionary is read column by column, and the two columns that earn the most attention are percent-missing and distinct values. High missingness flags a variable you cannot lean on without saying so. A distinct-value count that surprises you (two levels where you expected a continuous measure, a thousand where you expected a category) usually means a typing or coding problem upstream, caught here before it reaches a model. The distribution summary is your sanity check on range: a negative age or an ejection fraction above 100 shows up immediately.
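
That reading can be partly automated. The sketch below copies four rows of the dictionary shown earlier into a small data frame by hand and flags the ones that deserve a second look. The 50 percent missingness threshold and the flag names are arbitrary assumptions for illustration, not recommendations from the package.

```r
# Four rows copied by hand from the dictionary output earlier in the chapter.
dd_rows <- data.frame(
  variable    = c("iv_reop", "age", "lvmsi_b", "sex"),
  class       = c("numeric", "numeric", "numeric", "factor"),
  n_unique    = c(31L, 165L, 1L, 2L),
  pct_missing = c(83.5, 0, 0, 0),
  stringsAsFactors = FALSE
)

miss_threshold <- 50  # arbitrary cut-off chosen for this example

flags <- data.frame(
  variable     = dd_rows$variable,
  high_missing = dd_rows$pct_missing > miss_threshold,
  constant     = dd_rows$n_unique <= 1L,
  # Numeric columns with only two values often turn out to be coded flags.
  maybe_flag   = dd_rows$class %in% c("numeric", "integer") & dd_rows$n_unique == 2L,
  stringsAsFactors = FALSE
)

needs_review <- flags$variable[flags$high_missing | flags$constant | flags$maybe_flag]
print(flags)
print(needs_review)
```

Both hits are instructive. lvmsi_b holds a single value, 40, for every patient, which is exactly the kind of distinct-value surprise described above. iv_reop is mostly missing, but it appears to be missing by design for patients with no reoperation, so a flag is a prompt to look, not a verdict.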

A dataset-comparison report is read for the diff, not the totals. A change in row count is expected on re-delivery and is fine if it matches what the warehouse told you to expect. The lines that should make you stop are structural ones: a column that appeared, disappeared, or changed type. Those silently break joins and recodes downstream, so confirm each one is intentional before you build anything on the new extract.

4.5 Pitfalls

  • Unpinned extracts. A figure built on “the data I pulled sometime last month” is not reproducible. Write a manifest the day the extract lands, commit it, and reference its hash from the analysis. The hash is cheap insurance against a silently swapped file.
  • Silent type coercion. r_data_types() returns the coerced data frame, not a summary of what it did. If you treat its output as a report you will miss that it changed your data. Diff source class against inferred class, as in step 3, and keep the typed object as the one you analyse.
  • Manifests not re-verified after a re-delivery. Writing a manifest once is not enough. When a new extract arrives, run verify_manifest() and compare_datasets() again. A manifest that nobody re-checks is documentation, not a guard rail.
  • PHI in the repo. Commit the manifest, the dictionary, and the code; never commit the data file itself. The whole point of the hash is to pin the extract without carrying protected health information into version control. The examples in this book stay on synthetic data for the same reason.