4 Data governance and reproducibility

Every other chapter in this book ends in a figure. This one comes first in practice, because reproducible analysis starts before the first plot. A documented dataset, typed and labelled variables, and a manifest pinning the exact extract are the difference between a figure you can defend in review and one you cannot. The hvtiRutilities (Ehrlinger 2026) package supplies the primitives for that work: synthetic data for examples, a data dictionary, label and type maps, dataset manifests, and version diffs.

4.1 When to use it

Reach for this workflow at the start of any analysis that will end up in front of a reviewer, a co-author, or your future self. The moment to run it is the moment a data extract lands: before you fit a model, before you draw a curve, while you still remember which warehouse pull this was and what date it covered. The payoff comes later, when someone asks “which version of the data is this figure built on” and you can answer with a hash instead of a shrug, or when a re-delivery arrives and you need to know in seconds whether anything moved.

None of this requires touching protected health information. Everything below runs on synthetic data, and the same helpers run unchanged against your real extract once the pipeline is wired up. Think of it as the paperwork that makes the science reproducible, kept light enough that you will actually do it.

4.2 The data it needs

The workflow needs a clinical data frame, one row per patient, with whatever columns the study carries. For the examples we generate a synthetic survival cohort so the chapter runs end to end without any real patient data. generate_survival_data() is deterministic: fix the seed and every table and figure downstream is reproducible.

library(hvtiRutilities)  # provides generate_survival_data() and the governance helpers

dat <- generate_survival_data(n = 200, seed = 42)
str(dat)

'data.frame':   200 obs. of  24 variables:
 $ ccfid       : chr  "PT00001" "PT00002" "PT00003" "PT00004" ...
  ..- attr(*, "label")= chr "Patient ID"
 $ origin_year : int  2018 2006 1999 2014 2009 2011 1999 2000 2017 2015 ...
  ..- attr(*, "label")= chr "Calendar year for iv_opyrs = 0"
 $ iv_opyrs    : num  3.77 3.99 8.64 13.54 2.28 ...
  ..- attr(*, "label")= chr "Observation interval (years) since origin_year"
 $ iv_dead     : num  3.77 2.76 8.64 9.16 2.28 3.72 0.77 5.65 1.09 5.97 ...
  ..- attr(*, "label")= chr "Follow-up time to death (years)"
 $ dead        : int  0 1 0 1 0 0 1 0 0 1 ...
  ..- attr(*, "label")= chr "Death indicator (1=dead, 0=censored)"
 $ reop        : int  0 1 0 0 0 1 1 0 0 0 ...
  ..- attr(*, "label")= chr "Reoperation (1=yes, 0=no)"
 $ iv_reop     : num  NA 1.81 NA NA NA 1.29 0.72 NA NA NA ...
  ..- attr(*, "label")= chr "Follow-up time to reoperation (years)"
 $ age         : num  65.6 36.5 50.4 54.5 51.1 43.4 67.7 43.6 75.3 44.1 ...
  ..- attr(*, "label")= chr "Age at surgery (years)"
 $ sex         : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
  ..- attr(*, "label")= chr "Sex"
 $ bmi         : num  27 30.8 27.2 30.7 26.3 26.7 29.4 32 20.8 26.8 ...
  ..- attr(*, "label")= chr "Body mass index (kg/m2)"
 $ hgb_bs      : num  14.9 14.6 13 13.2 11.7 12.6 11.1 11.3 10.8 14.5 ...
  ..- attr(*, "label")= chr "Baseline hemoglobin (g/dL)"
 $ wbc_bs      : num  8.24 8.48 5 6.69 4.98 5.91 4.48 4.71 9.07 6.82 ...
  ..- attr(*, "label")= chr "Baseline WBC count (K/uL)"
 $ plate_bs    : num  281 204 226 192 362 50 225 227 234 119 ...
  ..- attr(*, "label")= chr "Baseline platelet count (K/uL)"
 $ gfr_bs      : num  62 54.9 64.3 72.8 87 83.3 72.9 57.9 97.5 93.3 ...
  ..- attr(*, "label")= chr "Baseline eGFR (mL/min/1.73m2)"
 $ lvefvs_b    : num  43.1 48.4 65.9 60.1 53.6 53.9 62.5 52.8 55.7 38.5 ...
  ..- attr(*, "label")= chr "Baseline LV ejection fraction (%)"
 $ lvmass_b    : num  147 172 126 226 136 ...
  ..- attr(*, "label")= chr "Baseline LV mass (g)"
 $ lvmsi_b     : num  40 40 40 40 40 40 40 40 40 40 ...
  ..- attr(*, "label")= chr "Baseline LV mass index (g/m2)"
 $ stvoli_b    : num  76.7 53.7 31.4 74.4 56.3 69.5 84.1 71.2 49.7 53 ...
  ..- attr(*, "label")= chr "Baseline SV index - systolic (mL/m2)"
 $ stvold_b    : num  97.4 131.6 119.8 130.4 61.6 ...
  ..- attr(*, "label")= chr "Baseline SV index - diastolic (mL/m2)"
 $ bypass_time : num  72 118 88 62 71 73 96 106 77 80 ...
  ..- attr(*, "label")= chr "Cardiopulmonary bypass time (min)"
 $ xclamp_time : num  50 76 64 34 40 47 62 69 41 56 ...
  ..- attr(*, "label")= chr "Aortic cross-clamp time (min)"
 $ nyha_class  : Ord.factor w/ 4 levels "I"<"II"<"III"<..: 4 2 2 3 3 3 2 2 3 3 ...
  ..- attr(*, "label")= chr "NYHA functional class"
 $ diabetes    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 2 1 1 ...
  ..- attr(*, "label")= chr "Diabetes mellitus"
 $ hypertension: Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 1 2 1 ...
  ..- attr(*, "label")= chr "Hypertension"

Notice that each column carries a label attribute. That label is not decoration. The governance helpers below read it and carry it forward into dictionaries, tables, and figure axes, so a variable described once stays described everywhere.
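As a minimal sketch of the mechanism, assuming labels are stored as a plain "label" attribute the way the str() output above shows, base R can set and read them directly. The toy data frame and its values are invented for illustration.

```r
# Toy data frame; the column names mirror the cohort, the values are made up.
toy <- data.frame(age = c(65.6, 36.5, 50.4), dead = c(0L, 1L, 0L))

# Attach a label to each column as a plain attribute.
attr(toy$age,  "label") <- "Age at surgery (years)"
attr(toy$dead, "label") <- "Death indicator (1=dead, 0=censored)"

# Collect every label into a named character vector, falling back to the
# column name when a column carries no label.
labels <- vapply(
  names(toy),
  function(nm) {
    lab <- attr(toy[[nm]], "label")
    if (is.null(lab)) nm else lab
  },
  character(1)
)
print(labels)
```

Plain attributes are fragile: subsetting an atomic vector with `[` drops them, which is one reason to keep the labels in a separate map as well.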

4.3 Build it

The governance workflow is a short sequence of steps. Each one produces an artifact you can save next to the analysis: a dictionary, a label map, a typing check, a manifest, and a comparison report. We walk them in order.

4.3.1 Step 1: a data dictionary

A data dictionary is the human-readable index of an extract. data_dictionary() returns a per-column summary as a data frame, one row per variable, with the variable name, its label, its class, the count of distinct values, the percent missing, and a compact distribution summary. It renders straight through gt() into a table you can drop into a supplement.

library(gt)  # gt() and tab_header() come from the gt package

dd <- data_dictionary(dat)
gt(dd) |>
  tab_header(title = "Data dictionary")

Data dictionary
variable	label	class	n_unique	pct_missing	summary
ccfid	Patient ID	character	200	0.0	200 levels: PT00001, PT00002, PT00003, PT00004, PT00005, ...
origin_year	Calendar year for iv_opyrs = 0	integer	21	0.0	1998 / 2008 / 2018
iv_opyrs	Observation interval (years) since origin_year	numeric	183	0.0	1.06 / 7.87 / 14.99
iv_dead	Follow-up time to death (years)	numeric	184	0.0	0.25 / 4.1 / 13.98
dead	Death indicator (1=dead, 0=censored)	integer	2	0.0	0 / 1 / 1
reop	Reoperation (1=yes, 0=no)	integer	2	0.0	0 / 0 / 1
iv_reop	Follow-up time to reoperation (years)	numeric	31	83.5	0.04 / 1.29 / 9.92
age	Age at surgery (years)	numeric	165	0.0	1 / 44.75 / 85
sex	Sex	factor	2	0.0	2 levels: Female, Male
bmi	Body mass index (kg/m2)	numeric	123	0.0	15 / 26.65 / 41.8
hgb_bs	Baseline hemoglobin (g/dL)	numeric	66	0.0	7.6 / 13 / 18
wbc_bs	Baseline WBC count (K/uL)	numeric	172	0.0	1.5 / 7.35 / 15.53
plate_bs	Baseline platelet count (K/uL)	numeric	135	0.0	50 / 225 / 447
gfr_bs	Baseline eGFR (mL/min/1.73m2)	numeric	174	0.0	25.9 / 75.75 / 120
lvefvs_b	Baseline LV ejection fraction (%)	numeric	149	0.0	29.1 / 53.75 / 75
lvmass_b	Baseline LV mass (g)	numeric	191	0.0	60 / 184.4 / 377.2
lvmsi_b	Baseline LV mass index (g/m2)	numeric	1	0.0	40 / 40 / 40
stvoli_b	Baseline SV index - systolic (mL/m2)	numeric	168	0.0	20.5 / 53.7 / 90.3
stvold_b	Baseline SV index - diastolic (mL/m2)	numeric	171	0.0	40 / 93.35 / 153.6
bypass_time	Cardiopulmonary bypass time (min)	numeric	100	0.0	20 / 88.5 / 194
xclamp_time	Aortic cross-clamp time (min)	numeric	75	0.0	13 / 57 / 118
nyha_class	NYHA functional class	ordered	4	0.0	4 levels: I, II, III, IV
diabetes	Diabetes mellitus	factor	2	0.0	2 levels: No, Yes
hypertension	Hypertension	factor	2	0.0	2 levels: No, Yes
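To make the columns of that table concrete, here is a rough base R sketch of how a per-column summary of this shape could be assembled. sketch_dictionary() is a hypothetical helper, not the package's implementation. It omits the distribution summary, and it counts distinct values after dropping NA, which may or may not match what data_dictionary() does.

```r
# Hypothetical helper, not the package's implementation: one row per column.
sketch_dictionary <- function(df) {
  data.frame(
    variable = names(df),
    label = vapply(df, function(x) {
      lab <- attr(x, "label")
      if (is.null(lab)) NA_character_ else lab
    }, character(1)),
    class = vapply(df, function(x) class(x)[1], character(1)),
    # Distinct non-missing values (an assumption about how NA is treated).
    n_unique = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    pct_missing = vapply(df, function(x) round(100 * mean(is.na(x)), 1), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input: four invented rows shaped like the reoperation columns.
toy <- data.frame(
  reop    = c(0L, 1L, 0L, 0L),
  iv_reop = c(NA, 1.81, NA, NA)
)
attr(toy$iv_reop, "label") <- "Follow-up time to reoperation (years)"

dd_toy <- sketch_dictionary(toy)
print(dd_toy)
```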

4.3.2 Step 2: variable labels in one place

label_map() returns the current variable-to-label mapping as a two-column (key, label) data frame. Holding labels in one place is what keeps an axis title in a figure and a column header in a table reading the same way. Change a label here and it propagates everywhere the map is read, so you never hand-edit “Ejection Fraction” in five places and miss the sixth.

lm <- label_map(dat)
gt(head(lm, 10))

key	label
ccfid	Patient ID
origin_year	Calendar year for iv_opyrs = 0
iv_opyrs	Observation interval (years) since origin_year
iv_dead	Follow-up time to death (years)
dead	Death indicator (1=dead, 0=censored)
reop	Reoperation (1=yes, 0=no)
iv_reop	Follow-up time to reoperation (years)
age	Age at surgery (years)
sex	Sex
bmi	Body mass index (kg/m2)
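One way to put a map like this to work is a small lookup that turns a column name into its display label. The sketch below assumes only the two-column (key, label) shape shown above. label_for() is a hypothetical helper, and the two rows are copied from the table.

```r
# A two-column map shaped like the label_map() output (rows copied from above).
lm_toy <- data.frame(
  key   = c("age", "bmi"),
  label = c("Age at surgery (years)", "Body mass index (kg/m2)"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: return the label for each key, or the key itself
# when the map has no entry for it.
label_for <- function(key, map) {
  hit <- map$label[match(key, map$key)]
  ifelse(is.na(hit), key, hit)
}

axis_titles <- label_for(c("age", "bmi", "unknown_var"), lm_toy)
print(axis_titles)
```

Passing label_for("age", lm_toy) as an axis title and as a table header is what keeps the two reading the same way.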

4.3.3 Step 3: infer and check R data types

Raw extracts arrive with everything as character or numeric, and the wrong type quietly breaks an analysis: an ID read as a number gets averaged, a 0/1 flag read as a count gets summed. r_data_types() classifies each column by role and returns the dataset with columns coerced to the inferred type, so character IDs become factors and 0/1 indicators become logical. Read that sentence twice: it returns the coerced data, not a report about it. To see what changed, compare the source class against the inferred class yourself.

typed <- r_data_types(dat)
data.frame(
  variable = names(dat),
  source   = vapply(dat,   \(x) class(x)[1], character(1)),
  inferred = vapply(typed, \(x) class(x)[1], character(1)),
  row.names = NULL
) |>
  subset(source != inferred) |>
  gt() |>
  tab_header(title = "Columns retyped by r_data_types()")

Columns retyped by r_data_types()
variable	source	inferred
ccfid	character	factor
dead	integer	logical
reop	integer	logical
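For intuition about what role-based retyping involves, the sketch below applies two hypothetical rules chosen only to reproduce the three changes in the table above: character columns become factors, and numeric columns holding only 0 and 1 become logical. The package's actual inference rules are not shown in this chapter and may differ.

```r
# Hypothetical rules for illustration only; not the package's logic.
sketch_infer <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (is.character(x)) {
    factor(x)
  } else if (is.numeric(x) && length(vals) > 0 && all(vals %in% c(0, 1))) {
    as.logical(x)
  } else {
    x
  }
}

# Toy input with invented values.
toy <- data.frame(
  ccfid = c("PT00001", "PT00002", "PT00003"),
  dead  = c(0L, 1L, 0L),
  age   = c(65.6, 36.5, 50.4),
  stringsAsFactors = FALSE
)

# Apply the rule column by column, keeping the data frame structure.
typed_toy <- toy
typed_toy[] <- lapply(toy, sketch_infer)

# Before/after classes, one row per column.
changes <- data.frame(
  variable = names(toy),
  source   = vapply(toy, function(x) class(x)[1], character(1)),
  inferred = vapply(typed_toy, function(x) class(x)[1], character(1)),
  row.names = NULL
)
print(changes)
```

A hand-rolled version like this has a cost worth knowing about: as.logical() strips attributes, so the label on a retyped column is lost unless you re-attach it.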

4.3.4 Step 4: pin the extract with a manifest

A dictionary documents what the data is; a manifest documents which data it is. A manifest records provenance for every extract: the source system, the extract date, the row count, and a SHA-256 hash of the file. update_manifest() appends an entry, hashing the file on disk, and verify_manifest() re-hashes the data directory against the manifest to confirm nothing has drifted since you wrote it down. The chunk below is not evaluated because it writes files to disk; run it interactively, then commit the manifest (never the data) to the repo.

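# Save the extract under the file name the manifest will record.
saveRDS(dat, "cohort.rds")

# A tempfile keeps this example self-cleaning; in a project, point
# manifest_path at a file inside the repo so the manifest can be committed.
mp <- tempfile(fileext = ".yaml")

# Append an entry for cohort.rds, hashing the file on disk.
update_manifest(
  file          = "cohort.rds",
  manifest_path = mp,
  extract_date  = Sys.Date(),
  n_rows        = nrow(dat),
  source        = "CORR DWH"
)

# Re-hash the files in data_dir and compare them against the manifest.
verify_manifest(manifest_path = mp, data_dir = ".", stop_on_error = FALSE)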

A representative manifest entry looks like this. The hash is the load-bearing field: it is what lets verify_manifest() catch a file that changed under you.

datasets:
- file: cohort.rds
  extract_date: '2026-05-28'
  n_rows: 200
  sha256: 8b3e0188eae294e61a16f0e09e3989e8b5249601209bde987fb5153c9ab1bd76
  source: CORR DWH
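The idea behind that check can be sketched in a few lines: hash the file when you record it, hash it again later, and compare. This sketch assumes the digest package is installed and uses its digest() function. How the package computes its own hash is not shown here, and the file name and data are invented.

```r
# Assumes the digest package is available; the package's own hashing code
# may differ.
library(digest)

path <- file.path(tempdir(), "cohort_sketch.rds")
saveRDS(data.frame(id = 1:3, dead = c(0L, 1L, 0L)), path)

# Hash the file's bytes on disk (file = TRUE), not the R object in memory.
recorded <- digest(path, algo = "sha256", file = TRUE)

# Simulate a silent change: overwrite the file with one value flipped.
saveRDS(data.frame(id = 1:3, dead = c(0L, 1L, 1L)), path)
current <- digest(path, algo = "sha256", file = TRUE)

# FALSE here means the file no longer matches what was recorded.
print(identical(recorded, current))
```

A file hash pins bytes, not meaning: re-saving identical data with different compression settings changes the bytes and therefore the hash, so treat a mismatch as a prompt to investigate, not as proof that the values changed.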

4.3.5 Step 5: compare versions on re-delivery

Data warehouses re-deliver extracts, and a re-delivery is rarely a no-op: rows get added, a column is renamed, a coding changes. compare_datasets() diffs the old and new versions and reports row-count and structural changes through its print method. It is the fast first check that a fresh pull is what you asked for before you rerun anything on it.

old <- generate_survival_data(n = 150, seed = 42)
new <- generate_survival_data(n = 200, seed = 99)
compare_datasets(old, new)

Dataset Comparison
  Rows: 150 -> 200
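The output above shows only a row-count change because both versions share the same columns. To illustrate what a structural diff has to look for, here is a hypothetical base R helper that lists columns added, dropped, or retyped between two data frames. It is a sketch of the idea, not the logic inside compare_datasets().

```r
# Hypothetical helper: summarise row counts and structural differences.
sketch_compare <- function(old, new) {
  shared <- intersect(names(old), names(new))
  cls <- function(df, cols) {
    vapply(df[cols], function(x) class(x)[1], character(1))
  }
  list(
    rows    = c(old = nrow(old), new = nrow(new)),
    added   = setdiff(names(new), names(old)),
    dropped = setdiff(names(old), names(new)),
    retyped = shared[cls(old, shared) != cls(new, shared)]
  )
}

# Toy versions with invented values: age is dropped, bmi is added, and
# dead arrives as character instead of integer.
old_toy <- data.frame(id = 1:3, dead = c(0L, 1L, 0L), age = c(65.6, 36.5, 50.4))
new_toy <- data.frame(
  id   = 1:4,
  dead = c("0", "1", "0", "1"),
  bmi  = c(27, 30.8, 27.2, 30.7),
  stringsAsFactors = FALSE
)

res <- sketch_compare(old_toy, new_toy)
str(res)
```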

4.4 Read it

Two of these artifacts are reports you read, not just files you store.

A data dictionary is read column by column, and the two columns that earn the most attention are percent-missing and distinct values. High missingness flags a variable you cannot lean on without saying so. A distinct-value count that surprises you (two levels where you expected a continuous measure, a thousand where you expected a category) usually means a typing or coding problem upstream, caught here before it reaches a model. The distribution summary is your sanity check on range: a negative age or an ejection fraction above 100 shows up immediately.
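That reading can be partly mechanised. The sketch below takes three rows copied from the dictionary in step 1 and flags high missingness and constant columns. The 20% cut-off is an arbitrary illustration, not a recommended threshold.

```r
# Three rows copied from the data dictionary above.
dd_toy <- data.frame(
  variable    = c("iv_reop", "age", "lvmsi_b"),
  n_unique    = c(31L, 165L, 1L),
  pct_missing = c(83.5, 0.0, 0.0),
  stringsAsFactors = FALSE
)

# Flag rows worth a second look; the 20% threshold is only an example.
flags <- data.frame(
  variable     = dd_toy$variable,
  high_missing = dd_toy$pct_missing > 20,
  constant     = dd_toy$n_unique <= 1,
  stringsAsFactors = FALSE
)
flagged <- flags[flags$high_missing | flags$constant, ]
print(flagged)
```

Both flags need judgement. In the str() output above, iv_reop is NA wherever reop is 0, so its missingness looks structural. lvmsi_b holds the single value 40 in every row, which is the kind of surprise worth tracing upstream.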

A dataset-comparison report is read for the diff, not the totals. A change in row count is expected on re-delivery and is fine if it matches what the warehouse told you to expect. The lines that should make you stop are structural ones: a column that appeared, disappeared, or changed type. Those silently break joins and recodes downstream, so confirm each one is intentional before you build anything on the new extract.

4.5 Pitfalls

Unpinned extracts. A figure built on “the data I pulled sometime last month” is not reproducible. Write a manifest the day the extract lands, commit it, and reference its hash from the analysis. The hash is cheap insurance against a silently swapped file.
Silent type coercion. r_data_types() returns the coerced data frame, not a summary of what it did. If you treat its output as a report you will miss that it changed your data. Diff source class against inferred class, as in step 3, and keep the typed object as the one you analyse.
Manifests not re-verified after a re-delivery. Writing a manifest once is not enough. When a new extract arrives, run verify_manifest() and compare_datasets() again. A manifest that nobody re-checks is documentation, not a guard rail.
PHI in the repo. Commit the manifest, the dictionary, and the code; never commit the data file itself. The whole point of the hash is to pin the extract without carrying protected health information into version control. The examples in this book stay on synthetic data for the same reason.