Cardiovascular Outcome Registries and Research (CORR)

1997 CORR

  • SAS shop
  • Goal: any analyst can pick up any job
  • Folder structure with file naming conventions
  • A social contract — loosely enforced

Today

  • R and SAS hybrid
  • Team has grown to 19 (and counting)
  • Conventions extended to R

We built methods in-house. Now we maintain them too.

Discipline doesn’t scale under production pressure

CORR Production Scale

  • 19 CORR staff
  • ~150 data requests / year
  • ~100+ presentations / year
  • ~50 manuscripts / year
  • 1,100+ HVTI publications / year

CORR Assembly Line

Stages: EHR / Registry → Data Warehouse → Analysis Dataset → Analysis Report → Manuscript

Roles: Data Engineers · Stat Programmers · Biostatisticians · Researchers (PIs / Fellows)

Diagram labels: R/Medicine focus · CORR scope

Every inefficiency here multiplies across all deliverables.

What a Biostatistician Delivers

Study (Researcher)        Analysis Report (CORR)    Manuscript (Researcher)
Research questions        Data ingestion            Introduction
Statistical approaches    Model building            Methods
Communication             Figures & Tables          Results
                                                    Discussion

“Bus insurance” — any analyst can pick up any job.

Reproducibility at scale — Three Focus Areas

  1. Data ingestion
  2. Model building
  3. Figures and tables

Make the safe path the easy path.

Data Ingestion

Make the safe path the easy path.

Problem: Data file tracking

Jan 12 · cohort.sas7bdat
    Initial extract from Data Warehouse
Feb 3, 2:22 PM · cohort_final.sas7bdat
    Exclusions applied, but which ones?
Feb 3, 4:47 PM · cohort_final2.sas7bdat
    Researcher comments: additional exclusion added
Mar 15 · cohort_final_USE_THIS.sas7bdat
    PI changed date range; modified date reset on server sync
Apr 2 (today) · cohort_final_USE_THIS_v2.sas7bdat
    What changed? Nobody remembers.

Is this the right data? Has it changed?

How we’re addressing it (In development)

Layer governance on top of existing structure.

What we have

study_root/
  analyses/      ← jobs
  datasets/      ← *this problem*
  descriptive/
  distributions/
  documents/     ← deliverables
  estimates/     ← cached results
  graphs/        ← pdf

What we’re adding

  • Date in the filename
    never overwrite
  • Dataset manifest
    source, date, parameters
  • Checksum
    detect silent changes
  • Audit trail
    job ↔︎ manifest
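
The first item above, a date in the filename that is never overwritten, could be enforced by a small helper. This is a minimal sketch in base R, assuming a `built` prefix followed by a two-digit year, month, and day; `dated_dataset_path()` is a hypothetical name, not an existing CORR function.

```r
# Hypothetical helper: build a dated dataset path and refuse to overwrite.
dated_dataset_path <- function(dir, date = Sys.Date(), prefix = "built",
                               ext = "sas7bdat") {
  stamp <- format(as.Date(date), "%y%m%d")   # 2026-04-30 -> "260430"
  path  <- file.path(dir, paste0(prefix, stamp, ".", ext))
  if (file.exists(path)) {
    stop("Refusing to overwrite existing dataset: ", path, call. = FALSE)
  }
  path
}

# Usage in a scratch directory
scratch <- tempfile("datasets_")
dir.create(scratch)
p <- dated_dataset_path(scratch, date = "2026-04-30")
basename(p)   # "built260430.sas7bdat"
```

A second build on the same day stops with an error instead of silently replacing the earlier file.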

Manifest + checksum in practice (In development)

datasets/built260430.manifest.yml

dataset:      built260430.sas7bdat
source:       Cardiopulse / STS registry v2.1
pulled_by:    data_manager
pull_date:    2026-04-30
sha256:       a3f8c2d1e4b7f093...
observations: 12847
variables:    428

Register explicitly — separate from build

digest::digest(
  "datasets/built260430.sas7bdat",
  algo = "sha256", file = TRUE)
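
The registration step could also write the manifest file itself. The sketch below assumes the digest and yaml packages are installed; `register_dataset()` is a hypothetical name, and only the fields that can be filled without reading the SAS file are included (observation and variable counts are left out).

```r
# Hypothetical registration step: hash the file and write a manifest next to it.
# Assumes the digest and yaml packages are installed.
register_dataset <- function(data_path, source, pulled_by,
                             pull_date = Sys.Date()) {
  manifest <- list(
    dataset   = basename(data_path),
    source    = source,
    pulled_by = pulled_by,
    pull_date = format(as.Date(pull_date)),
    sha256    = digest::digest(data_path, algo = "sha256", file = TRUE)
  )
  # built260430.sas7bdat -> built260430.manifest.yml
  manifest_path <- paste0(tools::file_path_sans_ext(data_path), ".manifest.yml")
  yaml::write_yaml(manifest, manifest_path)
  invisible(manifest_path)
}

# Usage with stand-in bytes (not a real SAS dataset)
demo_file <- file.path(tempdir(), "built260430.sas7bdat")
writeBin(as.raw(1:10), demo_file)
m <- register_dataset(demo_file, source = "demo source",
                      pulled_by = "data_manager", pull_date = "2026-04-30")
yaml::read_yaml(m)$dataset
```

Keeping registration as its own call, rather than folding it into the build script, is what makes an unregistered rebuild detectable.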

Verify before analysis

stored  <- yaml::read_yaml(
  "datasets/built260430.manifest.yml")$sha256
current <- digest::digest(
  "datasets/built260430.sas7bdat",
  algo = "sha256", file = TRUE)
stopifnot(current == stored)

Filename says when. Checksum says whether.

Rebuild without re-registering → every downstream job fails.
That failure is the feature.
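
One way to make that failure informative, and to record the job ↔ manifest link at the same time, is a small wrapper around the check. This is a sketch only: `verify_dataset()` and the tab-separated audit-log format are assumptions, and the digest and yaml packages are assumed to be installed.

```r
# Hypothetical wrapper: verify the checksum, then log which job used which manifest.
verify_dataset <- function(data_path, manifest_path, job,
                           audit_log = "datasets/audit.log") {
  stored  <- yaml::read_yaml(manifest_path)$sha256
  current <- digest::digest(data_path, algo = "sha256", file = TRUE)
  if (!isTRUE(current == stored)) {
    stop("Checksum mismatch for ", data_path,
         ": rebuilt or modified since registration.", call. = FALSE)
  }
  # Audit trail: one line per verified use
  entry <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), job,
                 basename(manifest_path), current, sep = "\t")
  cat(entry, "\n", sep = "", file = audit_log, append = TRUE)
  invisible(TRUE)
}

# Demo with stand-in files in a temporary location
d <- tempfile(fileext = ".sas7bdat")
writeBin(as.raw(1:10), d)
m <- tempfile(fileext = ".manifest.yml")
yaml::write_yaml(
  list(sha256 = digest::digest(d, algo = "sha256", file = TRUE)), m)
log_file <- tempfile(fileext = ".log")

ok <- verify_dataset(d, m, job = "demo_job", audit_log = log_file)

writeBin(as.raw(11:20), d)   # simulate a rebuild without re-registering
res <- try(verify_dataset(d, m, job = "demo_job", audit_log = log_file),
           silent = TRUE)
```

The second call stops with the mismatch message, and the audit log holds only the one verified use.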

Model Building

Make the safe path the easy path.

Problem: R and SAS Environments

SAS

  • SAS version not locked to a study
  • Shared Macro library unversioned
  • Shared server

R & RStudio

  • R version is not locked to a study
  • Shared Package library — package versions drift
  • Shared server
  • Serialized models in estimates/ fail to load
  • Results don’t match — or the job won’t run

Can you reproduce this analysis six months from now?

R: renv with RStudio projects (In flight)

Without renv

  • Packages installed globally, shared across all projects
  • update.packages() breaks other analyses silently
  • Serialized models in estimates/ fail to load 6 months later
  • “Works on my machine”, but not yours
  • New analyst hunts down exact versions to reproduce results

With renv

Day 1: any analyst, any machine. Isolate the project library and freeze it.

renv::init()       # isolate library
renv::install("ehrlinger/hvtiPlotR")
renv::snapshot()   # freeze versions

renv.lock: R 4.4.1 · hvtiPlotR 0.3.2 · …
Commit this; it travels with the project.

6 months later: different analyst, different machine. Open .Rproj, run one command.

renv::restore()    # rebuilds library from renv.lock

Same packages. Same results. The job belongs to the team.
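
renv.lock is a JSON file that records the R version alongside the package versions, so a job can compare it with the running session before doing any work. The sketch below assumes the jsonlite package is installed; `check_r_version()` is a hypothetical helper and not part of renv.

```r
# Hypothetical guard: compare the R version recorded in renv.lock
# with the version of the running session.
check_r_version <- function(lockfile = "renv.lock") {
  locked  <- jsonlite::fromJSON(lockfile)$R$Version
  running <- as.character(getRversion())
  same    <- identical(locked, running)
  if (!same) {
    warning("renv.lock records R ", locked,
            " but this session runs R ", running, call. = FALSE)
  }
  invisible(same)
}

# Demo with a stand-in lockfile that matches the current session
lock <- tempfile(fileext = ".lock")
writeLines(sprintf('{"R": {"Version": "%s"}, "Packages": {}}',
                   as.character(getRversion())), lock)
same <- check_r_version(lock)
```

A warning rather than an error is a design choice here; a team that wants the R version locked to a study could call `stop()` instead.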

Problem: Shared function libraries aren’t versioned

SAS macro library

  • Centralized — everyone calls the same macros
  • Not versioned — no record of which version ran
  • Changes propagate silently

R functions

  • Copied at study creation; copies diverge immediately
  • Fixes stay local; no propagation
  • In-flight studies never get updates

SAS: did that update break something? R: did anyone else get that fix?

Versioned internal packages (In development)

An R package is a versioned, distributable library.

Functions in the study folder

  • Frozen at copy
  • Fixes don’t propagate
  • No version record

Code in a package

  • One canonical version
  • Pull fixes deliberately with renv::update()
  • Examples travel with code, double as tests

The hard problems shift — they don’t disappear.

  • Someone must own releases.
  • Improvements must be communicated — they don’t propagate automatically.
  • Studies must change versions deliberately — controlled, not automatic.
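
A job can also state which package version it expects, so a deliberate version change shows up in the code. This is a minimal base R sketch; `require_version()` is a hypothetical helper, and `stats` stands in for an internal package.

```r
# Hypothetical guard: fail early if a package is older than the version
# this study expects.
require_version <- function(pkg, min_version) {
  installed <- utils::packageVersion(pkg)
  if (installed < min_version) {
    stop(pkg, " ", as.character(installed),
         " is installed; this job needs >= ", min_version, call. = FALSE)
  }
  invisible(installed)
}

# Usage: 'stats' ships with R, so this passes on any current installation
v <- require_version("stats", "3.0.0")
```

With renv pinning the exact version in renv.lock, a guard like this mainly documents the expectation next to the analysis code.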

Problem: Code and results live separately

SAS — partially solved this

  • .lst pairs code and output
  • Figures still separate files

R scripts — no equivalent

  • Code and output in separate files
  • No standard structure
  • Figures disconnected from code

Can a new analyst navigate this job without asking whoever wrote it?

Quarto templates (In development)

report.qmd = job.sas

  • Code and output in one document
  • Figures inline
  • Renders to HTML, PDF, Word

Always render before committing.

Standard 4-section template

# Data.qmd
Ingestion, exclusions, cohort description

# Methods.qmd
Model specification and assumptions

# Explainability.qmd
Diagnostics, sensitivity checks

# Results.qmd
Tables, figures, interpretation

The structure is the contract.
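
The four-section layout could be created by a scaffold function so that every job starts from the same files. This is a sketch in base R; `scaffold_report()` and the file contents are assumptions, not the CORR template itself.

```r
# Hypothetical scaffold: create the four section files in a job folder.
scaffold_report <- function(job_dir, title = "Untitled analysis") {
  sections <- c(
    Data           = "Ingestion, exclusions, cohort description",
    Methods        = "Model specification and assumptions",
    Explainability = "Diagnostics, sensitivity checks",
    Results        = "Tables, figures, interpretation"
  )
  dir.create(job_dir, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(job_dir, paste0(names(sections), ".qmd"))
  for (i in seq_along(sections)) {
    if (file.exists(paths[i])) next   # never clobber existing analyst work
    writeLines(c(
      "---",
      paste0("title: \"", title, ": ", names(sections)[i], "\""),
      "---",
      "",
      paste0("<!-- ", sections[[i]], " -->")
    ), paths[i])
  }
  invisible(paths)
}

# Usage in a scratch job folder
files <- scaffold_report(tempfile("job_"), title = "Demo study")
basename(files)
```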

Figures and Tables

Make the safe path the easy path.

Problem: Publication-quality graphics

SAS: plot.sas

  • Hard-won publication quality
  • Each output target: more manual effort

R: ggplot2

  • Same per-target effort problem
  • No shared themes or palettes

Can we capture that effort once — make figures consistent?

R: graphics pipeline (In flight)

The package does the hard work. The analyst decorates.

Package owns

  • The statistical transform
  • The base figure — correct by construction
  • Themes: manuscript, PowerPoint, poster

Analyst owns

  • Axis labels, annotations
  • Nothing else needs to change

# Package handles the figure
fig <- plot(corr_analysis(data))

# Analyst decorates to context
fig +
  hv_theme("manuscript") +
  scale_color_brewer(palette = "Set1") +
  annotate("text", ...)

If it looks like the package example, it’s correct.
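
To illustrate how one theme function can serve several output targets, here is a minimal sketch using ggplot2 only. `sketch_theme()` and its font sizes are assumptions made for illustration; this is not how `hv_theme()` is implemented in hvtiPlotR.

```r
library(ggplot2)

# Hypothetical stand-in for a package theme: one function, several targets.
sketch_theme <- function(target = c("manuscript", "powerpoint", "poster")) {
  target    <- match.arg(target)
  base_size <- switch(target, manuscript = 10, powerpoint = 18, poster = 28)
  theme_classic(base_size = base_size) +
    theme(legend.position = "bottom")
}

# The base figure is built once
fig <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lb)", y = "Miles per gallon", colour = "Cylinders")

# The same figure, restyled per target
fig_paper <- fig + sketch_theme("manuscript")
fig_slide <- fig + sketch_theme("powerpoint")
```

Because `match.arg()` rejects unknown targets, a typo in the target name fails immediately rather than producing an unstyled figure.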

Our visualization packages (In flight)

ggRandomForests: github.com/ehrlinger/ggRandomForests

Variable importance, partial dependence, minimal depth: ggplot2 for randomForestSRC.


hvtiPlotR: github.com/ehrlinger/hvtiPlotR

Survival curves, hazard plots, forest plots in CORR themes.

Bridges ggplot2 to plot.sas standards.


As the science keeps moving and CORR evolves, so do the packages.

Survival curves — ggRandomForests

Survival difference between two treatment groups (blue, red)
randomForestSRC fits a survival forest; ggRandomForests plots the results.

Problem: Publication-quality Tables

Right now, this is all SAS.

gt and gtsummary exist — they aren’t tamed to our standards yet.

That’s next.

Honestly

Item                                        Status

Data Ingestion
  Study folder structure                    Adopted (standard)
  Dataset naming + manifests + checksums    In development

Model Building
  RStudio Projects + renv                   In flight (“easy” win)
  Quarto templates                          In development
  hazard (SAS), randomForestSRC (R)         Adopted
  TemporalHazard (R)                        In development

Figures & Tables
  ggRandomForests, hvtiPlotR                In flight
  hvtiRtables                               Aspirational (coming “soon”)

The framing shift that mattered

“Reproducibility” → “Production reliability”

“Any biostatistician can pick up any job.”

Make the safe path the easy path.

Resources

https://github.com/ehrlinger

Package            Link
hazard (SAS)       github.com/ehrlinger/hazard
TemporalHazard     github.com/ehrlinger/temporal_hazard
randomForestSRC    cran.r-project.org/package=randomForestSRC
ggRandomForests    github.com/ehrlinger/ggRandomForests
hvtiPlotR          github.com/ehrlinger/hvtiPlotR
hvtiRutilities     github.com/ehrlinger/hvtiRutilities

These slides: github.com/ehrlinger/CareFeedingBiostats

Contact: ehrlinj@ccf.org

We’re hiring a biostatistician — Contact me!