
The Problem: Datasets Drift

A dataset pulled from Epic today is not the same file you will get if you pull it again tomorrow — even when no one intentionally changed anything. Backfilled lab results, corrected ICD codes, and updated patient statuses all mean that re-running an extraction can silently change the underlying numbers.

The standard defence is a three-step protocol:

  1. Name every dataset by its extract date: cohort_20240115.csv, labs_20240801.sas7bdat. Never overwrite an existing dated file; a new pull gets a new filename.
  2. Record a SHA-256 checksum the moment a file lands on disk. A single changed byte produces a completely different hash.
  3. Verify the checksums before every analysis run so that any drift is caught immediately, before results are generated.
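Steps 1 and 2 can be sketched with base R plus the digest package (a sketch of the idea only — hvtiRutilities may use a different SHA-256 implementation internally; `digest()` with `file = TRUE` hashes the raw bytes on disk):

```r
library(digest)

# Step 1: build a date-stamped filename; a new pull gets a new name
extract_date <- as.Date("2024-01-15")
fname <- sprintf("cohort_%s.csv", format(extract_date, "%Y%m%d"))
path  <- file.path(tempdir(), fname)
writeLines(c("id,age", "1,60"), path)

# Step 2: record the SHA-256 the moment the file lands on disk
h1 <- digest(path, algo = "sha256", file = TRUE)

# A single changed byte later produces a completely different hash
writeLines(c("id,age", "1,61"), path)
h2 <- digest(path, algo = "sha256", file = TRUE)
identical(h1, h2)
#> [1] FALSE
```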

hvtiRutilities provides two functions that implement this protocol:

Function           Purpose
update_manifest()  Register a dataset (or update its entry) in manifest.yaml
verify_manifest()  Check every registered dataset before an analysis runs

Setup

if (requireNamespace("hvtiRutilities", quietly = TRUE)) {
  library("hvtiRutilities")
} else {
  pkgload::load_all(export_all = FALSE, helpers = FALSE, quiet = TRUE)
}
#> 
#>  hvtiRutilities 1.0.0.9004 
#>  
#>  Type hvtiRutilities.news() to see new features, changes, and bug fixes. 
#> 

For this vignette we work in a temporary directory that mimics a real project’s datasets/ folder.

datasets_dir <- file.path(tempdir(), "datasets")
dir.create(datasets_dir, showWarnings = FALSE)
manifest_path <- file.path(tempdir(), "manifest.yaml")

Registering Datasets

CSV — automatic row count

Generate a synthetic cohort (using generate_survival_data()) and write it as a date-stamped CSV.

set.seed(42)
cohort <- generate_survival_data(n = 200, seed = 42)

cohort_file <- file.path(datasets_dir, "cohort_20240115.csv")
write.csv(cohort, cohort_file, row.names = FALSE)

Register it in the manifest. The row count is detected automatically by reading the CSV; the SHA-256 is computed from the raw file bytes.

update_manifest(
  file          = cohort_file,
  manifest_path = manifest_path,
  extract_date = "2024-01-15",
  source       = "Epic EMR, query v4.2, ICD mapping v3.2",
  sort_key     = "ccfid"
)
#> Manifest entry added: cohort_20240115.csv
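The automatic CSV row count can be reproduced with base R alone. This is a sketch of the idea, not necessarily what update_manifest() does internally, and it assumes no embedded newlines inside quoted fields:

```r
f <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, x = 6:10), f, row.names = FALSE)

# One data row per line after the header
n_rows <- length(readLines(f)) - 1L
n_rows
#> [1] 5
```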

Add a second file — a simulated labs extract pulled on the same date.

set.seed(7)
labs <- data.frame(
  ccfid   = cohort$ccfid,
  hgb     = round(rnorm(200, mean = 13.5, sd = 1.8), 1),
  egfr    = round(rnorm(200, mean = 72,   sd = 18),   0),
  bnp     = round(exp(rnorm(200, mean = 4.8, sd = 0.9)), 0)
)

labs_file <- file.path(datasets_dir, "labs_20240115.csv")
write.csv(labs, labs_file, row.names = FALSE)

update_manifest(
  file          = labs_file,
  manifest_path = manifest_path,
  extract_date  = "2024-01-15",
  source        = "Epic EMR, labs module — Hgb, eGFR, BNP",
  sort_key      = "ccfid"
)
#> Manifest entry added: labs_20240115.csv

Inspecting the manifest

The manifest is a plain YAML file committed to version control alongside your analysis scripts (but not alongside the data files, which stay out of git).
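In practice that split is a single .gitignore entry (assuming the layout used in this vignette, with extracts under datasets/ and manifest.yaml at the project root):

```
# .gitignore: keep raw extracts out of version control; the manifest is tracked
datasets/
```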

cat(paste(readLines(manifest_path), collapse = "\n"))
#> datasets:
#> - file: cohort_20240115.csv
#>   extract_date: '2024-01-15'
#>   n_rows: 200
#>   sha256: f3db9dd8a47765003a2509c54068f8736b0fd8c2f0b0425808422cc11f0bdfcd
#>   source: Epic EMR, query v4.2, ICD mapping v3.2
#>   sort_key: ccfid
#> - file: labs_20240115.csv
#>   extract_date: '2024-01-15'
#>   n_rows: 200
#>   sha256: 035acf2e981692554ba8e2c07f8b61f9d6ceb834343612d50d5487aed062b380
#>   source: Epic EMR, labs module — Hgb, eGFR, BNP
#>   sort_key: ccfid

Each entry records exactly what is needed to re-verify the file later: extract_date, n_rows, sha256, and any provenance notes you supply.

SAS datasets — automatic row count

Files exported directly from SAS (.sas7bdat) are handled the same way. update_manifest() calls haven::read_sas() internally to count rows; the checksum is always computed from the raw file bytes.

update_manifest(
  file          = here::here("datasets", "cohort_20240115.sas7bdat"),
  manifest_path = here::here("manifest.yaml"),
  extract_date  = "2024-01-15",
  source        = "SAS CORR registry extract, query v3.1",
  sort_key      = "pat_id"
)

Excel workbooks — automatic row count

Adjudication logs, crosswalk tables, and other curator-maintained datasets often arrive as .xlsx files. update_manifest() calls readxl::read_excel() to count rows.

update_manifest(
  file          = here::here("datasets", "adjudication_20240115.xlsx"),
  manifest_path = here::here("manifest.yaml"),
  extract_date  = "2024-01-15",
  source        = "Clinical Events Committee adjudication log, v2"
)

Other formats — supply n_rows explicitly

For RDS, Feather, Parquet, or any format not listed above, pass n_rows directly. The SHA-256 is still computed from the raw bytes.

update_manifest(
  file          = here::here("datasets", "cohort_20240115.rds"),
  manifest_path = here::here("manifest.yaml"),
  extract_date  = "2024-01-15",
  n_rows        = nrow(readRDS(here::here("datasets", "cohort_20240115.rds"))),
  source        = "RDS cached from Epic pull"
)

Verifying Before Every Analysis

Place verify_manifest() at the very top of every analysis script and Quarto document — before any data are loaded and before any results are produced.

verify_manifest(
  manifest_path = manifest_path,
  data_dir      = datasets_dir
)
#> cohort_20240115.csv — SHA-256 match (n = 200)
#> labs_20240115.csv — SHA-256 match (n = 200)

The function prints one confirmation line per file and returns its per-file report invisibly (assign the call if you want to inspect it). If everything is clean, your script continues.

Typical script header

library(hvtiRutilities)
library(here)

# Abort immediately if any registered dataset has changed
verify_manifest(here("manifest.yaml"))

# Only reached if all checksums pass
cohort <- read.csv(here("datasets", "cohort_20240115.csv"))
labs   <- read.csv(here("datasets", "labs_20240115.csv"))

What Happens When Data Changes

Suppose the data warehouse silently updates cohort_20240115.csv overnight. verify_manifest() catches it the moment the next analysis run starts.

# Simulate a silent dataset change
cohort_modified <- cohort
cohort_modified$age[1] <- cohort_modified$age[1] + 1L
write.csv(cohort_modified, cohort_file, row.names = FALSE)

# verify_manifest stops immediately
verify_manifest(
  manifest_path = manifest_path,
  data_dir      = datasets_dir
)
#> labs_20240115.csv — SHA-256 match (n = 200)
#> Error:
#> ! STOP: manifest verification failed for:
#>    cohort_20240115.csv: SHA-256 mismatch
#>   expected: f3db9dd8a47765003a2509c54068f8736b0fd8c2f0b0425808422cc11f0bdfcd
#>   actual:   5557af305fedf53ed5ee74c6a0aace29e4974a549a2a44fc5ff8c173d40a4e1a

The error message names the affected file and shows both the expected and actual SHA-256 values, giving you an unambiguous audit trail.

Collecting all failures at once

During development it can be useful to see every problem before fixing any of them. Set stop_on_error = FALSE to collect warnings instead of stopping.

report <- verify_manifest(
  manifest_path = manifest_path,
  data_dir      = datasets_dir,
  stop_on_error = FALSE
)
#> labs_20240115.csv — SHA-256 match (n = 200)
#> Warning: STOP: manifest verification failed for:
#>    cohort_20240115.csv: SHA-256 mismatch
#>   expected: f3db9dd8a47765003a2509c54068f8736b0fd8c2f0b0425808422cc11f0bdfcd
#>   actual:   5557af305fedf53ed5ee74c6a0aace29e4974a549a2a44fc5ff8c173d40a4e1a

report[report$status == "FAIL", c("file", "message")]
#>                  file
#> 1 cohort_20240115.csv
#>                                                                                                                                                                        message
#> 1 SHA-256 mismatch\n  expected: f3db9dd8a47765003a2509c54068f8736b0fd8c2f0b0425808422cc11f0bdfcd\n  actual:   5557af305fedf53ed5ee74c6a0aace29e4974a549a2a44fc5ff8c173d40a4e1a

Updating the manifest after a legitimate re-pull

If a new extract is intentional — a corrected pull, an updated query — call update_manifest() again with the same filename. The existing entry is replaced in place and the new checksum is recorded.

# Re-register after the legitimate change
update_manifest(
  file          = cohort_file,
  manifest_path = manifest_path,
  extract_date  = "2024-01-15",
  source        = "Epic EMR, query v4.2, ICD mapping v3.2 — corrected age for pt 1"
)
#> Manifest updated: cohort_20240115.csv

# Verification now passes again
verify_manifest(
  manifest_path = manifest_path,
  data_dir      = datasets_dir
)
#> cohort_20240115.csv — SHA-256 match (n = 200)
#> labs_20240115.csv — SHA-256 match (n = 200)

A New Pull for a Different Project

When the same registry is pulled again months later for a different analysis, it gets a new filename — never overwriting the original.

set.seed(99)
cohort_aug <- generate_survival_data(n = 315, seed = 99)

cohort_aug_file <- file.path(datasets_dir, "cohort_20240801.csv")
write.csv(cohort_aug, cohort_aug_file, row.names = FALSE)

update_manifest(
  file          = cohort_aug_file,
  manifest_path = manifest_path,
  extract_date  = "2024-08-01",
  source        = "Epic EMR, query v4.2, ICD mapping v3.3 — LVAD sub-study cohort",
  sort_key      = "ccfid"
)
#> Manifest entry added: cohort_20240801.csv

The manifest now tracks all three files independently.

m <- yaml::read_yaml(manifest_path)
do.call(rbind, lapply(m$datasets, function(e)
  data.frame(file = e$file, extract_date = e$extract_date,
             n_rows = e$n_rows, stringsAsFactors = FALSE)
))
#>                  file extract_date n_rows
#> 1 cohort_20240115.csv   2024-01-15    200
#> 2   labs_20240115.csv   2024-01-15    200
#> 3 cohort_20240801.csv   2024-08-01    315

Verification covers all of them in a single call.

verify_manifest(
  manifest_path = manifest_path,
  data_dir      = datasets_dir
)
#> cohort_20240115.csv — SHA-256 match (n = 200)
#> labs_20240115.csv — SHA-256 match (n = 200)
#> cohort_20240801.csv — SHA-256 match (n = 315)

Policy Recommendations

Stage                         Action
On every data pull            Run update_manifest() immediately and commit manifest.yaml to git
Top of every analysis script  Call verify_manifest() — before loading any data
At manuscript submission      Tag the git commit; freeze manifest.yaml; do not re-pull data
For a new project or re-pull  Use a new date-stamped filename; add a new update_manifest() entry
.gitignore                    Add datasets/ — commit the manifest, not the data

Summary

  • update_manifest() records the SHA-256 checksum, row count, extract date, and optional provenance fields for CSV, SAS, Excel, or any file type.
  • verify_manifest() checks every registered dataset before results are produced; one changed byte triggers an immediate, informative error.
  • The combination makes dataset drift detectable and the audit trail reproducible from first pull to manuscript submission.

Session Information

sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] hvtiRutilities_1.0.0.9004
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39    R6_2.6.1         labelled_2.16.0  fastmap_1.2.0   
#>  [5] tidyselect_1.2.1 xfun_0.57        magrittr_2.0.4   glue_1.8.0      
#>  [9] tibble_3.3.1     knitr_1.51       pkgconfig_2.0.3  htmltools_0.5.9 
#> [13] generics_0.1.4   rmarkdown_2.31   dplyr_1.2.0      lifecycle_1.0.5 
#> [17] cli_3.6.5        vctrs_0.7.2      compiler_4.5.3   forcats_1.0.1   
#> [21] haven_2.5.5      tools_4.5.3      hms_1.1.4        pillar_1.11.1   
#> [25] evaluate_1.0.5   yaml_2.3.12      rlang_1.1.7      jsonlite_2.0.0