Records dataset metadata (a SHA-256 checksum, row count, extract date, and optional provenance fields) in a manifest.yaml file. If the manifest already contains an entry for the named file, that entry is updated in place; otherwise a new entry is appended. The manifest is intended to be committed to version control, while the data files themselves are not.
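
The update-or-append behaviour can be illustrated with a minimal base R sketch. The manifest layout used here (a list of entries, each carrying a file field) and the helper name upsert_entry() are assumptions made for illustration; the package's actual schema and internals may differ.

```r
# Hypothetical helper: not part of the package.
# Assumes the manifest is a list of entries, each with a `file` element.
upsert_entry <- function(manifest, entry) {
  # Collect the file names already recorded (character(0) if none).
  files <- vapply(manifest, function(e) e$file, character(1))
  idx <- match(entry$file, files)
  if (is.na(idx)) {
    # No existing entry for this file: append a new one.
    manifest[[length(manifest) + 1L]] <- entry
  } else {
    # Existing entry found: replace it in place, keeping its position.
    manifest[[idx]] <- entry
  }
  manifest
}

m <- list()
m <- upsert_entry(m, list(file = "cohort.csv", n_rows = 10L))
m <- upsert_entry(m, list(file = "labs.csv", n_rows = 5L))
m <- upsert_entry(m, list(file = "cohort.csv", n_rows = 12L))  # updates entry 1
```

Keeping an updated entry at its original position means a re-extract changes only the affected lines of the file, which keeps version-control diffs small.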

Row counts are detected automatically for CSV (.csv) files. For SAS (.sas7bdat) and Excel (.xlsx, .xls) files, automatic row counting is considered "heavy" because it loads the entire dataset/workbook into memory; it is therefore disabled by default and only performed when options(manifest.allow_heavy_rowcount = TRUE) is set. For any other format, or when heavy counting is disabled, supply n_rows explicitly.
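
The rule above amounts to a small decision on the file extension and the option value. The following is a sketch of that decision only, not the package's implementation; the function name can_autocount() is hypothetical.

```r
# Hypothetical helper: returns TRUE when a row count would be detected
# automatically under the rules described above.
can_autocount <- function(file) {
  ext <- tolower(tools::file_ext(file))
  if (ext == "csv") {
    return(TRUE)
  }
  if (ext %in% c("sas7bdat", "xlsx", "xls")) {
    # "Heavy" formats are counted only when the user has opted in.
    return(isTRUE(getOption("manifest.allow_heavy_rowcount", FALSE)))
  }
  # Any other format: the caller must supply n_rows.
  FALSE
}

can_autocount("cohort_20240115.csv")
can_autocount("labs_20240115.sas7bdat")
```

In practice this means that, for a SAS or Excel file, you either set options(manifest.allow_heavy_rowcount = TRUE) before calling update_manifest() or pass n_rows yourself.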

Usage

update_manifest(
  file,
  manifest_path = "manifest.yaml",
  extract_date = Sys.Date(),
  n_rows = NULL,
  source = NULL,
  sort_key = NULL
)

Arguments

file

Character. Path to the dataset file.

manifest_path

Character. Path to the manifest YAML file. Created if it does not exist. Defaults to "manifest.yaml" in the current working directory.

extract_date

Character or Date. The date the data were pulled from the source system. Stored as "YYYY-MM-DD". Defaults to today's date.

n_rows

Integer. Number of data rows. When NULL (the default), the row count is detected automatically for CSV files, and for SAS/Excel files only when options(manifest.allow_heavy_rowcount = TRUE) is set. For all other file types, supply this value explicitly.

source

Character. Free-text description of the data source (e.g. "Epic EMR, query v4.2, ICD mapping v3.2").

sort_key

Character. Column name(s) that define the canonical sort order of the dataset.
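
Because extract_date accepts either a character string or a Date but is stored as "YYYY-MM-DD", some normalisation step is implied. One way this could look is sketched below; the helper name normalise_extract_date() is hypothetical, and the package may validate its input differently.

```r
# Hypothetical helper: coerce a Date or character input to "YYYY-MM-DD".
normalise_extract_date <- function(x = Sys.Date()) {
  d <- as.Date(x)  # errors for strings in a non-standard format
  if (length(d) != 1L || is.na(d)) {
    stop("extract_date must be a single valid date")
  }
  format(d, "%Y-%m-%d")
}

normalise_extract_date("2024-01-15")
normalise_extract_date(as.Date("2024-01-15"))
```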

Value

Invisibly returns the updated manifest as a named list.

Examples

if (FALSE) { # \dontrun{
# --- CSV ------------------------------------------------------------
update_manifest(
  file         = here::here("datasets", "cohort_20240115.csv"),
  extract_date = "2024-01-15",
  source       = "Epic EMR, query v4.2, ICD mapping v3.2",
  sort_key     = "patient_id"
)

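# --- SAS ------------------------------------------------------------
# .sas7bdat files exported from SAS or pulled via SASConnect.
# Row counts are not detected for SAS files by default: pass n_rows, or
# set options(manifest.allow_heavy_rowcount = TRUE) before this call.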
update_manifest(
  file         = here::here("datasets", "labs_20240115.sas7bdat"),
  extract_date = "2024-01-15",
  source       = "SAS dataset from CORR registry, labs module v2.1",
  sort_key     = "pat_id"
)

# --- Excel ----------------------------------------------------------
# As with SAS, automatic row counting is off by default for .xlsx/.xls.
update_manifest(
  file         = here::here("datasets", "adjudication_20240115.xlsx"),
  extract_date = "2024-01-15",
  source       = "Clinical events committee adjudication log"
)

# --- Verify all three at once ---------------------------------------
verify_manifest(here::here("manifest.yaml"))
} # }