Getting Started with hvtiRutilities
John Ehrlinger
2026-02-13
Source: vignettes/hvtiRutilities.Rmd
Introduction
The hvtiRutilities package provides utility functions
for working with clinical research data at the Cleveland Clinic Heart,
Vascular and Thoracic Institute (HVTI). It simplifies common data
preparation tasks when importing and cleaning datasets, particularly
those originating from SAS.
Main Functions
- r_data_types(): Automatically infer and convert data types based on content
- label_map(): Extract variable labels from labeled datasets into a lookup table
- sample_data(): Generate sample datasets for testing and examples
Installation
# Install from GitHub
# install.packages("pak")
pak::pak("ehrlinger/hvtiRutilities")
library(hvtiRutilities)
#>
#> hvtiRutilities 0.1.3
#>
#> Type hvtiRutilities.news() to see new features, changes, and bug fixes.
#>
Basic Usage
Automatic Type Conversion with r_data_types()
The r_data_types() function intelligently converts
column types based on their content:
# Generate sample data with various types
dta <- sample_data(n = 100)
# Examine original structure
str(dta)
#> 'data.frame': 100 obs. of 7 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ boolean: int 1 1 2 1 1 2 1 1 1 2 ...
#> $ logical: chr "F" "F" "T" "F" ...
#> $ f_real : num 0.508 0.404 0.806 0.418 0.806 ...
#> $ float : num 0.118 -0.284 0.498 1.61 1.861 ...
#> $ char : chr "male" "male" "male" "male" ...
#> $ factor : Factor w/ 5 levels "C1","C2","C3",..: 3 1 5 1 4 3 1 2 1 1 ...
Notice that the sample data has:
- boolean: integer values (1, 2)
- logical: character values (“F”, “T”)
- char: character values (“male”, “female”)
Now let’s apply automatic type conversion:
# Convert types automatically
dta_converted <- r_data_types(dta)
# Examine converted structure
str(dta_converted)
#> 'data.frame': 100 obs. of 7 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> ..- attr(*, "label")= chr "id"
#> $ boolean: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
#> ..- attr(*, "label")= chr "boolean"
#> $ logical: Factor w/ 2 levels "F","T": 1 1 2 1 1 2 1 1 1 2 ...
#> ..- attr(*, "label")= chr "logical"
#> $ f_real : Factor w/ 9 levels "0.218431008746848",..: 4 2 8 3 8 7 8 8 9 3 ...
#> ..- attr(*, "label")= chr "f_real"
#> $ float : num 0.118 -0.284 0.498 1.61 1.861 ...
#> ..- attr(*, "label")= chr "float"
#> $ char : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 2 1 ...
#> ..- attr(*, "label")= chr "char"
#> $ factor : Factor w/ 5 levels "C1","C2","C3",..: 3 1 5 1 4 3 1 2 1 1 ...
#> ..- attr(*, "label")= chr "factor"
After conversion:
- boolean: converted to logical (TRUE/FALSE) because it has exactly 2 unique values
- logical: converted to factor (categorical variable)
- char: converted to factor (categorical variable)
- f_real: converted to factor, because it has only 9 unique values and so falls under the factor_size threshold even though its values are real numbers
- float: remains numeric
Transformation Rules
The function applies transformations in this order:
- Character strings “NA”, “na”, “Na”, “nA” → actual NA values
- Numeric/integer with exactly 2 unique values → logical
- Character columns → factor
- Numeric with 3 to factor_size unique values → factor
- Optionally: logical → factor (if binary_factor = TRUE)
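The ordering above can be sketched for a single column in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name infer_type_sketch() is hypothetical, as.logical() is assumed for the two-value rule, and the upper bound is treated as exclusive (fewer than factor_size unique values), following the comments in this vignette's examples.

```r
# Hypothetical single-column sketch of the rules listed above.
# Not the package's code: as.logical() and the exclusive upper
# bound on factor_size are assumptions made for illustration.
infer_type_sketch <- function(x, factor_size = 10, binary_factor = FALSE) {
  # Rule 1: character "NA" variants become true missing values
  if (is.character(x)) {
    x[x %in% c("NA", "na", "Na", "nA")] <- NA
  }
  n_unique <- length(unique(x[!is.na(x)]))

  if (is.numeric(x) && n_unique == 2) {
    # Rule 2: numeric/integer with exactly 2 unique values -> logical
    x <- as.logical(x)
  } else if (is.character(x)) {
    # Rule 3: character columns -> factor
    x <- factor(x)
  } else if (is.numeric(x) && n_unique >= 3 && n_unique < factor_size) {
    # Rule 4: numeric with few unique values -> factor
    x <- factor(x)
  }

  # Rule 5 (optional): logical -> factor
  if (binary_factor && is.logical(x)) {
    x <- factor(x)
  }
  x
}

# Apply the sketch column by column and inspect the resulting classes
converted <- lapply(mtcars, infer_type_sketch)
sapply(converted, class)
```

Because as.logical() maps every nonzero number to TRUE, a two-value column coded as something other than 0/1 would need recoding first under this sketch.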
Working with Real Data
Example: mtcars Dataset
Let’s apply this to the built-in mtcars dataset:
# Original mtcars
str(mtcars[, 1:5])
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# Apply type conversion
mtcars_clean <- r_data_types(mtcars)
str(mtcars_clean[, 1:5])
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..- attr(*, "label")= chr "mpg"
#> $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#> ..- attr(*, "label")= chr "cyl"
#> $ disp: num 160 160 108 258 360 ...
#> ..- attr(*, "label")= chr "disp"
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> ..- attr(*, "label")= chr "hp"
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> ..- attr(*, "label")= chr "drat"
Across the full dataset (only the first five columns are printed above), notice:
- cyl (3 unique values: 4, 6, 8) → factor
- vs (2 unique values: 0, 1) → logical
- am (2 unique values: 0, 1) → logical
- gear (3 unique values) → factor
- carb (6 unique values) → factor
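Which rule a column falls under depends on its number of distinct values, so a quick count in base R shows in advance what a conversion is likely to do. The sketch below does not call the package; the name n_unique is illustrative, and the cutoff of 10 is an assumption about the default factor_size.

```r
# Count distinct non-missing values in every mtcars column
n_unique <- sapply(mtcars, function(x) length(unique(x[!is.na(x)])))

# Columns with exactly 2 distinct values are candidates for logical
names(n_unique)[n_unique == 2]

# Columns with 3 or more, but fewer than 10, distinct values
# are candidates for factor (10 is an assumed cutoff)
names(n_unique)[n_unique >= 3 & n_unique < 10]
```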
Controlling Factor Conversion
Use factor_size to control when numeric variables become
factors:
# Stricter: only convert if < 5 unique values
mtcars_strict <- r_data_types(mtcars, factor_size = 5)
# Check cylinder variable
class(mtcars_clean$cyl) # factor (3 unique < 10)
#> [1] "factor"
class(mtcars_strict$cyl) # factor (3 unique < 5)
#> [1] "factor"
# Check carb variable
class(mtcars_clean$carb) # factor (6 unique < 10)
#> [1] "factor"
class(mtcars_strict$carb) # numeric (6 unique NOT < 5)
#> [1] "numeric"
Skipping Specific Variables
Sometimes you want to preserve certain variables in their original form:
# Keep vs and am as numeric instead of converting to logical
mtcars_partial <- r_data_types(mtcars, skip_vars = c("vs", "am"))
# Compare
class(mtcars_clean$vs) # logical (converted)
#> [1] "logical"
class(mtcars_partial$vs) # numeric (preserved)
#> [1] "numeric"
Binary Variables as Factors
By default, binary variables become logical. Use
binary_factor = TRUE to make them factors instead:
mtcars_factor <- r_data_types(mtcars, binary_factor = TRUE)
# Compare
class(mtcars_clean$vs) # logical
#> [1] "logical"
class(mtcars_factor$vs) # factor
#> [1] "factor"
This can be useful for modeling or visualization where factor levels are preferred.
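For instance, a factor can carry descriptive level names that then appear in tables and plot legends. The sketch below uses base R only and does not call r_data_types(); the level names follow the mtcars documentation (vs: 0 = V-shaped, 1 = straight), and the variable names are illustrative.

```r
# Logical version of the engine-shape indicator
vs_logical <- mtcars$vs == 1

# Factor version with descriptive level names
vs_factor <- factor(vs_logical,
                    levels = c(FALSE, TRUE),
                    labels = c("V-shaped", "Straight"))

# Cross-tabulation now shows the names rather than FALSE/TRUE
table(engine = vs_factor, cyl = mtcars$cyl)
```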
Working with Variable Labels
Creating and Extracting Labels
Variable labels are common in clinical datasets, especially those
imported from SAS. The label_map() function creates a
lookup table:
library(labelled)
# Create a dataset with labels
patient_data <- data.frame(
  patient_id = 1:5,
  age = c(45, 52, 38, 61, 29),
  sex = c("M", "F", "M", "F", "M"),
  sbp = c(120, 135, 118, 142, 125),
  dbp = c(80, 85, 75, 90, 82),
  stringsAsFactors = FALSE
)
# Add descriptive labels (as would come from SAS)
var_label(patient_data$patient_id) <- "Patient Identifier"
var_label(patient_data$age) <- "Age at Enrollment (years)"
var_label(patient_data$sex) <- "Biological Sex"
var_label(patient_data$sbp) <- "Systolic Blood Pressure (mmHg)"
var_label(patient_data$dbp) <- "Diastolic Blood Pressure (mmHg)"
# Extract labels into a lookup table
labels <- label_map(patient_data)
print(labels)
#> key label
#> patient_id patient_id Patient Identifier
#> age age Age at Enrollment (years)
#> sex sex Biological Sex
#> sbp sbp Systolic Blood Pressure (mmHg)
#> dbp dbp Diastolic Blood Pressure (mmHg)
Using Labels in Analysis
The label map is useful for creating publication-ready tables and plots:
# Create a summary statistics table
summary_stats <- data.frame(
  variable = c("age", "sbp", "dbp"),
  mean = c(mean(patient_data$age),
           mean(patient_data$sbp),
           mean(patient_data$dbp)),
  sd = c(sd(patient_data$age),
         sd(patient_data$sbp),
         sd(patient_data$dbp))
)
# Add descriptive labels
summary_stats$description <- labels$label[match(summary_stats$variable, labels$key)]
print(summary_stats)
#> variable mean sd description
#> 1 age 45.0 12.34909 Age at Enrollment (years)
#> 2 sbp 128.0 10.22252 Systolic Blood Pressure (mmHg)
#> 3 dbp 82.4 5.59464 Diastolic Blood Pressure (mmHg)
Labels Persist Through Transformations
Labels are preserved when using r_data_types():
# Convert types
patient_clean <- r_data_types(patient_data, skip_vars = "patient_id")
# Labels are preserved
var_label(patient_clean$age)
#> [1] "Age at Enrollment (years)"
var_label(patient_clean$sex)
#> [1] "Biological Sex"
# Extract labels from converted data
labels_clean <- label_map(patient_clean)
identical(labels, labels_clean) # TRUE
#> [1] TRUE
Complete Workflow Example
Here’s a complete workflow for preparing clinical data, using simulated data in place of a real SAS import:
# Step 1: Generate sample clinical data
set.seed(123)
clinical <- data.frame(
  id = 1:50,
  center = sample(c("Site A", "Site B", "Site C"), 50, replace = TRUE),
  treatment = sample(1:4, 50, replace = TRUE),
  age = round(rnorm(50, mean = 55, sd = 12)),
  sex = sample(c("M", "F"), 50, replace = TRUE),
  outcome = sample(0:1, 50, replace = TRUE, prob = c(0.3, 0.7)),
  followup_days = round(runif(50, 30, 730)),
  adverse_event = sample(c("None", "Mild", "Moderate", "Severe", "NA"),
                         50, replace = TRUE),
  stringsAsFactors = FALSE
)
# Step 2: Add variable labels (as from SAS import)
var_label(clinical$id) <- "Patient ID"
var_label(clinical$center) <- "Enrollment Center"
var_label(clinical$treatment) <- "Treatment Arm (1-4)"
var_label(clinical$age) <- "Age at Enrollment (years)"
var_label(clinical$sex) <- "Biological Sex"
var_label(clinical$outcome) <- "Primary Outcome (0=Failure, 1=Success)"
var_label(clinical$followup_days) <- "Days of Follow-up"
var_label(clinical$adverse_event) <- "Most Severe Adverse Event"
# Step 3: Clean and convert types
clinical_clean <- r_data_types(
  clinical,
  factor_size = 5, # Only convert if < 5 unique values
  skip_vars = "id" # Keep ID as integer
)
# Step 4: Examine results
str(clinical_clean)
#> 'data.frame': 50 obs. of 8 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> ..- attr(*, "label")= chr "Patient ID"
#> $ center : Factor w/ 3 levels "Site A","Site B",..: 3 3 3 2 3 2 2 2 3 1 ...
#> ..- attr(*, "label")= chr "Enrollment Center"
#> $ treatment : Factor w/ 4 levels "1","2","3","4": 1 1 4 4 3 1 2 1 1 3 ...
#> ..- attr(*, "label")= chr "Treatment Arm (1-4)"
#> $ age : num 42 49 74 42 75 52 49 46 81 39 ...
#> ..- attr(*, "label")= chr "Age at Enrollment (years)"
#> $ sex : Factor w/ 2 levels "F","M": 1 1 2 2 1 1 1 1 2 1 ...
#> ..- attr(*, "label")= chr "Biological Sex"
#> $ outcome : logi FALSE FALSE FALSE TRUE FALSE TRUE ...
#> ..- attr(*, "label")= chr "Primary Outcome (0=Failure, 1=Success)"
#> $ followup_days: num 375 303 355 529 69 278 592 615 196 278 ...
#> ..- attr(*, "label")= chr "Days of Follow-up"
#> $ adverse_event: Factor w/ 4 levels "Mild","Moderate",..: 4 1 NA 3 3 1 3 1 NA 3 ...
#> ..- attr(*, "label")= chr "Most Severe Adverse Event"
# Step 5: Extract labels for reporting
label_lookup <- label_map(clinical_clean)
# Step 6: Use in analysis
# Count by treatment
table(clinical_clean$treatment)
#>
#> 1 2 3 4
#> 15 13 12 10
# Outcome by treatment: count, number of successes, and success rate
outcome_summary <- aggregate(
  outcome ~ treatment,
  data = clinical_clean,
  FUN = function(x) c(n = length(x), success = sum(x), rate = mean(x))
)
print(outcome_summary)
#> treatment outcome.n outcome.success outcome.rate
#> 1 1 15.0000000 9.0000000 0.6000000
#> 2 2 13.0000000 8.0000000 0.6153846
#> 3 3 12.0000000 9.0000000 0.7500000
#> 4 4 10.0000000 7.0000000 0.7000000
Handling Missing Data
The function automatically handles character NA variants:
# Data with various NA representations
messy <- data.frame(
  var1 = c("NA", "value1", "na", "value2", "nA"),
  var2 = c("NA", "A", "na", "B", "C"),
  var3 = c(1, 2, NA, 4, 5),
  stringsAsFactors = FALSE
)
clean <- r_data_types(messy)
# Character NAs converted to true NA
sum(is.na(clean$var1)) # 3 NAs
#> [1] 3
levels(clean$var1) # Only "value1" and "value2"
#> [1] "value1" "value2"
sum(is.na(clean$var2)) # 2 NAs
#> [1] 2
levels(clean$var2) # "A", "B", "C"
#> [1] "A" "B" "C"
Advanced Usage
Custom Workflows for Specific Data Types
For datasets with specific requirements:
# Lab results with reference ranges
labs <- data.frame(
  patient = 1:20,
  test_name = rep(c("Glucose", "HbA1c"), 10),
  value = c(rnorm(10, 100, 15), rnorm(10, 6.5, 1)),
  unit = rep(c("mg/dL", "%"), 10),
  flag = sample(c("Normal", "High", "Low"), 20, replace = TRUE),
  stringsAsFactors = FALSE
)
# Convert with specific settings
labs_clean <- r_data_types(
  labs,
  skip_vars = c("patient", "value"), # Preserve ID and numeric values
  factor_size = 4 # Conservative factor conversion
)
str(labs_clean)
#> 'data.frame': 20 obs. of 5 variables:
#> $ patient : int 1 2 3 4 5 6 7 8 9 10 ...
#> ..- attr(*, "label")= chr "patient"
#> $ test_name: Factor w/ 2 levels "Glucose","HbA1c": 1 2 1 2 1 2 1 2 1 2 ...
#> ..- attr(*, "label")= chr "test_name"
#> $ value : num 85.1 125.1 93.4 89.2 81.5 ...
#> ..- attr(*, "label")= chr "value"
#> $ unit : Factor w/ 2 levels "%","mg/dL": 2 1 2 1 2 1 2 1 2 1 ...
#> ..- attr(*, "label")= chr "unit"
#> $ flag : Factor w/ 3 levels "High","Low","Normal": 1 2 2 3 1 2 3 3 2 1 ...
#> ..- attr(*, "label")= chr "flag"
Integration with Data Import
Typical workflow when importing from SAS:
# Read SAS dataset (example - not run)
# library(haven)
# sas_data <- read_sas("path/to/data.sas7bdat")
# Apply type conversion and extract labels
# clean_data <- r_data_types(sas_data, factor_size = 15)
# variable_labels <- label_map(clean_data)
# Save labels for documentation
# write.csv(variable_labels, "data_dictionary.csv", row.names = FALSE)
Best Practices
When to Use r_data_types()
Use it when:
- Importing data from SAS, SPSS, or other statistical software
- Working with datasets where types aren’t correctly inferred
- You have many categorical variables coded as integers
- You need consistent type handling across multiple datasets
Skip it when:
- Your data types are already correct
- You need very specific type conversions not covered by the function
- Working with specialized data structures (time series, spatial data, etc.)
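Dates and ordered categories are not among the transformation rules listed earlier, so they are one case where converting by hand is the simpler route. A minimal base-R sketch with made-up data (the visits data frame and its columns are hypothetical):

```r
# Hypothetical data: a date stored as text and a category with a natural order
visits <- data.frame(
  visit_date = c("2024-01-15", "2024-02-20", "2024-03-05"),
  grade = c("low", "high", "medium"),
  stringsAsFactors = FALSE
)

# Parse the text column into Date objects with an explicit format
visits$visit_date <- as.Date(visits$visit_date, format = "%Y-%m-%d")

# Build an ordered factor so that comparisons respect low < medium < high
visits$grade <- factor(visits$grade,
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)

# Date arithmetic and ordered comparisons now work as expected
diff(visits$visit_date)
visits$grade >= "medium"
```

Columns prepared this way could then be listed in skip_vars so that a later automatic pass leaves them untouched.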
Recommended Settings by Use Case
Exploratory Analysis:
data_clean <- r_data_types(data, factor_size = 10)
Modeling/Regression:
data_clean <- r_data_types(data, factor_size = 5, binary_factor = FALSE)
Descriptive Statistics/Tables:
data_clean <- r_data_types(data, factor_size = 15, binary_factor = TRUE)
Summary
The hvtiRutilities package streamlines data preparation
for clinical research:
- r_data_types(): Automatic, intelligent type conversion
- label_map(): Easy extraction of variable labels
- sample_data(): Generate test data
Key features:
- Preserves variable labels through transformations
- Handles multiple NA representations
- Flexible control over factor conversion
- Works with data.frames, tibbles, and data.tables
For more information:
- Package documentation: ?r_data_types, ?label_map
- GitHub: https://github.com/ehrlinger/hvtiRutilities
- Issues: https://github.com/ehrlinger/hvtiRutilities/issues
- Release notes: Run hvtiRutilities.news() in R
Session Information
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] labelled_2.16.0 hvtiRutilities_0.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.7.1 cli_3.6.5 knitr_1.51 rlang_1.1.7
#> [5] xfun_0.56 forcats_1.0.1 haven_2.5.5 generics_0.1.4
#> [9] textshaping_1.0.4 jsonlite_2.0.0 glue_1.8.0 htmltools_0.5.9
#> [13] ragg_1.5.0 sass_0.4.10 hms_1.1.4 rmarkdown_2.30
#> [17] tibble_3.3.1 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [21] yaml_2.3.12 lifecycle_1.0.5 compiler_4.5.2 dplyr_1.2.0
#> [25] fs_1.6.6 pkgconfig_2.0.3 systemfonts_1.3.1 digest_0.6.39
#> [29] R6_2.6.1 tidyselect_1.2.1 pillar_1.11.1 magrittr_2.0.4
#> [33] bslib_0.10.0 withr_3.0.2 tools_4.5.2 pkgdown_2.2.0
#> [37] cachem_1.1.0 desc_1.4.3