The problem: storing many CT versions compactly
CDISC Controlled Terminology is released quarterly. Each release contains roughly 45,000 rows. Storing every version as a separate file would mean 45,000 × 40+ releases ≈ 1.8 million rows and significant duplication, since most terms do not change between releases.
The solution: validity-date design
cdiscdata uses a validity-date design. Each row
represents a unique term-state: the set of values for a
term at a given point in time.
| Column | Meaning |
|---|---|
valid_from |
Date of the first CT release in which this term-state appeared |
valid_to |
Date of the last CT release in which this term-state appeared;
NA if still current |
When a term’s content changes between releases: 1. The old row is
closed by setting valid_to = new_release_date - 1.
2. A new row is added with valid_from = new_release_date
and valid_to = NA.
When a term is retired: - Its current row is closed with
valid_to = retirement_date - 1. - No new row is added.
Querying a specific version
get_ct() and get_dataset() filter the table
to rows valid at the requested date:
# Rows where valid_from <= query_date AND (valid_to >= query_date OR valid_to is NA)
tbl[tbl$valid_from <= query_date &
(is.na(tbl$valid_to) | tbl$valid_to >= query_date), ]The query date must exactly match one of the release dates returned
by available_ct_versions(). This ensures reproducibility:
the same version string always returns the same rows.