This vignette explains the four-layer model pdfium uses,
the memory contract you rely on as a user, and the most common pitfalls.
For contributor-facing detail (CI topology, fixture rebuild pipeline,
PDFium bump procedure), see dev/architecture.md in the
source tree.
Four-layer model
R user
   │
   ▼
R API           pdf_doc_open(), pdf_doc_close(), pdf_page_count(), ...
   │            S3 classes: pdfium_doc, pdfium_page, pdfium_obj
   ▼
Rcpp glue       internal cpp_* helpers; you should never see them
   │
   ▼
PDFium C ABI    Google's PDFium engine (BSD-3-Clause)
   │
   ▼
libpdfium       shared library downloaded at install time
The R API takes user inputs, validates them, calls the Rcpp glue, and shapes the result back into idiomatic R (tibbles for tabular outputs, S3 objects for handles). You only ever interact with the R API.
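The validate, delegate, reshape pattern described above can be sketched in plain R. Everything in this example is hypothetical: cpp_page_sizes_stub() stands in for an internal Rcpp helper and page_sizes() for an exported function, and neither name is part of pdfium. A base data.frame is used instead of a tibble so the sketch runs without extra packages.

```r
# Hypothetical stand-in for an internal Rcpp helper (not a real pdfium function).
# It returns a raw list, the way a thin glue layer might.
cpp_page_sizes_stub <- function(path) {
  list(width = c(612, 612), height = c(792, 792))
}

# Hypothetical R API wrapper: validate, delegate, reshape.
page_sizes <- function(path) {
  # 1. Validate user input before anything reaches compiled code.
  if (!is.character(path) || length(path) != 1L || is.na(path)) {
    stop("`path` must be a single, non-missing string.", call. = FALSE)
  }
  # 2. Call the glue layer.
  raw <- cpp_page_sizes_stub(path)
  # 3. Shape the raw list into a tabular R object.
  data.frame(
    page   = seq_along(raw$width),
    width  = raw$width,
    height = raw$height
  )
}

sizes <- page_sizes("report.pdf")
```

Validating in R keeps bad inputs from ever reaching compiled code, where a mistake is far harder to diagnose than an ordinary R error.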
Memory: what happens when you forget to close
inspect <- function(path) {
  doc <- pdf_doc_open(path)
  pdf_page_count(doc)
  # `doc` goes out of scope here.
  # R's GC will eventually finalize it and call FPDF_CloseDocument.
}
inspect("report.pdf")

This is safe. pdfium registers a C finalizer on every
PDF handle. When the R object becomes unreachable, R’s garbage collector
runs that finalizer, which calls PDFium’s
FPDF_CloseDocument to release the underlying memory.
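The finalizer mechanism can be observed with base R alone. The sketch below does not use pdfium: make_handle(), use_handle(), mark_finalized() and finalizer_state are hypothetical names, and an R-level finalizer registered with reg.finalizer() stands in for the C finalizer that the package registers.

```r
# Shared state so the finalizer can record that it ran.
finalizer_state <- new.env()
finalizer_state$ran <- FALSE

# Finalizer: R calls it with the object that is being collected.
mark_finalized <- function(e) {
  finalizer_state$ran <- TRUE
}

# Hypothetical handle constructor: an environment with a finalizer attached.
make_handle <- function() {
  handle <- new.env()
  reg.finalizer(handle, mark_finalized)
  handle
}

use_handle <- function() {
  h <- make_handle()
  invisible(NULL)  # `h` becomes unreachable once the function returns
}

use_handle()
invisible(gc())    # force a collection; normally R decides when this happens
finalizer_state$ran
```

The explicit gc() call is there only to make the demonstration repeatable. In ordinary code you do not call it, and the finalizer runs whenever R next collects.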
The caveat is that GC is eventual, not deterministic. If you open many large documents in a tight loop, you may exhaust process memory before the GC catches up. In that case, close explicitly:
inspect <- function(path) {
  doc <- pdf_doc_open(path)
  on.exit(pdf_doc_close(doc), add = TRUE)
  pdf_page_count(doc)
}

pdf_doc_close() is idempotent: calling
it a second time is a no-op. The finalizer notices the handle has already been
closed and skips its own close call. You can safely combine explicit
close with the automatic fallback.
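One way to get this behaviour is a closed flag that both the explicit close and the finalizer check. The following base R sketch illustrates the idea under that assumption; open_handle(), close_handle() and release_counter are hypothetical names, and this is not how pdfium is necessarily implemented internally.

```r
# Counts how many times the underlying resource is actually released.
release_counter <- new.env()
release_counter$n <- 0L

# Idempotent close: only the first call releases anything.
close_handle <- function(h) {
  if (isTRUE(h$closed)) {
    return(invisible(FALSE))  # already closed: nothing to do
  }
  h$closed <- TRUE
  release_counter$n <- release_counter$n + 1L
  invisible(TRUE)
}

# Hypothetical handle whose finalizer reuses the same close function.
open_handle <- function() {
  h <- new.env()
  h$closed <- FALSE
  reg.finalizer(h, close_handle)
  h
}

h <- open_handle()
close_handle(h)   # releases the resource
close_handle(h)   # no-op
rm(h)
invisible(gc())   # the finalizer runs, sees the flag, and skips the release
release_counter$n # 1
```

Routing the finalizer through the same function as the explicit close means there is a single place where the resource is released, so a double free cannot occur.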
Children outlive their parent
load_page <- function(path) {
  doc <- pdf_doc_open(path)
  page <- pdf_page_load(doc, 1)  # available in Phase 1+
  pdf_doc_close(doc)             # this is fine
  page                           # still usable here
}

When you call pdf_page_load(doc, ...), the returned
pdfium_page holds an internal reference to its parent
pdfium_doc. Even if you drop your reference to
doc (or explicitly close it), the page stays valid until
the page object itself is collected. The underlying PDFium document is
kept alive in the background until the last page (or object) that
depends on it goes away.
The order is: child references the parent, never the other way around.
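The effect of a one-way child-to-parent reference can be demonstrated with base R environments. This sketch covers only the case where you drop your reference to the document, not an explicit close. The names open_doc_sketch(), load_page_sketch(), mark_doc_released() and doc_state are hypothetical and do not exist in pdfium.

```r
# Records whether the parent's finalizer has run.
doc_state <- new.env()
doc_state$released <- FALSE

mark_doc_released <- function(e) {
  doc_state$released <- TRUE
}

# Hypothetical parent handle with a finalizer.
open_doc_sketch <- function() {
  doc <- new.env()
  reg.finalizer(doc, mark_doc_released)
  doc
}

# Hypothetical child handle: it references the parent, not the reverse.
load_page_sketch <- function(doc) {
  page <- new.env()
  page$parent <- doc
  page
}

doc  <- open_doc_sketch()
page <- load_page_sketch(doc)

rm(doc)
invisible(gc())
released_while_page_alive <- doc_state$released  # FALSE: page keeps doc reachable

rm(page)
invisible(gc())
released_after_page_gone <- doc_state$released   # parent can now be finalized
```

Because the parent holds no link back to its children, there is no reference cycle to reason about: the parent simply becomes collectable once the last child is gone.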
How the binary gets loaded
When you call library(pdfium), the following happens:
- R loads the package’s compiled pdfium.so (or pdfium.dll on Windows).
- The dynamic linker follows the RPATH baked in at install time and loads libpdfium.{so|dylib|dll} from the package’s inst/lib/.
- .onLoad calls FPDF_InitLibraryWithConfig() exactly once.
- When you call unloadNamespace("pdfium") or quit R, .onUnload runs FPDF_DestroyLibrary().
The libpdfium binary was downloaded the first time you
installed pdfium. The pinned release tag lives in
tools/pdfium-version.txt. If your machine has no internet
access at install time, set PDFIUM_OFFLINE=1 and place the
matching tarball under inst/pdfium-binaries/ before
installing.
Common gotchas
- Holding many documents at once. GC is non-deterministic; close explicitly with pdf_doc_close().
- Deleting an open PDF on Windows. Windows blocks deletion of files held by open handles. Close the document first, then delete.
- Using a closed handle. Functions that take a pdfium_doc raise an error if the handle has already been closed. Re-open the file if you need it again.
- Calling FPDF_* directly. Don’t. The R API exists for a reason: bypassing it also skips the lifetime tracking and can crash R.
Further reading
- The decision records under dev/decisions/ capture every architectural choice and the alternatives that were considered.
- dev/architecture.md covers contributor-facing topics: CI topology, the PDFium bump procedure, and the fixture-rebuild pipeline.
- vignette("getting-started") walks through a complete inspection workflow.