Choosing pdfium vs other PDF packages

CRAN already has several PDF packages. This vignette helps you pick the right one for the task — and explains where pdfium adds new capability rather than duplicating existing work. A more detailed contributor-facing inventory lives in dev/r-pdf-ecosystem-survey.md.

TL;DR — which package for which job?

Task	First-line package
Read text only (whole-page strings)	`pdftools`
Read text with per-token bounding boxes	`pdftools::pdf_data()` (Poppler, integer coordinates per token) or `pdfium::pdf_text_runs()` (PDFium, floating-point coordinates, plus font flags)
Render a page to a bitmap	`pdftools::pdf_render_page()` or `pdfium::pdf_render_page()`
Lossless split / merge / compress	`qpdf` (or `cpp11qpdf`)
OCR or general image-processing pipeline	`magick`
Extract a table from a PDF	`tabulapdf`
Inspect path geometry (segments, Bezier control points, stroke/fill, transform matrices)	`pdfium` — no other CRAN package surfaces this
Fill AcroForm fields without a JRE	`pdfium` (`staplr` requires Java + pdftk)
Edit annotations (read + write)	`pdfium`
Programmatically build PDFs — any page count, vector paths, standard-font text, annotations	`pdfium` (also `minipdf` for a pure-R writer that additionally supports image embedding today)
Edit XMP metadata or bookmarks	`xmpdf` (orchestrates `exiftool` / `ghostscript` / `pdftk`)

What `pdfium` adds

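Five capabilities stand out. No other CRAN package offers the first three without a Java dependency; the last two overlap with existing packages but run in-process: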

1. Vector path geometry

library(pdfium)
doc <- pdf_doc_open("figure.pdf")
objs <- pdf_page_objects(pdf_page_load(doc, 1))
# Pick the first path object and read its segments.
i <- match(TRUE, vapply(objs, pdf_obj_type, "") == "path")
pdf_path_segments(objs[[i]])
#> # A tibble: 8 x 5
#>   segment_type     x     y close_figure cp1_x …
#>   <chr>        <dbl> <dbl> <lgl>        <dbl>
#> 1 moveto         100   100 FALSE           NA
#> 2 lineto         200   100 FALSE           NA
#> 3 bezierto       300   100 FALSE          150  …
#> ...

Stroke and fill colors, dash patterns, transformation matrices, draw modes, and clip paths are all surfaced via pdf_path_*() and pdf_obj_*(). No equivalent exists in pdftools (its Poppler bindings expose text, fonts, metadata, and rendered bitmaps, but not path objects), qpdf (lossless structural ops, no content access), or magick (rasterises through Ghostscript, so vector geometry is lost).
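Once the segments are in a data frame, downstream geometry is ordinary R. As a minimal sketch, the code below flattens a path into a polyline by sampling each cubic Bezier. It runs on a hand-built data frame shaped like the output above rather than on a real PDF. The columns `cp1_y`, `cp2_x`, and `cp2_y` are assumptions (the printed tibble is truncated after `cp1_x`), and `cubic_bezier()` and `flatten_path()` are hypothetical helpers, not part of pdfium.

```r
# Hand-built stand-in for pdf_path_segments() output.
# Column names beyond cp1_x are assumed for illustration.
segs <- data.frame(
  segment_type = c("moveto", "lineto", "bezierto"),
  x     = c(100, 200, 300),
  y     = c(100, 100, 100),
  cp1_x = c(NA, NA, 233),
  cp1_y = c(NA, NA, 150),
  cp2_x = c(NA, NA, 266),
  cp2_y = c(NA, NA, 150)
)

# Sample a cubic Bezier at n parameter values in (0, 1].
cubic_bezier <- function(p0, p1, p2, p3, n = 8) {
  t <- seq_len(n) / n
  # Bernstein basis, one row per sample.
  b <- cbind((1 - t)^3, 3 * (1 - t)^2 * t, 3 * (1 - t) * t^2, t^3)
  b %*% rbind(p0, p1, p2, p3)
}

# Hypothetical helper: turn segments into an (x, y) polyline matrix.
# Assumes the path starts with a moveto.
flatten_path <- function(segs, n = 8) {
  pts <- matrix(numeric(0), ncol = 2)
  for (i in seq_len(nrow(segs))) {
    end <- c(segs$x[i], segs$y[i])
    if (segs$segment_type[i] == "bezierto") {
      start <- pts[nrow(pts), ]
      cp1 <- c(segs$cp1_x[i], segs$cp1_y[i])
      cp2 <- c(segs$cp2_x[i], segs$cp2_y[i])
      pts <- rbind(pts, cubic_bezier(start, cp1, cp2, end, n))
    } else {
      pts <- rbind(pts, end)
    }
  }
  unname(pts)
}

poly <- flatten_path(segs)
```

The resulting matrix can go straight to `lines()` or to any polygon routine; raise `n` for smoother curves.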

2. AcroForm filling without Java

doc <- pdf_doc_open("application.pdf", readwrite = TRUE)
fields <- pdf_form_fields(doc)
by_name <- setNames(fields, vapply(fields, pdf_form_field_name, ""))
pdf_form_field_set_value(by_name[["full_name"]], "Ada Lovelace")
pdf_form_field_set_value(by_name[["subscribe"]], TRUE)
pdf_save(doc, "filled.pdf")

staplr is the only other CRAN package that can fill PDF forms, but it shells out to pdftk-java, which means installing a JRE + pdftk-java jar. pdfium’s form-fill API ships entirely as native code — no Java dependency.

3. Annotation authoring (full read + write)

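# "report.pdf" is a placeholder; open read-write so the annotation can be saved.
doc <- pdf_doc_open("report.pdf", readwrite = TRUE)
page <- pdf_page_load(doc, 1)
# Highlight annotation: bounding rectangle, colour, note text, quad.
hl <- pdf_annot_new(page, subtype = "highlight",
                    bounds = c(100, 700, 400, 720))
pdf_annot_set_color(hl, color = c(255, 240, 0))
pdf_annot_set_contents(hl, "Important")
pdf_annot_append_quad(hl, quad = c(100, 700, 400, 700,
                                    100, 720, 400, 720))
pdf_save(doc, "annotated.pdf")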

No other CRAN package surfaces annotations at all. The full list of supported subtypes lives in ?pdf_annot_new.
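In the example above, the quad covers the same rectangle as `bounds`. A small helper can derive one from the other, sketched here in base R. `quad_from_bounds()` is a hypothetical name, and the sketch assumes `bounds` is ordered left, bottom, right, top and that the quad lists corners in the order the example uses (bottom-left, bottom-right, top-left, top-right).

```r
# Hypothetical helper: derive a single quad from a bounds rectangle.
# Assumes bounds = c(left, bottom, right, top) and the corner order
# bottom-left, bottom-right, top-left, top-right.
quad_from_bounds <- function(bounds) {
  stopifnot(is.numeric(bounds), length(bounds) == 4)
  left <- bounds[1]
  bottom <- bounds[2]
  right <- bounds[3]
  top <- bounds[4]
  c(left, bottom, right, bottom,
    left, top,    right, top)
}

quad <- quad_from_bounds(c(100, 700, 400, 720))
```

A highlight that spans several lines of text would need one quad per line, so the helper would be called once per line rectangle.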

4. Structural mutation without Java or shell-outs

The classic R answers for page rotation, deletion, reordering, and merging are staplr (Java + pdftk) and, for part of that list, qpdf; language tagging falls to xmpdf (which orchestrates exiftool + ghostscript + pdftk) at the XMP level only. pdfium covers this surface, plus N-up imposition and page boxes, in-process with no external binaries:

Operation	`pdfium`	`qpdf`	`staplr`	`xmpdf`
Rotate page	`pdf_page_set_rotation()`	`pdf_rotate_pages()`	`rotate_pages()` (Java)	no
Delete page	`pdf_page_delete()`	`pdf_subset()` with negative page indices	`remove_pages()` (Java)	no
Reorder pages	`pdf_pages_reorder()`	manual `pdf_split()` + `pdf_combine()`	`select_pages()` (Java)	no
Merge documents	`pdf_docs_merge()`	`pdf_combine()`	`staple_pdf()` (Java)	no
N-up imposition	`pdf_n_up()`	no	no	no
Set crop / media / trim / bleed / art box	`pdf_page_set_box()`	no	no	no
Set `/Lang` (accessibility tag)	`pdf_doc_set_language()`	no	no	partial (XMP only)

# 4-up imposition of a long report onto US Letter sheets.
doc <- pdf_doc_open("long-report.pdf")
pdf_n_up(doc, "report-4up.pdf", cols = 2L, rows = 2L)

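# Move page 3 (the cover) to the front, then save in place.
doc <- pdf_doc_open("draft.pdf", readwrite = TRUE)
n <- pdf_page_count(doc)
# setdiff() keeps the remaining pages in order and stays correct for a
# 3-page document, where 4:n would count backwards.
pdf_pages_reorder(doc, new_order = c(3L, setdiff(seq_len(n), 3L)))
pdf_save(doc, "draft.pdf")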

# Tag the doc's primary language (improves screen-reader UX).
pdf_doc_set_language(doc, "en-US")
pdf_save(doc, "draft.pdf")

pdf_docs_merge() accepts a list of pdfium_doc handles or a list of paths, so you can stream-merge many files without keeping them all open at once. pdf_n_up() writes directly to disk via PDFium’s FPDF_ImportNPagesToOne — no intermediate render step, no Ghostscript subprocess.
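To estimate how many sheets an N-up job will produce, and where each source page lands, the arithmetic can be sketched in base R without touching a PDF. `n_up_layout()` is a hypothetical helper, and the row-major fill order (left to right, then top to bottom) is an assumption about the imposition rather than something this vignette documents.

```r
# Hypothetical helper: map source pages to output sheets and grid cells
# for a cols x rows imposition. Pure arithmetic, no PDF access.
n_up_layout <- function(n_pages, cols = 2L, rows = 2L) {
  per_sheet <- cols * rows
  page <- seq_len(n_pages)
  slot <- (page - 1L) %% per_sheet           # 0-based cell on the sheet
  data.frame(
    page  = page,
    sheet = (page - 1L) %/% per_sheet + 1L,
    row   = slot %/% cols + 1L,              # assumes row-major fill order
    col   = slot %% cols + 1L
  )
}

layout <- n_up_layout(10L, cols = 2L, rows = 2L)
n_sheets <- max(layout$sheet)
```

For a 10-page document at 2 x 2, the last sheet is only half full, which is worth knowing before sending a job to print.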

5. Programmatic PDF authoring (with v0.1.0 limits)

pdf_doc_new() plus the page-object creators (pdf_path_new(), pdf_rect_new(), pdf_text_new(), pdf_image_new(), pdf_font_load() / pdf_font_load_standard(), plus the path-geometry appenders) let you build PDFs from scratch in R — vector graphics, JPEG images, text in the 14 PDF standard fonts, and arbitrary TrueType / Type1 typefaces. The mutating-pdfs vignette walks through the workflow.

What scales fine. Page count is unlimited (PDFium handles thousand-page docs efficiently); objects-per-page are unlimited; the full vector-graphics surface (paths, Bezier curves, dash patterns, transformation matrices, blend modes, opacity, clip paths) is exposed; annotations are richly covered. The R-to-C boundary cost is microseconds per call, so 10⁶ object writes take seconds, not minutes.

v0.1.0 limits worth knowing about. Two authoring axes have real gaps in the current release, both blocked on upstream PDFium because the symbols don’t exist yet. Patches are drafted or planned (status per row below), but each must pass Google’s Gerrit review, land in a PDFium release, and propagate to a bblanchon binary before we can wrap it. The gaps are:

Gap	Missing PDFium symbol(s)	Workaround today
`/Info` dict writes	`FPDF_SetMetaText` — drafted patch awaiting Gerrit upload	Use `xmpdf` to patch the Info dict after `pdf_save()`
Encryption on save	`FPDF_SetEncryption` (listed as CL 5 in `dev/upstream-api-gaps.md`; not yet drafted)	Post-process with the `qpdf` command-line tool (`qpdf --encrypt`)

The full upstream-PDFium gap inventory lives in dev/upstream-api-gaps.md. The “what scales” claims above hold today; both limits have a known path to closure, on the differing timelines noted in the table.
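For the encryption gap, one post-processing route is the qpdf command-line tool, sketched below. This assumes a `qpdf` binary is on the PATH. `encrypt_args()` and `encrypt_pdf()` are hypothetical helper names, and the file names and passwords are placeholders.

```r
# Build the argument vector for:
#   qpdf --encrypt <user-pw> <owner-pw> <bits> -- <input> <output>
# shQuote() protects passwords and paths that contain spaces.
encrypt_args <- function(input, output, user_pw, owner_pw, bits = 256L) {
  c("--encrypt", shQuote(user_pw), shQuote(owner_pw), as.character(bits),
    "--", shQuote(input), shQuote(output))
}

# Hypothetical wrapper: run qpdf and fail loudly on a non-zero exit status.
encrypt_pdf <- function(input, output, user_pw, owner_pw) {
  qpdf_bin <- Sys.which("qpdf")
  if (!nzchar(qpdf_bin)) stop("qpdf command-line tool not found on PATH")
  status <- system2(qpdf_bin, encrypt_args(input, output, user_pw, owner_pw))
  if (status != 0L) stop("qpdf exited with status ", status)
  invisible(output)
}

# Inspect the arguments without running anything.
args <- encrypt_args("report.pdf", "report-encrypted.pdf",
                     "open-me", "owner-secret")
```

Keeping the argument builder separate from the `system2()` call makes the command easy to inspect or log before any file is written.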

Where `pdfium` deliberately doesn’t compete

Lossless compress / re-encode / linearise: qpdf is the right answer. It’s content-preserving and has been the de facto choice for years. pdfium’s structural mutation surface (see §4 above) overlaps on split / merge / reorder, but if your job is “compress this PDF” or “linearise for web view”, reach for qpdf::pdf_compress(), which takes a linearize argument for the web-view case.
Table extraction — tabulapdf (formerly tabulizer) has a decade of Tabula’s heuristics behind it. pdfium gives you text-with-bounds and path geometry — the primitives a future pure-R tabulapdf-style package could be built on — but doesn’t ship a table detector itself.
OCR and general image processing — magick is the right tool for the broader image-processing pipeline. pdfium::pdf_render_page() returns a pdfium_bitmap you can pass to magick::image_read() if you want to render with PDFium and then process with ImageMagick.
XMP metadata — xmpdf orchestrates exiftool / ghostscript / pdftk correctly and writes both XMP and the Info dictionary. pdfium only reads the Info dict in v0.1.0; XMP and Info-write remain xmpdf’s territory.
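For the lossless-compression case in the first item above, a minimal sketch with the qpdf package follows. It builds a throwaway PDF with base R so it is self-contained, and it skips the qpdf step when the package is not installed; the temporary file names are illustrative only.

```r
# Make a small throwaway PDF with base R graphics.
input <- tempfile(fileext = ".pdf")
grDevices::pdf(input)
graphics::plot(1:10, main = "demo")
invisible(grDevices::dev.off())

output <- tempfile(fileext = ".pdf")
if (requireNamespace("qpdf", quietly = TRUE)) {
  # linearize = TRUE also lays the file out for fast web view.
  qpdf::pdf_compress(input, output, linearize = TRUE)
  # A lossless rewrite must not change the page count.
  stopifnot(qpdf::pdf_length(output) == qpdf::pdf_length(input))
}
```

The same pattern works as a final step after `pdf_save()` when a pdfium-authored file needs to be compressed or linearised.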

Feature matrix at a glance

Feature	pdfium	pdftools	qpdf	magick	tabulapdf	staplr	xmpdf
Text content	yes	yes	no	no	partial	no	no
Text positioning	yes (float precision)	partial (int per token)	no	no	partial (table region only)	no	no
Font metadata	yes (per char)	partial (per token)	no	no	no	no	no
Render to bitmap	yes (PDFium)	yes (Poppler)	no	yes (Ghostscript)	no	no	no
Document metadata (read)	yes	yes	no	partial	no	no	yes
Document metadata (write)	partial (lang only)	no	no	no	no	no	yes
Page count / size	yes	yes	yes	yes	yes	yes	partial
Page rotation (read)	yes	no	no	no	no	yes	no
Page rotation (write)	yes	no	yes	no	no	yes (Java)	no
Page reorder / merge / split	yes	no	yes	no	no	yes (Java)	no
N-up imposition	yes	no	no	no	no	no	no
Page boxes (crop / trim / bleed / art)	yes	no	no	no	no	no	no
Document language (`/Lang`) write	yes	no	no	no	no	no	partial (XMP)
Path segments	yes	no	no	no	no (internal only)	no	no
Path style (stroke / fill / dash / matrix)	yes	no	no	no	no	no	no
Bezier control points	yes	no	no	no	no	no	no
Image XObject extraction	yes	no	no	no	no	no	no
Form XObjects	yes	no	no	no	no	no	no
Clip paths	yes	no	no	no	no	no	no
Structure tree (tagged PDF)	yes	no	no	no	no	no	no
Annotations (read)	yes	no	no	no	no	no	no
Annotations (write)	yes	no	no	no	no	no	no
Form fields (read)	yes	no	no	no	no	yes (Java)	no
Form fields (fill)	yes	no	no	no	no	yes (Java)	no
Page flatten	yes	no	no	no	no	no	no
Attachments (read)	yes	yes	no	no	no	no	no
Attachments (author)	yes	no	no	no	no	no	no
Signatures (read)	yes	no	no	no	no	no	no
Bookmarks (read)	yes	partial (toc)	no	no	no	no	yes
Bookmarks (write)	no	no	no	no	no	no	yes
Encryption / password	partial (open only)	yes	yes	no	partial	partial	no

Rows where pdfium is the only column reading “yes” are capabilities pdfium adds to the R ecosystem.

Switching from `pdftools`

The two packages overlap on text + render + metadata. The signatures are close enough that switching is mostly a find-and-replace:

`pdftools`	`pdfium`
`pdf_text(path)`	`pdf_doc_text(path)`
`pdf_info(path)`	`pdf_doc_info(path)` — or `pdf_doc_summary(path)` for a richer one-row tibble
`pdf_pagesize(path)`	`pdf_pages_summary(path)` (one row per page; also includes rotation + label)
`pdf_render_page(path, ...)`	`pdf_render_page(doc_or_path, ...)`
`pdf_data(path)`	`pdf_text_runs(page)`
`pdf_fonts(path)`	`pdf_doc_fonts(doc)`
`pdf_attachments(path)`	`pdf_attachments(doc)`

The biggest behavioural difference: pdftools opens a fresh document on every call, while pdfium expects you to open once (pdf_doc_open()) and pass the resulting handle to subsequent functions. The path-accepting convenience wrappers (pdf_doc_text(path), pdf_attachments(path), etc.) work the same way pdftools does, but they’re shortcuts — for any non-trivial workflow, hold onto the pdfium_doc handle.
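The open-once pattern might look like the sketch below. It uses only calls that appear elsewhere in this vignette (`pdf_doc_open()`, `pdf_page_count()`, `pdf_page_load()`, `pdf_text_runs()`), with the call shapes assumed from those examples. `all_text_runs()` is a hypothetical helper and the file name is a placeholder.

```r
# Hypothetical helper: open the document once and reuse the handle
# for every page, instead of reopening the file on each call.
all_text_runs <- function(path) {
  doc <- pdfium::pdf_doc_open(path)
  pages <- seq_len(pdfium::pdf_page_count(doc))
  lapply(pages, function(i) {
    page <- pdfium::pdf_page_load(doc, i)
    pdfium::pdf_text_runs(page)
  })
}

path <- "report.pdf"  # placeholder file name
if (requireNamespace("pdfium", quietly = TRUE) && file.exists(path)) {
  runs <- all_text_runs(path)  # one element per page
}
```

This is roughly the shape of `pdftools::pdf_data(path)`, which returns one data frame per page, so code that iterates over that list should port with little change.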