CRAN already has several PDF packages. This vignette helps you pick
the right one for the task — and explains where pdfium adds
new capability rather than duplicating existing work. A more detailed
contributor-facing inventory lives in
dev/r-pdf-ecosystem-survey.md.
TL;DR — which package for which job?
| Task | First-line package |
|---|---|
| Read text only (whole-page strings) | pdftools |
| Read text with per-token bounding boxes | pdftools::pdf_data() (Poppler-precision) or pdfium::pdf_text_runs() (PDFium-precision, plus font flags) |
| Render a page to a bitmap | pdftools::pdf_render_page() or pdfium::pdf_render_page() |
| Split / merge / compress lossless | qpdf (or cpp11qpdf) |
| OCR or general image-processing pipeline | magick |
| Extract a table from a PDF | tabulapdf |
| Inspect path geometry (segments, Bezier control points, stroke/fill, transform matrices) | pdfium (no other CRAN package surfaces this) |
| Fill AcroForm fields without a JRE | pdfium (staplr requires Java + pdftk) |
| Edit annotations (read + write) | pdfium |
| Programmatically build PDFs: any page count, vector paths, standard-font text, annotations | pdfium (also minipdf for a pure-R writer that additionally supports image embedding today) |
| Edit XMP metadata or bookmarks | xmpdf (orchestrates exiftool / ghostscript / pdftk) |
What pdfium adds
Five capability areas, the first three of which no other CRAN package surfaces today:
1. Vector path geometry
library(pdfium)
doc <- pdf_doc_open("figure.pdf")
objs <- pdf_page_objects(pdf_page_load(doc, 1))
# Pick the first path object and read its segments.
i <- match(TRUE, vapply(objs, pdf_obj_type, "") == "path")
pdf_path_segments(objs[[i]])
#> # A tibble: 8 x 5
#> segment_type x y close_figure cp1_x …
#> <chr> <dbl> <dbl> <lgl> <dbl>
#> 1 moveto 100 100 FALSE NA
#> 2 lineto 200 100 FALSE NA
#> 3 bezierto 300 100 FALSE 150 …
#> ...
Stroke / fill colors, dash patterns, transformation matrices, draw
modes, and clip paths are all surfaced via pdf_path_*() and
pdf_obj_*(). No equivalent exists in pdftools
(Poppler exposes text only), qpdf (lossless structural ops,
no content access), or magick (rasterises through
Ghostscript).
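As a hedged illustration of what the control-point columns make possible, the base-R sketch below flattens one cubic Bezier segment into a polyline. The helper name bezier_points() and the sample coordinates are illustrative, not part of pdfium, and it assumes cp1_y, cp2_x, and cp2_y columns sit alongside the cp1_x column shown in the output above.

```r
# Evaluate a cubic Bezier curve at n evenly spaced parameter values.
# p0 = start point, p1 / p2 = control points, p3 = end point; each is c(x, y).
bezier_points <- function(p0, p1, p2, p3, n = 20L) {
  t <- seq(0, 1, length.out = n)
  # Bernstein basis weights for a cubic curve, one row per parameter value.
  b <- cbind((1 - t)^3, 3 * (1 - t)^2 * t, 3 * (1 - t) * t^2, t^3)
  ctrl <- rbind(p0, p1, p2, p3)  # 4 x 2 matrix of control points
  pts <- b %*% ctrl              # n x 2 matrix of points on the curve
  colnames(pts) <- c("x", "y")
  pts
}

# Hypothetical segment: starts at (200, 100) and ends at (300, 100).
# The control points are made up for the example.
pts <- bezier_points(c(200, 100), c(150, 50), c(250, 150), c(300, 100), n = 5L)
```

The first and last rows of pts are the segment's endpoints, so consecutive segments from pdf_path_segments() could be chained by using each row's end point as the next segment's start.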
2. AcroForm filling without Java
doc <- pdf_doc_open("application.pdf", readwrite = TRUE)
fields <- pdf_form_fields(doc)
by_name <- setNames(fields, vapply(fields, pdf_form_field_name, ""))
pdf_form_field_set_value(by_name[["full_name"]], "Ada Lovelace")
pdf_form_field_set_value(by_name[["subscribe"]], TRUE)
pdf_save(doc, "filled.pdf")
staplr is the only other CRAN package that can fill PDF
forms, but it shells out to pdftk-java, which means
installing a JRE + pdftk-java jar. pdfium’s form-fill API
ships entirely as native code, with no Java dependency.
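When filling many fields at once, it can help to check the requested names against the form before writing anything. The following is a minimal base-R sketch of that check; plan_form_fill() and the sample field names are hypothetical and not part of pdfium.

```r
# Split a named list of desired values into fields that exist in the form
# and names that do not, so typos surface before any value is written.
plan_form_fill <- function(values, field_names) {
  stopifnot(!is.null(names(values)), all(nzchar(names(values))))
  known <- names(values) %in% field_names
  list(fill = values[known], unknown = names(values)[!known])
}

# Hypothetical field names, standing in for what pdf_form_field_name()
# might return for each field of a real form.
field_names <- c("full_name", "subscribe", "email")
plan <- plan_form_fill(
  list(full_name = "Ada Lovelace", subscribe = TRUE, e_mail = "ada@example.org"),
  field_names
)
plan$unknown  # names that did not match any field
```

Each entry of plan$fill could then be passed to pdf_form_field_set_value() in a loop, in the same way as the two calls in the example above.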
3. Annotation authoring (full read + write)
# Open for writing and load the page to annotate
# ("report.pdf" is a placeholder file name).
doc <- pdf_doc_open("report.pdf", readwrite = TRUE)
page <- pdf_page_load(doc, 1)
hl <- pdf_annot_new(page, subtype = "highlight",
                    bounds = c(100, 700, 400, 720))
pdf_annot_set_color(hl, color = c(255, 240, 0))
pdf_annot_set_contents(hl, "Important")
pdf_annot_append_quad(hl, quad = c(100, 700, 400, 700,
                                   100, 720, 400, 720))
pdf_save(doc, "annotated.pdf")
No other CRAN package surfaces annotations at all. The full list of
supported subtypes lives in ?pdf_annot_new.
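In the highlight example above, the quad repeats the corners of the bounds rectangle. A small helper can derive one from the other; this is a sketch only. bounds_to_quad() is a hypothetical name, and the corner order (bottom-left, bottom-right, top-left, top-right) is inferred from that single example rather than from the package documentation.

```r
# Convert bounds c(left, bottom, right, top) into the eight quad
# coordinates, in the corner order used by the highlight example:
# bottom-left, bottom-right, top-left, top-right.
bounds_to_quad <- function(bounds) {
  stopifnot(is.numeric(bounds), length(bounds) == 4L)
  left <- bounds[1]
  bottom <- bounds[2]
  right <- bounds[3]
  top <- bounds[4]
  c(left, bottom, right, bottom,
    left, top, right, top)
}

quad <- bounds_to_quad(c(100, 700, 400, 720))
```

For a highlight spanning several lines of text, one quad per line would typically be appended, each derived from that line's own bounds.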
4. Structural mutation without Java or shell-outs
The classic R answers for page rotation / N-up imposition / delete +
reorder / language tagging are staplr (Java + pdftk) and
xmpdf (orchestrates exiftool +
ghostscript + pdftk). pdfium
covers the same surface in-process, no external binaries:
| Operation | pdfium | qpdf | staplr | xmpdf |
|---|---|---|---|---|
| Rotate page | pdf_page_set_rotation() | no | rotate_pages() (Java) | no |
| Delete page | pdf_page_delete() | pdf_split() + cherry-pick | remove_pages() (Java) | no |
| Reorder pages | pdf_pages_reorder() | manual pdf_split() + pdf_combine() | select_pages() (Java) | no |
| Merge documents | pdf_docs_merge() | pdf_combine() | combine_pdfs() (Java) | no |
| N-up imposition | pdf_n_up() | no | no | no |
| Set crop / media / trim / bleed / art box | pdf_page_set_box() | no | no | no |
| Set /Lang (accessibility tag) | pdf_doc_set_language() | no | no | partial (XMP only) |
# 4-up imposition of a long report onto US Letter sheets.
doc <- pdf_doc_open("long-report.pdf")
pdf_n_up(doc, "report-4up.pdf", cols = 2L, rows = 2L)
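# Reorder so page 3 (the cover) lands first, then save in place.
# setdiff() keeps the remaining pages in order and, unlike 4:n, stays
# valid when the document has fewer than four pages.
doc <- pdf_doc_open("draft.pdf", readwrite = TRUE)
n <- pdf_page_count(doc)
pdf_pages_reorder(doc, new_order = c(3L, setdiff(seq_len(n), 3L)))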
pdf_save(doc, "draft.pdf")
# Tag the doc's primary language (improves screen-reader UX).
pdf_doc_set_language(doc, "en-US")
pdf_save(doc, "draft.pdf")
pdf_docs_merge() accepts a list of
pdfium_doc handles or a list of paths, so you can
stream-merge many files without keeping them all open at once.
pdf_n_up() writes directly to disk via PDFium’s
FPDF_ImportNPagesToOne, with no intermediate render step and no
Ghostscript subprocess.
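To reason about what an N-up job will produce before running it, the grid arithmetic can be sketched in base R. The helper n_up_layout() is hypothetical, and the row-major fill order (left to right, then top to bottom) is an assumption; the placement PDFium actually uses may differ.

```r
# For each source page, compute the output sheet and grid cell it would
# occupy under an assumed row-major fill order.
n_up_layout <- function(n_pages, cols, rows) {
  stopifnot(n_pages >= 1, cols >= 1, rows >= 1)
  per_sheet <- cols * rows
  page <- seq_len(n_pages)
  slot <- (page - 1L) %% per_sheet  # 0-based slot within the sheet
  data.frame(
    page  = page,
    sheet = (page - 1L) %/% per_sheet + 1L,
    row   = slot %/% cols + 1L,
    col   = slot %% cols + 1L
  )
}

# A 10-page document imposed 4-up (2 x 2) needs three sheets;
# the last sheet is only half full.
layout <- n_up_layout(10L, cols = 2L, rows = 2L)
max(layout$sheet)
```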
5. Programmatic PDF authoring (with v0.1.0 limits)
pdf_doc_new() plus the page-object creators
(pdf_path_new(), pdf_rect_new(),
pdf_text_new(), pdf_image_new(),
pdf_font_load() / pdf_font_load_standard(),
plus the path-geometry appenders) let you build PDFs from scratch in R —
vector graphics, JPEG images, text in the 14 PDF standard fonts, and
arbitrary TrueType / Type1 typefaces. The mutating-pdfs vignette walks
through the workflow.
What scales fine. Page count is unlimited (PDFium handles thousand-page docs efficiently), and so is the number of objects per page. The full vector-graphics surface is exposed: paths, Bezier curves, dash patterns, transformation matrices, blend modes, opacity, and clip paths. Annotations are richly covered. The R↔C boundary costs microseconds per call, so 10⁶ object writes take seconds, not minutes.
v0.1.0 limits worth knowing about. Two authoring
axes have real gaps in the current release, both blocked on upstream
PDFium. The symbols don’t exist yet: we’ve proposed them, but they need
to pass Google’s Gerrit review cycle, land in a PDFium release,
and propagate to a bblanchon binary before we can wrap
them.
| Gap | Missing PDFium symbol(s) | Workaround today |
|---|---|---|
| /Info dict writes | FPDF_SetMetaText (drafted patch awaiting Gerrit upload) | Use xmpdf to patch the Info dict after pdf_save() |
| Encryption on save | FPDF_SetEncryption (listed as CL 5 in dev/upstream-api-gaps.md; not yet drafted) | Use qpdf::pdf_encrypt() as a post-process step |
The full upstream-PDFium gap inventory lives in dev/upstream-api-gaps.md.
The “what scales” claims above hold today; each limit has a known
path to closure, at the stage noted in its table row.
Where pdfium deliberately doesn’t compete
- Lossless compress / re-encode / linearise: qpdf is the right answer. It’s content-preserving, doesn’t re-encode streams, and has been the de facto choice for years. pdfium’s structural mutation surface (see §4 above) overlaps on split / merge / reorder, but if your job is “compress this PDF” or “linearise for web view”, reach for qpdf::pdf_compress() / qpdf::pdf_optimize().
- Table extraction: tabulapdf (formerly tabulizer) has a decade of Tabula’s heuristics behind it. pdfium gives you text-with-bounds and path geometry (the primitives a future pure-R tabulapdf-style package could be built on) but doesn’t ship a table detector itself.
- OCR and general image processing: magick is the right tool for the broader image-processing pipeline. pdfium::pdf_render_page() returns a pdfium_bitmap you can pass to magick::image_read() if you want to render with PDFium and then process with ImageMagick.
- XMP metadata: xmpdf orchestrates exiftool / ghostscript / pdftk correctly and writes both XMP and the Info dictionary. pdfium only reads the Info dict in v0.1.0; XMP and Info-write remain xmpdf’s territory.
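To make the "primitives" remark in the table-extraction item concrete, here is a minimal base-R sketch of the first step a table detector would take: grouping positioned tokens into visual rows. It assumes a data frame with x, y, and text columns (the layout pdftools::pdf_data() uses, with y growing downward); group_tokens_into_rows(), the tolerance, and the sample data are all illustrative.

```r
# Group tokens whose y coordinates lie within `tol` of the previous token
# into one row, then read each row left to right.
group_tokens_into_rows <- function(tokens, tol = 2) {
  tokens <- tokens[order(tokens$y), ]
  # Start a new row whenever y jumps by more than `tol`.
  tokens$row <- cumsum(c(TRUE, diff(tokens$y) > tol))
  tokens <- tokens[order(tokens$row, tokens$x), ]
  unname(vapply(split(tokens$text, tokens$row), paste, character(1),
                collapse = " "))
}

# Made-up token positions with slight vertical jitter, as real PDFs have.
tokens <- data.frame(
  x = c(50, 120, 50, 120, 190),
  y = c(100, 101, 130, 130, 129),
  text = c("Name", "Age", "Ada", "36", "London"),
  stringsAsFactors = FALSE
)
rows <- group_tokens_into_rows(tokens)
```

A real detector would go on to cluster x positions into columns and use ruling lines from the path geometry; this sketch covers only the row-grouping step.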
Feature matrix at a glance
| Feature | pdfium | pdftools | qpdf | magick | tabulapdf | staplr | xmpdf |
|---|---|---|---|---|---|---|---|
| Text content | yes | yes | no | no | partial | no | no |
| Text positioning | yes (float precision) | partial (int per token) | no | no | partial (table region only) | no | no |
| Font metadata | yes (per char) | partial (per token) | no | no | no | no | no |
| Render to bitmap | yes (PDFium) | yes (Poppler) | no | yes (Ghostscript) | no | no | no |
| Document metadata (read) | yes | yes | no | partial | no | no | yes |
| Document metadata (write) | partial (lang only) | no | no | no | no | no | yes |
| Page count / size | yes | yes | yes | yes | yes | yes | partial |
| Page rotation (read) | yes | no | no | no | no | yes | no |
| Page rotation (write) | yes | no | no | no | no | yes (Java) | no |
| Page reorder / merge / split | yes | no | yes | no | no | yes (Java) | no |
| N-up imposition | yes | no | no | no | no | no | no |
| Page boxes (crop / trim / bleed / art) | yes | no | no | no | no | no | no |
| Document language (/Lang) write | yes | no | no | no | no | no | partial (XMP) |
| Path segments | yes | no | no | no | no (internal only) | no | no |
| Path style (stroke / fill / dash / matrix) | yes | no | no | no | no | no | no |
| Bezier control points | yes | no | no | no | no | no | no |
| Image XObject extraction | yes | no | no | no | no | no | no |
| Form XObjects | yes | no | no | no | no | no | no |
| Clip paths | yes | no | no | no | no | no | no |
| Structure tree (tagged PDF) | yes | no | no | no | no | no | no |
| Annotations (read) | yes | no | no | no | no | no | no |
| Annotations (write) | yes | no | no | no | no | no | no |
| Form fields (read) | yes | no | no | no | no | yes (Java) | no |
| Form fields (fill) | yes | no | no | no | no | yes (Java) | no |
| Page flatten | yes | no | no | no | no | no | no |
| Attachments (read) | yes | yes | no | no | no | no | no |
| Attachments (author) | yes | no | no | no | no | no | no |
| Signatures (read) | yes | no | no | no | no | no | no |
| Bookmarks (read) | yes | partial (toc) | no | no | no | no | yes |
| Bookmarks (write) | no | no | no | no | no | no | yes |
| Encryption / password | partial (open only) | yes | yes | no | partial | partial | no |
Bold rows are capabilities pdfium adds to the R
ecosystem.
Switching from pdftools
The two packages overlap on text + render + metadata. The signatures are close enough that switching is mostly a find-and-replace:
| pdftools | pdfium |
|---|---|
| pdf_text(path) | pdf_doc_text(path) |
| pdf_info(path) | pdf_doc_info(path), or pdf_doc_summary(path) for a richer one-row tibble |
| pdf_pagesize(path) | pdf_pages_summary(path) (one row per page; also includes rotation + label) |
| pdf_render_page(path, ...) | pdf_render_page(doc_or_path, ...) |
| pdf_data(path) | pdf_text_runs(page) |
| pdf_fonts(path) | pdf_doc_fonts(doc) |
| pdf_attachments(path) | pdf_attachments(doc) |
The biggest behavioural difference: pdftools opens a
fresh document on every call, while pdfium expects you to
open once (pdf_doc_open()) and pass the resulting handle to
subsequent functions. The path-accepting convenience wrappers
(pdf_doc_text(path), pdf_attachments(path),
etc.) work the same way pdftools does, but they’re
shortcuts — for any non-trivial workflow, hold onto the
pdfium_doc handle.