
CRAN already has several PDF packages. This vignette helps you pick the right one for the task — and explains where pdfium adds new capability rather than duplicating existing work. A more detailed contributor-facing inventory lives in dev/r-pdf-ecosystem-survey.md.

TL;DR — which package for which job?

| Task | First-line package |
|---|---|
| Read text only (whole-page strings) | pdftools |
| Read text with per-token bounding boxes | pdftools::pdf_data() (Poppler-precision) or pdfium::pdf_text_runs() (PDFium-precision, plus font flags) |
| Render a page to a bitmap | pdftools::pdf_render_page() or pdfium::pdf_render_page() |
| Split / merge / compress losslessly | qpdf (or cpp11qpdf) |
| OCR or general image-processing pipeline | magick |
| Extract a table from a PDF | tabulapdf |
| Inspect path geometry (segments, Bezier control points, stroke/fill, transform matrices) | pdfium (no other CRAN package surfaces this) |
| Fill AcroForm fields without a JRE | pdfium (staplr requires Java + pdftk) |
| Edit annotations (read + write) | pdfium |
| Programmatically build PDFs (any page count, vector paths, standard-font text, annotations) | pdfium (also minipdf for a pure-R writer that additionally supports image embedding today) |
| Edit XMP metadata or bookmarks | xmpdf (orchestrates exiftool / ghostscript / pdftk) |

What pdfium adds

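Sections 1 to 3 cover capabilities no other CRAN package surfaces today. Sections 4 and 5 cover ground that other packages also touch, where pdfium takes a different approach: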

1. Vector path geometry

library(pdfium)
doc <- pdf_doc_open("figure.pdf")
objs <- pdf_page_objects(pdf_page_load(doc, 1))
# Pick the first path object and read its segments.
i <- match(TRUE, vapply(objs, pdf_obj_type, "") == "path")
pdf_path_segments(objs[[i]])
#> # A tibble: 8 x 5
#>   segment_type     x     y close_figure cp1_x …
#>   <chr>        <dbl> <dbl> <lgl>        <dbl>
#> 1 moveto         100   100 FALSE           NA
#> 2 lineto         200   100 FALSE           NA
#> 3 bezierto       300   100 FALSE          150  …
#> ...

Stroke / fill colors, dash patterns, transformation matrices, draw modes, and clip paths are all surfaced via pdf_path_*() and pdf_obj_*(). No equivalent exists in pdftools (Poppler exposes text only), qpdf (lossless structural ops, no content access), or magick (rasterises through Ghostscript).
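Once the segments are in a data frame, the control points are enough to reconstruct the curve. The following is a minimal base-R sketch, not part of pdfium. It assumes a cubic bezierto row carries two control points in columns named cp1_x, cp1_y, cp2_x and cp2_y (only cp1_x is visible in the truncated output above; the other names and the sample values are assumptions), and that the curve starts where the previous segment ended.

```r
# Evaluate a cubic Bezier curve at parameter values t in [0, 1].
# p0 = start point, p1 / p2 = control points, p3 = end point (each c(x, y)).
bezier_points <- function(p0, p1, p2, p3, t = seq(0, 1, length.out = 11)) {
  b0 <- (1 - t)^3
  b1 <- 3 * (1 - t)^2 * t
  b2 <- 3 * (1 - t) * t^2
  b3 <- t^3
  data.frame(
    t = t,
    x = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1],
    y = b0 * p0[2] + b1 * p1[2] + b2 * p2[2] + b3 * p3[2]
  )
}

# Hand-built stand-in for two rows of pdf_path_segments() output.
# Column names other than cp1_x are hypothetical.
segs <- data.frame(
  segment_type = c("lineto", "bezierto"),
  x = c(200, 300), y = c(100, 100),
  cp1_x = c(NA, 150), cp1_y = c(NA, 150),
  cp2_x = c(NA, 250), cp2_y = c(NA, 150)
)

i <- which(segs$segment_type == "bezierto")[1]
curve <- bezier_points(
  p0 = c(segs$x[i - 1], segs$y[i - 1]),  # previous segment's end point
  p1 = c(segs$cp1_x[i], segs$cp1_y[i]),
  p2 = c(segs$cp2_x[i], segs$cp2_y[i]),
  p3 = c(segs$x[i], segs$y[i])
)
```

The resulting curve data frame holds sampled points that can be handed to lines() or any other plotting call.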

2. AcroForm filling without Java

doc <- pdf_doc_open("application.pdf", readwrite = TRUE)
fields <- pdf_form_fields(doc)
by_name <- setNames(fields, vapply(fields, pdf_form_field_name, ""))
pdf_form_field_set_value(by_name[["full_name"]], "Ada Lovelace")
pdf_form_field_set_value(by_name[["subscribe"]], TRUE)
pdf_save(doc, "filled.pdf")

staplr is the only other CRAN package that can fill PDF forms, but it shells out to pdftk-java, which means installing a JRE + pdftk-java jar. pdfium’s form-fill API ships entirely as native code — no Java dependency.
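Assuming fields is a plain R list, by_name[["..."]] returns NULL for a name that is not present, so a typo in a field name would only surface later, inside the setter. A small base-R guard (a hypothetical helper, not part of pdfium) can catch that before any value is written:

```r
# Stop early if any value is addressed to a field name the form doesn't have.
check_field_names <- function(values, field_names) {
  unknown <- setdiff(names(values), field_names)
  if (length(unknown) > 0) {
    stop("No form field named: ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(values)
}

values <- list(full_name = "Ada Lovelace", subscribe = TRUE)
# Stand-in for names(by_name); the "email" field is made up for illustration.
field_names <- c("full_name", "subscribe", "email")
check_field_names(values, field_names)
```

In a real session, names(by_name) would take the place of field_names.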

3. Annotation authoring (full read + write)

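# Open for writing and load the page to annotate ("report.pdf" is a placeholder).
doc <- pdf_doc_open("report.pdf", readwrite = TRUE)
page <- pdf_page_load(doc, 1)
hl <- pdf_annot_new(page, subtype = "highlight",
                    bounds = c(100, 700, 400, 720))
pdf_annot_set_color(hl, color = c(255, 240, 0))
pdf_annot_set_contents(hl, "Important")
pdf_annot_append_quad(hl, quad = c(100, 700, 400, 700,
                                    100, 720, 400, 720))
pdf_save(doc, "annotated.pdf")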

No other CRAN package surfaces annotations at all. The full list of supported subtypes lives in ?pdf_annot_new.
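In the highlight example the quad is the same rectangle as the bounds, written out as four corner points. Here is a base-R sketch of that conversion, assuming bounds is ordered c(left, bottom, right, top) and the quad runs bottom-left, bottom-right, top-left, top-right. Both orderings are inferred from the example values rather than from the package documentation, and bounds_to_quad() is a hypothetical helper.

```r
# Expand a bounding box into the eight quad-point coordinates.
bounds_to_quad <- function(bounds) {
  stopifnot(is.numeric(bounds), length(bounds) == 4)
  left <- bounds[1]; bottom <- bounds[2]; right <- bounds[3]; top <- bounds[4]
  stopifnot(left <= right, bottom <= top)
  c(left, bottom, right, bottom,   # bottom edge, left to right
    left, top, right, top)         # top edge, left to right
}

quad <- bounds_to_quad(c(100, 700, 400, 720))
```

With these inputs quad matches the literal vector passed to pdf_annot_append_quad() in the example.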

4. Structural mutation without Java or shell-outs

The classic R answers for page rotation / N-up imposition / delete + reorder / language tagging are staplr (Java + pdftk) and xmpdf (orchestrates exiftool + ghostscript + pdftk). pdfium covers the same surface in-process, no external binaries:

| Operation | pdfium | qpdf | staplr | xmpdf |
|---|---|---|---|---|
| Rotate page | pdf_page_set_rotation() | pdf_rotate_pages() | rotate_pages() (Java) | no |
| Delete page | pdf_page_delete() | pdf_subset() with negative page numbers | remove_pages() (Java) | no |
| Reorder pages | pdf_pages_reorder() | manual pdf_split() + pdf_combine() | select_pages() (Java) | no |
| Merge documents | pdf_docs_merge() | pdf_combine() | combine_pdfs() (Java) | no |
| N-up imposition | pdf_n_up() | no | no | no |
| Set crop / media / trim / bleed / art box | pdf_page_set_box() | no | no | no |
| Set /Lang (accessibility tag) | pdf_doc_set_language() | no | no | partial (XMP only) |

# 4-up imposition of a long report onto US Letter sheets.
doc <- pdf_doc_open("long-report.pdf")
pdf_n_up(doc, "report-4up.pdf", cols = 2L, rows = 2L)

# Reorder so the cover page lands first, then save in place.
doc <- pdf_doc_open("draft.pdf", readwrite = TRUE)
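# seq_len() + setdiff() stays valid for short documents, where 4:n would count down.
n <- pdf_page_count(doc)
pdf_pages_reorder(doc, new_order = c(3L, setdiff(seq_len(n), 3L)))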
pdf_save(doc, "draft.pdf")

# Tag the doc's primary language (improves screen-reader UX).
pdf_doc_set_language(doc, "en-US")
pdf_save(doc, "draft.pdf")

pdf_docs_merge() accepts a list of pdfium_doc handles or a list of paths, so you can stream-merge many files without keeping them all open at once. pdf_n_up() writes directly to disk via PDFium’s FPDF_ImportNPagesToOne — no intermediate render step, no Ghostscript subprocess.
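For planning an imposition job, the output sheet count follows directly from the grid size. This is a small base-R sketch of the arithmetic; n_up_sheets() is a hypothetical helper, not a pdfium function.

```r
# Number of output sheets when n_pages are placed on a cols x rows grid.
n_up_sheets <- function(n_pages, cols, rows) {
  stopifnot(n_pages >= 0, cols >= 1, rows >= 1)
  as.integer(ceiling(n_pages / (cols * rows)))
}

sheets <- n_up_sheets(10, cols = 2L, rows = 2L)  # 10 pages on a 2 x 2 grid
unused_slots <- sheets * 2L * 2L - 10L           # grid cells with no source page
```

A 10-page input on a 2 x 2 grid needs 3 sheets, and 2 cells on the last sheet have no source page.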

5. Programmatic PDF authoring (with v0.1.0 limits)

pdf_doc_new() plus the page-object creators (pdf_path_new(), pdf_rect_new(), pdf_text_new(), pdf_image_new(), pdf_font_load() / pdf_font_load_standard(), plus the path-geometry appenders) let you build PDFs from scratch in R — vector graphics, JPEG images, text in the 14 PDF standard fonts, and arbitrary TrueType / Type1 typefaces. The mutating-pdfs vignette walks through the workflow.

What scales fine. Page count is unlimited (PDFium handles thousand-page docs efficiently); objects-per-page are unlimited; the full vector-graphics surface (paths, Bezier curves, dash patterns, transformation matrices, blend modes, opacity, clip paths) is exposed; annotations are richly covered. The R↔C boundary cost is microseconds per call, so 10⁶ object writes take seconds, not minutes.
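The "seconds, not minutes" estimate is simple arithmetic. The per-call overheads below are assumed round numbers for illustration, not measurements:

```r
n_calls <- 1e6
per_call_us <- c(1, 5, 20)                    # assumed overhead per call, in microseconds
total_seconds <- n_calls * per_call_us / 1e6  # microseconds -> seconds
data.frame(per_call_us, total_seconds)
```

Even at an assumed 20 microseconds per call, a million writes add up to 20 seconds of boundary overhead.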

v0.1.0 limits worth knowing about. Two authoring axes have real gaps in the current release. Both are blocked on upstream PDFium: the symbols don’t exist yet, and any patch that adds them has to pass Google’s Gerrit review, land in a PDFium release, and propagate to a bblanchon binary before we can wrap them.

| Gap | Missing PDFium symbol(s) | Workaround today |
|---|---|---|
| /Info dict writes | FPDF_SetMetaText (drafted patch awaiting Gerrit upload) | Use xmpdf to patch the Info dict after pdf_save() |
| Encryption on save | FPDF_SetEncryption (listed as CL 5 in dev/upstream-api-gaps.md; not yet drafted) | Use qpdf::pdf_encrypt() as a post-process step |

The full upstream-PDFium gap inventory lives in dev/upstream-api-gaps.md. The “what scales” claims above hold today; both limits have a known path to closure, at the different stages noted in the table above.

Where pdfium deliberately doesn’t compete

  • Lossless compress / re-encode / linearise: qpdf is the right answer. It’s content-preserving, doesn’t re-encode streams, and has been the de facto choice for years. pdfium’s structural mutation surface (see §4 above) overlaps on split / merge / reorder, but if your job is “compress this PDF” or “linearise for web view”, reach for qpdf::pdf_compress() (its linearize argument covers the web-view case).
  • Table extraction: tabulapdf (formerly tabulizer) has a decade of Tabula’s heuristics behind it. pdfium gives you text-with-bounds and path geometry (the primitives a future pure-R tabulapdf-style package could be built on) but doesn’t ship a table detector itself.
  • OCR and general image processing: magick is the right tool for the broader image-processing pipeline. pdfium::pdf_render_page() returns a pdfium_bitmap you can pass to magick::image_read() if you want to render with PDFium and then process with ImageMagick.
  • XMP metadata: xmpdf orchestrates exiftool / ghostscript / pdftk correctly and writes both XMP and the Info dictionary. pdfium only reads the Info dict in v0.1.0; XMP and Info-write remain xmpdf’s territory.

Feature matrix at a glance

| Feature | pdfium | pdftools | qpdf | magick | tabulapdf | staplr | xmpdf |
|---|---|---|---|---|---|---|---|
| Text content | yes | yes | no | no | partial | no | no |
| Text positioning | yes (float precision) | partial (int per token) | no | no | partial (table region only) | no | no |
| Font metadata | yes (per char) | partial (per token) | no | no | no | no | no |
| Render to bitmap | yes (PDFium) | yes (Poppler) | no | yes (Ghostscript) | no | no | no |
| Document metadata (read) | yes | yes | no | partial | no | no | yes |
| Document metadata (write) | partial (lang only) | no | no | no | no | no | yes |
| Page count / size | yes | yes | yes | yes | yes | yes | partial |
| Page rotation (read) | yes | no | no | no | no | yes | no |
| Page rotation (write) | yes | no | yes | no | no | yes (Java) | no |
| Page reorder / merge / split | yes | no | yes | no | no | yes (Java) | no |
| N-up imposition | yes | no | no | no | no | no | no |
| Page boxes (crop / trim / bleed / art) | yes | no | no | no | no | no | no |
| Document language (/Lang) write | yes | no | no | no | no | no | partial (XMP) |
| Path segments | yes | no | no | no | no (internal only) | no | no |
| Path style (stroke / fill / dash / matrix) | yes | no | no | no | no | no | no |
| Bezier control points | yes | no | no | no | no | no | no |
| Image XObject extraction | yes | no | no | no | no | no | no |
| Form XObjects | yes | no | no | no | no | no | no |
| Clip paths | yes | no | no | no | no | no | no |
| Structure tree (tagged PDF) | yes | no | no | no | no | no | no |
| Annotations (read) | yes | no | no | no | no | no | no |
| Annotations (write) | yes | no | no | no | no | no | no |
| Form fields (read) | yes | no | no | no | no | yes (Java) | no |
| Form fields (fill) | yes | no | no | no | no | yes (Java) | no |
| Page flatten | yes | no | no | no | no | no | no |
| Attachments (read) | yes | yes | no | no | no | no | no |
| Attachments (author) | yes | no | no | no | no | no | no |
| Signatures (read) | yes | no | no | no | no | no | no |
| Bookmarks (read) | yes | partial (toc) | no | no | no | no | yes |
| Bookmarks (write) | no | no | no | no | no | no | yes |
| Encryption / password | partial (open only) | yes | yes | no | partial | partial | no |

Bold rows are capabilities pdfium adds to the R ecosystem.

Switching from pdftools

The two packages overlap on text + render + metadata. The signatures are close enough that switching is mostly a find-and-replace:

| pdftools | pdfium |
|---|---|
| pdf_text(path) | pdf_doc_text(path) |
| pdf_info(path) | pdf_doc_info(path), or pdf_doc_summary(path) for a richer one-row tibble |
| pdf_pagesize(path) | pdf_pages_summary(path) (one row per page; also includes rotation + label) |
| pdf_render_page(path, ...) | pdf_render_page(doc_or_path, ...) |
| pdf_data(path) | pdf_text_runs(page) |
| pdf_fonts(path) | pdf_doc_fonts(doc) |
| pdf_attachments(path) | pdf_attachments(doc) |

The biggest behavioural difference: pdftools opens a fresh document on every call, while pdfium expects you to open once (pdf_doc_open()) and pass the resulting handle to subsequent functions. The path-accepting convenience wrappers (pdf_doc_text(path), pdf_attachments(path), etc.) work the same way pdftools does, but they’re shortcuts — for any non-trivial workflow, hold onto the pdfium_doc handle.
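The open-once pattern might look like the sketch below when porting a pdftools::pdf_data() call, which returns one element per page. It uses only function names that appear in this vignette; the exact argument order and return types are assumptions, and text_runs_all_pages() is a hypothetical wrapper.

```r
# Collect text runs for every page through a single open handle.
text_runs_all_pages <- function(path) {
  doc <- pdfium::pdf_doc_open(path)   # open the document once
  n <- pdfium::pdf_page_count(doc)
  lapply(seq_len(n), function(i) {
    # Reuse the same handle for each page instead of reopening the file.
    pdfium::pdf_text_runs(pdfium::pdf_page_load(doc, i))
  })
}
```

The result is a list with one element per page, which mirrors the per-page list that pdftools::pdf_data() returns.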