Returns one row per text page-object on the page, with the text
content, bounding box, font size, and 1-based page-object index.
Loads PDFium's per-page text-extraction context
(FPDFText_LoadPage) once and reuses it across every text
object on the page; this is materially faster than calling
pdf_text_content() in a loop, which opens and closes a text
page per object.
Arguments
- page
A pdfium_page from pdf_page_load(), or a pdfium_doc (in which case the first page is loaded and closed automatically).
- page_num
One-based page index. Only used when page is a pdfium_doc. Ignored otherwise.
Value
A tibble with columns:
- obj_index: 1-based page-object index (so this row is the obj_index-th object returned by pdf_page_objects()). Renamed from text_index in the v0.1.0 reader/writer audit to avoid colliding with pdf_text_chars()$text_index, which is the extractable-text offset.
- bounds_left, bounds_bottom, bounds_right, bounds_top: the object's bounding box in PDF points.
- font_size: typographic em size; multiply by the text object's matrix scale (when available) for rendered size.
- text: UTF-8 string.
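As a rough illustration of how the columns above might be used, the sketch below derives box dimensions and a rendered font size from a hand-built data frame that mimics the documented schema. It does not call the package: the data values are made up, and matrix_scale is a hypothetical per-object scale supplied by the caller, not a column of the returned tibble. It assumes the usual PDF convention that bounds_top is greater than bounds_bottom.

```r
# Stand-in for a pdf_text_runs() result; every value here is invented.
runs <- data.frame(
  obj_index     = c(1L, 3L),
  bounds_left   = c(72, 72),
  bounds_bottom = c(700, 680),
  bounds_right  = c(216, 180),
  bounds_top    = c(712, 690),
  font_size     = c(12, 10),
  text          = c("Hello", "world"),
  stringsAsFactors = FALSE
)

# Box dimensions in PDF points (1 point = 1/72 inch).
runs$width_pt  <- runs$bounds_right - runs$bounds_left
runs$height_pt <- runs$bounds_top - runs$bounds_bottom
runs$width_in  <- runs$width_pt / 72

# Hypothetical per-object matrix scale (assumption: obtained elsewhere).
matrix_scale <- c(1, 1.5)
runs$rendered_size <- runs$font_size * matrix_scale

runs[, c("obj_index", "text", "width_pt", "height_pt", "rendered_size")]
```

Because obj_index refers to positions in the full page-object list, gaps between consecutive values (1 and 3 here) would simply mean the intervening objects are not text objects.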
Details
The returned tibble's schema matches the text_runs attribute
produced by pdf_extract_paths().
Examples
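# Locate the bundled fixture; system.file() returns "" if it is missing.
fixture <- system.file("extdata", "fixtures", "unicode.pdf",
  package = "pdfium"
)
if (nzchar(fixture)) {
  doc <- pdf_doc_open(fixture)
  # Passing a pdfium_doc: page 1 is loaded and closed automatically.
  pdf_text_runs(doc, 1)
  pdf_doc_close(doc)
}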