Returns one tibble row per character on the page, with the
character's Unicode codepoint and UTF-8 form, glyph bounding
box, effective font size, and two PDF flags indicating
"generated" characters (whitespace PDFium inferred between
positioned glyphs) and end-of-line "soft" hyphens. Wraps
FPDFText_LoadPage plus FPDFText_CountChars /
_GetUnicode / _GetCharBox / _GetFontSize /
_IsGenerated / _IsHyphen.
Arguments
- page
A pdfium_page from pdf_page_load(), or a pdfium_doc.
- page_num
One-based page index. Only used when page is a pdfium_doc; ignored otherwise.
Value
A tibble with columns:
- char_index (integer): 1-based position in the page's character stream.
- codepoint (integer): Unicode code point.
- char (character): UTF-8 character; empty for surrogate halves or PDFium's NUL sentinel.
- bounds_left, bounds_bottom, bounds_right, bounds_top: glyph bounding box in PDF user space.
- font_size (numeric): effective glyph height in user-space points (the run's font size times the text matrix scale).
- is_generated (logical): TRUE for whitespace PDFium synthesised between positioned glyphs (the source PDF does not carry a character there; PDFium infers one for text-extraction consumers).
- is_hyphen (logical): TRUE for end-of-line soft hyphens.
- origin_x, origin_y: the character's glyph origin point in PDF user space (FPDFText_GetCharOrigin). Distinct from the bounding-box corners; for many fonts the origin is at the baseline left of the glyph.
- loose_left, loose_bottom, loose_right, loose_top: the "loose" bounding box covering the entire glyph cell (font ascent / descent included), not just the glyph outline. Use these when you need consistent line heights; use bounds_* for the tight glyph extent.
- unicode_map_error (logical): TRUE when PDFium detected that the character's ToUnicode CMap is malformed for this glyph (the codepoint reported may be the PDF's fallback rather than the intended character).
- text_index (integer): 0-based position in the extractable text string (i.e. the linear pdf_doc_text() output) for this character, or NA for synthesised whitespace and other characters that don't appear in the extracted text.
- char_font_name (character): the font name PDFium reports for this specific character (via FPDFText_GetFontInfo). Per-character because pages can mix fonts within a single text run after PDFium re-flows characters during extraction.
- char_font_flags (integer): the PDF Font Descriptor /Flags bitmask for this character's font (PDF spec Table 121). Useful for detecting /Symbolic (bit 3) or /AllCap (bit 17) fonts whose ToUnicode mapping may be unreliable.
Returns a 0-row tibble of the same schema when the page has no text.
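
As an illustration of how the flag columns might be combined, the sketch below screens out characters whose text may be unreliable. It does not call the package: `chars` is a hand-built data frame standing in for the output of pdf_text_chars(), with invented values and only the columns the example needs. The bit masks assume the PDF convention that flag bits are numbered from 1 at the low-order end, so bit 3 has the value 4 and bit 17 has the value 65536.

```r
# Hypothetical stand-in for pdf_text_chars() output (values are made up).
chars <- data.frame(
  char              = c("A", "x", " ", "B"),
  is_generated      = c(FALSE, FALSE, TRUE, FALSE),
  unicode_map_error = c(FALSE, TRUE, FALSE, FALSE),
  char_font_flags   = c(32L, 4L, 32L, 65568L),
  stringsAsFactors  = FALSE
)

# PDF font flag bits are 1-based, so bit n has the value 2^(n - 1).
symbolic_bit <- 4L      # bit 3,  /Symbolic
allcap_bit   <- 65536L  # bit 17, /AllCap

is_symbolic <- bitwAnd(chars$char_font_flags, symbolic_bit) != 0L
is_allcap   <- bitwAnd(chars$char_font_flags, allcap_bit) != 0L

# A character is suspect if PDFium reported a CMap problem or its font
# carries one of the flags named above.
suspect <- chars$unicode_map_error | is_symbolic | is_allcap

# Keep only characters that exist in the PDF and are not suspect.
trusted <- chars[!chars$is_generated & !suspect, ]
```

Whether /AllCap fonts should be treated as suspect at all depends on the documents at hand; the rule here is only one possible policy.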
Details
This is the per-character analog of pdf_text_runs()
(per-text-object) and pdf_doc_text() (per-page). The three
coexist: use pdf_doc_text() when you just want the strings,
pdf_text_runs() for object-level positions, and
pdf_text_chars() when you need glyph-level geometry (e.g.
word segmentation, character-by-character layout analysis).
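
A minimal sketch of the word-segmentation use case mentioned above, under the assumption that words are separated by whitespace characters, whether stored in the PDF or synthesised by PDFium. The `chars` data frame is a hand-built stand-in for the output of pdf_text_chars() with invented coordinates; real pages would also need line breaks and hyphenation handled, which this sketch ignores.

```r
# Hypothetical stand-in for pdf_text_chars() output: the text "Hi yo".
chars <- data.frame(
  char_index   = 1:5,
  char         = c("H", "i", " ", "y", "o"),
  bounds_left  = c(10, 18, 22, 26, 33),
  bounds_right = c(17, 21, 26, 32, 39),
  is_generated = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Separators: synthesised characters or anything that is pure whitespace.
is_sep <- chars$is_generated | grepl("^[[:space:]]+$", chars$char)

# Every separator bumps the running word counter.
word_id <- cumsum(is_sep)

# Drop the separators themselves, keeping each glyph's word id.
glyphs     <- chars[!is_sep, ]
glyph_word <- word_id[!is_sep]

# One row per word: joined text plus the horizontal extent of its glyphs.
words <- data.frame(
  text  = as.vector(tapply(glyphs$char, glyph_word, paste, collapse = "")),
  left  = as.vector(tapply(glyphs$bounds_left, glyph_word, min)),
  right = as.vector(tapply(glyphs$bounds_right, glyph_word, max)),
  stringsAsFactors = FALSE
)
```

The same grouping could be extended with bounds_bottom and bounds_top to obtain a full bounding box per word.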