Searches each page of the document for query and returns a row
per match with the page number, character offset, matched text,
and bounding box in PDF user-space points. Wraps PDFium's
FPDFText_FindStart / FPDFText_FindNext family.
Usage
pdf_text_search(
  doc,
  query,
  case_sensitive = FALSE,
  whole_word = FALSE,
  consecutive = FALSE,
  password = NULL
)
Arguments
- doc
A pdfium_doc from pdf_doc_open(), or a character path.
- query
Single non-empty character string to find. Encoded to UTF-16LE before being handed to PDFium; any character representable in UTF-8 works (including supplementary-plane code points via surrogate pairs).
- case_sensitive
If TRUE, only exact-case matches are returned. Default FALSE (case-insensitive for ASCII letters; PDFium does not promise case folding for non-ASCII letters).
- whole_word
If TRUE, the match must be bounded by word-break characters (whitespace or punctuation) on both sides. Default FALSE.
- consecutive
If TRUE, after a match the next search resumes immediately after the match end; if FALSE (default), PDFium skips ahead by one character before searching again, so overlapping matches are not reported.
- password
Optional password for encrypted PDFs when doc is a path. Ignored when doc is already an open pdfium_doc.
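The UTF-16LE encoding mentioned for query can be inspected from base R. The sketch below uses iconv() to show how a supplementary-plane code point becomes a surrogate pair. It does not reproduce the package's internal conversion, which is not shown here, and the helper name to_utf16le_units is made up for illustration.

```r
# Hypothetical helper: list the UTF-16LE code units of a string.
# Uses base R's iconv(); this is an illustration, not the package's own code.
to_utf16le_units <- function(x) {
  bytes <- iconv(enc2utf8(x), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  # Pair the little-endian bytes into 16-bit code units (low byte comes first).
  lo <- as.integer(bytes[c(TRUE, FALSE)])
  hi <- as.integer(bytes[c(FALSE, TRUE)])
  sprintf("%04X", hi * 256L + lo)
}

to_utf16le_units("A")           # one code unit: "0041"
to_utf16le_units("\U0001F600")  # surrogate pair: "D83D" "DE00"
```

A query containing such a character is still a single string on the R side; only the encoded form handed to PDFium uses two code units.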
Value
A tibble with one row per match and columns:
- page (integer, 1-based)
- match_index (integer, 1-based within page)
- start_char (integer, 0-based character offset on the page)
- char_count (integer, number of characters in the match)
- text (character, the matched substring, UTF-8)
- left, bottom, right, top (numeric, axis-aligned union of the matched characters' bounding boxes in PDF user-space points; NA when PDFium reports no bounds, which can happen for glyphs without a positioned origin)
The tibble has zero rows when no matches are found. Column types are stable across the zero-row and non-zero-row cases.
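Stable column types mean that results from several searches can be row-bound without special-casing empty ones. The sketch below uses base data frames as stand-ins for the returned tibbles. empty_matches() is a hypothetical constructor that mirrors the columns listed above, not a function from the package, and the one-row result uses rounded values for illustration.

```r
# Hypothetical stand-in for a zero-match result with the documented columns.
empty_matches <- function() {
  data.frame(
    page = integer(), match_index = integer(),
    start_char = integer(), char_count = integer(),
    text = character(),
    left = numeric(), bottom = numeric(), right = numeric(), top = numeric(),
    stringsAsFactors = FALSE
  )
}

# An illustrative one-match result with the same columns.
one_match <- data.frame(
  page = 1L, match_index = 1L, start_char = 7L, char_count = 5L,
  text = "world", left = 129, bottom = 137, right = 158, top = 146,
  stringsAsFactors = FALSE
)

# Binding an empty result to a non-empty one keeps every column's type.
combined <- rbind(empty_matches(), one_match)
nrow(combined)
vapply(combined, class, character(1))
```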
Details
Match indexing is character-based: PDFium's text page is an
indexable sequence of glyph-derived characters in reading order,
and start_char is the 0-based offset of the first matched
character on that page. The same offset can be cross-referenced
against pdf_text_chars() output if you need per-character bounds
rather than per-match bounds.
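The offset arithmetic can be sketched with a made-up per-character table. The column names char_index and char are assumptions chosen for illustration; check the actual pdf_text_chars() output before relying on them.

```r
# Made-up per-character table; the real pdf_text_chars() columns may differ.
# `char_index` is assumed to hold the 0-based character offset on the page.
chars <- data.frame(
  char_index = 0:11,
  char = strsplit("Hello, world", "")[[1]],
  stringsAsFactors = FALSE
)

# One match described by a 0-based start and a length, as in the result tibble.
hit <- list(start_char = 7L, char_count = 5L)

# 0-based offsets covered by the match: start_char, ..., start_char + char_count - 1.
idx <- seq(hit$start_char, length.out = hit$char_count)
matched <- chars[chars$char_index %in% idx, ]
paste(matched$char, collapse = "")
```

Using seq() with length.out rather than the colon operator keeps the range empty when char_count is zero.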
Multi-line matches (where the matched text wraps across lines)
are reported as a single row whose bounding box is the axis-aligned
union of every contributing character's bounding box. If you need
one rectangle per line for highlighting, expand each row by
iterating pdf_text_chars() over start_char:(start_char + char_count - 1).
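A minimal sketch of that expansion follows, using made-up character boxes for a match that wraps onto a second line. The column names and the rule for grouping characters into lines (an identical bottom coordinate) are assumptions for illustration; real pages will likely need a tolerance on the baseline and should drop characters whose bounds are NA first.

```r
# Made-up character boxes; column names are assumptions, not the package's schema.
chars <- data.frame(
  char_index = 0:5,
  left   = c(500, 510, 520, 72, 82, 92),
  bottom = c(700, 700, 700, 686, 686, 686),
  right  = c(510, 520, 530, 82, 92, 102),
  top    = c(710, 710, 710, 696, 696, 696)
)
hit <- list(start_char = 1L, char_count = 4L)

# Characters covered by the match (0-based offsets 1 to 4).
idx <- seq(hit$start_char, length.out = hit$char_count)
sel <- chars[chars$char_index %in% idx, ]

# Crude line grouping: characters sharing a `bottom` value are on one line.
line_id <- match(sel$bottom, unique(sel$bottom))

# One axis-aligned union rectangle per line.
rects <- do.call(rbind, lapply(split(sel, line_id), function(g) {
  data.frame(
    left = min(g$left), bottom = min(g$bottom),
    right = max(g$right), top = max(g$top)
  )
}))
rects
```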
See also
pdf_doc_text() for whole-page text, pdf_text_runs() for
per-text-object structure, pdf_text_chars() for per-character
positions.
Examples
fixture <- system.file("extdata", "fixtures", "unicode.pdf",
  package = "pdfium"
)
if (nzchar(fixture)) {
  pdf_text_search(fixture, "Hello")
  pdf_text_search(fixture, "WORLD", case_sensitive = FALSE)
}
#> # A tibble: 1 × 9
#>    page match_index start_char char_count text   left bottom right   top
#>   <int>       <int>      <int>      <int> <chr> <dbl>  <dbl> <dbl> <dbl>
#> 1     1           1          7          5 world  129.   137.  158.  146.