Find every occurrence of a query string in a PDF

Searches each page of the document for query and returns a row per match with the page number, character offset, matched text, and bounding box in PDF user-space points. Wraps PDFium's FPDFText_FindStart / FPDFText_FindNext family.

Usage

pdf_text_search(
  doc,
  query,
  case_sensitive = FALSE,
  whole_word = FALSE,
  consecutive = FALSE,
  direction = c("next", "prev"),
  password = NULL
)

Arguments

doc: A pdfium_doc from pdf_doc_open(), or a character path.
query: Single non-empty character string to find. Encoded to UTF-16LE before being handed to PDFium; any character representable in UTF-8 works (including supplementary-plane code points via surrogate pairs).
case_sensitive: If TRUE, only exact-case matches are returned. Default FALSE (case-insensitive ASCII letters; PDFium does not promise case folding for non-ASCII letters).
whole_word: If TRUE, the match must be bounded by word-break characters (whitespace / punctuation) on both sides. Default FALSE.
consecutive: If TRUE, after a match the next search resumes immediately after the match end; if FALSE (default), PDFium skips ahead by one character before searching again, so overlapping matches are not reported.
direction: Search direction: "next" (default) walks from page start to end via FPDFText_FindNext; "prev" walks from end to start via FPDFText_FindPrev. Results are returned in iteration order, so reverse search produces matches in right-to-left / bottom-to-top order on each page.
password: Optional password for encrypted PDFs when doc is a path. Ignored when doc is already an open pdfium_doc.

Value

A tibble with one row per match and columns:

page (integer, 1-based)
match_index (integer, 1-based within page)
start_char (integer, 0-based character offset on the page)
char_count (integer, number of characters in the match)
text (character, the matched substring, UTF-8)
left, bottom, right, top (numeric, axis-aligned union of the matched characters' bounding boxes in PDF user-space points; NA when PDFium reports no bounds, which can happen for glyphs without a positioned origin)

The tibble has zero rows when no matches are found. Column types are stable across the zero-row and non-zero-row cases.

Details

Match indexing is character-based: PDFium's text page is an indexable sequence of glyph-derived characters in reading order, and start_char is the 0-based offset of the first matched character on that page. The same offset can be cross-referenced against pdf_text_chars() output if you need per-character bounds rather than per-match bounds.

Multi-line matches (where the matched text wraps across lines) are reported as a single row whose bounding box is the axis-aligned union of every contributing character's bounding box. If you need one rectangle per line for highlighting, expand each row by iterating pdf_text_chars() over start_char:(start_char + char_count - 1).

Examples

fixture <- system.file("extdata", "fixtures", "unicode.pdf",
  package = "pdfium"
)
if (nzchar(fixture)) {
  pdf_text_search(fixture, "Hello")
  pdf_text_search(fixture, "WORLD", case_sensitive = FALSE)
}
#> # A tibble: 1 × 9
#>    page match_index start_char char_count text   left bottom right   top
#>   <int>       <int>      <int>      <int> <chr> <dbl>  <dbl> <dbl> <dbl>
#> 1     1           1          7          5 world  129.   137.  158.  146.