
What this tool does

The PDF Text Extractor reads PDF files from an Airtable attachment field and writes their text content to a text field. It handles both digital PDFs (with embedded text) and scanned PDFs (image-based, requiring OCR).

Settings reference

Source

Setting | Description
Source table | The table containing PDF attachments.
PDF attachment field | The attachment field containing PDF files.
Output text field | A Single Line Text or Multiline Text field where extracted text is written.
Enable OCR | Use Tesseract.js OCR for scanned or image-based PDFs.

Output settings

Setting | Description
Include page markers | Add --- Page N --- separators between pages.
Character limit handling | Truncate at 100K characters (Airtable’s limit) or Split across multiple fields.
Overflow fields | Additional text fields for the split strategy. Only shown when split mode is selected.
Multiple PDFs handling | When a record has more than one PDF, either Combine text from all or Extract first only.
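
To illustrate the Include page markers setting, here is a minimal sketch of how per-page text might be joined. The function name joinPages and its parameters are hypothetical, and placing a marker before each page (rather than only between pages) is an assumption; the tool's exact output layout may differ.

```typescript
// Hypothetical helper, not the tool's actual code.
// Joins per-page text and optionally prefixes each page with a
// "--- Page N ---" marker, where N is the 1-indexed page number.
function joinPages(
  pages: string[],
  includeMarkers: boolean,
  firstPageNumber: number = 1
): string {
  return pages
    .map((text, i) =>
      includeMarkers ? `--- Page ${firstPageNumber + i} ---\n${text}` : text
    )
    .join("\n");
}
```

The firstPageNumber parameter is there so that, when a page range is selected, markers can still carry the original page numbers.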

Extraction settings

Setting | Description
Extract specific page range | Toggle to limit extraction to certain pages.
Start page | First page to extract (1-indexed). Only shown when page range is on.
End page | Last page to extract. Only shown when page range is on.
OCR language | Language for OCR: English, Spanish, French, German, Chinese (Simplified), Japanese, or Arabic. Only shown when OCR is on.
Generate keyword index | Extract the top 20 keywords by frequency from the text.
Keyword field | A text field to store the extracted keywords. Only shown when keyword extraction is on.
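
The keyword index is described as the top 20 keywords by frequency. A rough sketch of that idea follows. The tokenization rules here (lowercasing, ASCII letters and digits only, a three-character minimum, no stop-word list) are assumptions for illustration, and topKeywords is a hypothetical name; the tool may tokenize differently, especially for non-Latin OCR languages.

```typescript
// Hypothetical sketch of frequency-based keyword extraction.
function topKeywords(text: string, limit: number = 20): string[] {
  const counts = new Map<string, number>();

  // Assumption: a "word" is a run of ASCII letters or digits.
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  for (const word of words) {
    if (word.length < 3) continue; // assumption: ignore very short tokens
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Sort by descending count; break ties alphabetically for stable output.
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, limit)
    .map((entry) => entry[0]);
}
```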

Record filters

Setting | Description
Source view | Limit to records in a specific view.
Skip already processed records | Skip records where the output field already has content.

How the tool detects scanned PDFs

The tool checks each page for embedded text. A page with fewer than 50 characters of text is treated as a scanned (image-based) page, and OCR is applied if Enable OCR is on. Pages with 50 or more characters are extracted directly, without OCR.
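
The per-page check can be sketched as below. The 50-character threshold comes from the description above; the name classifyPage, the result shape, and trimming whitespace before counting are assumptions made for the example.

```typescript
// Threshold stated in this article: fewer than 50 characters means "scanned".
const MIN_TEXT_CHARS = 50;

interface PageClassification {
  looksScanned: boolean; // page has too little embedded text
  useOcr: boolean;       // OCR runs only if the page looks scanned and OCR is enabled
}

// Hypothetical helper, not the tool's actual code.
function classifyPage(embeddedText: string, ocrEnabled: boolean): PageClassification {
  // Assumption: surrounding whitespace does not count toward the threshold.
  const looksScanned = embeddedText.trim().length < MIN_TEXT_CHARS;
  return { looksScanned, useOcr: looksScanned && ocrEnabled };
}
```

Because the check runs per page, a PDF that mixes digital and scanned pages would only send the scanned pages through OCR.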

Handling the 100K character limit

Airtable text fields have a 100,000 character limit. If a PDF’s extracted text exceeds this:
  • Truncate at 100K: Text is cut at the limit. The remaining content is lost.
  • Split across multiple fields: Text is split at line boundaries across multiple fields you specify. This preserves all content.
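
The two strategies can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and hard-cutting a single line that is itself longer than the limit is an assumed fallback.

```typescript
const FIELD_LIMIT = 100000; // Airtable text field limit stated above

// Truncate strategy: keep the first `limit` characters, drop the rest.
function truncateText(text: string, limit: number = FIELD_LIMIT): string {
  return text.slice(0, limit);
}

// Split strategy: cut at the last line break that fits within the limit,
// so every chunk ends on a line boundary where possible.
function splitAtLineBoundaries(text: string, limit: number = FIELD_LIMIT): string[] {
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > limit) {
    // Last newline whose index still keeps the chunk within the limit.
    const cut = rest.lastIndexOf("\n", limit - 1);
    // Assumption: if no newline fits, fall back to a hard cut at the limit.
    const end = cut >= 0 ? cut + 1 : limit;
    chunks.push(rest.slice(0, end));
    rest = rest.slice(end);
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

In this sketch the first chunk would go to the output text field and the remaining chunks to the overflow fields in order. What the tool does when there are more chunks than fields is not stated here.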

Common questions

Password-protected PDFs are not supported. They are skipped with a warning in the execution log.
Corrupted or invalid PDF files are skipped with an error in the execution log.
OCR accuracy depends on scan quality, resolution, and language. High-quality scans with clear text produce good results; low-quality or handwritten scans may have reduced accuracy.
Extracted text is searchable. Once the text is in a text field, you can search and filter by it in Airtable’s grid view.