What this tool does
The PDF Text Extractor reads PDF files from an Airtable attachment field and writes their text content to a text field. It handles both digital PDFs (with embedded text) and scanned PDFs (image-based, requiring OCR).
Settings reference
Source
| Setting | Description |
|---|---|
| Source table | The table containing PDF attachments. |
| PDF attachment field | The attachment field containing PDF files. |
| Output text field | A Single Line Text or Multiline Text field where extracted text is written. |
| Enable OCR | Use Tesseract.js OCR for scanned or image-based PDFs. |
Output settings
| Setting | Description |
|---|---|
| Include page markers | Add --- Page N --- separators between pages. |
| Character limit handling | Truncate at 100K characters (Airtable’s limit) or Split across multiple fields. |
| Overflow fields | Additional text fields for the split strategy. Only shown when split mode is selected. |
| Multiple PDFs handling | When a record has more than one PDF: Combine text from all, or Extract first only. |
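To illustrate the page marker format, here is a minimal sketch of how per-page text might be joined. The function name `join_pages` is hypothetical, and placing a marker before every page (including the first) is an assumption; the tool's exact placement and spacing may differ.

```python
def join_pages(pages: list[str], include_markers: bool = True) -> str:
    """Join per-page text into one string, optionally labelling each page.

    Assumption: a '--- Page N ---' line precedes each page's text.
    """
    if not include_markers:
        return "\n".join(pages)
    parts: list[str] = []
    for number, text in enumerate(pages, start=1):  # pages are 1-indexed
        parts.append(f"--- Page {number} ---")
        parts.append(text)
    return "\n".join(parts)
```

For example, `join_pages(["first page", "second page"])` yields four lines: a marker for page 1, its text, a marker for page 2, and its text.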
Extraction settings
| Setting | Description |
|---|---|
| Extract specific page range | Toggle to limit extraction to certain pages. |
| Start page | First page to extract (1-indexed). Only shown when page range is on. |
| End page | Last page to extract. Only shown when page range is on. |
| OCR language | Language for OCR: English, Spanish, French, German, Chinese (Simplified), Japanese, or Arabic. Only shown when OCR is on. |
| Generate keyword index | Extract the top 20 keywords by frequency from the text. |
| Keyword field | A text field to store the extracted keywords. Only shown when keyword extraction is on. |
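As a rough illustration of what "top 20 keywords by frequency" could mean, the sketch below counts words and keeps the most common ones. It is not the tool's actual implementation: the function name `top_keywords`, the tokenising regex, and the small stop-word list are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: very common words are ignored. The tool's real list, if any, is unknown.
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"})

def top_keywords(text: str, limit: int = 20) -> list[str]:
    """Return up to `limit` words ordered from most to least frequent."""
    # Lowercase the text and keep alphabetic words of two or more characters.
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]
```

The resulting list could then be joined with commas and written to the keyword field.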
Record filters
| Setting | Description |
|---|---|
| Source view | Limit to records in a specific view. |
| Skip already processed records | Skip records where the output field already has content. |
How the tool detects scanned PDFs
The tool automatically checks each page for text content. If a page contains fewer than 50 characters, it’s treated as a scanned/image page and OCR is applied (if enabled). Digital pages with text are extracted directly without OCR.
Handling the 100K character limit
Airtable text fields have a 100,000 character limit. If a PDF’s extracted text exceeds this:
- Truncate at 100K: Text is cut at the limit. The remaining content is lost.
- Split across multiple fields: Text is split at line boundaries across multiple fields you specify. This preserves all content.
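The two strategies can be sketched as follows. This is a minimal illustration, not the tool's code: `FIELD_LIMIT`, `truncate`, and `split_at_lines` are hypothetical names, and the handling of a single line longer than the limit (a hard cut) is an assumption, since the text above only says splits happen at line boundaries.

```python
FIELD_LIMIT = 100_000  # character limit stated above

def truncate(text: str, limit: int = FIELD_LIMIT) -> str:
    """Keep the first `limit` characters; everything after is discarded."""
    return text[:limit]

def split_at_lines(text: str, limit: int = FIELD_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking after line ends."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Assumption: a line longer than the limit has no boundary to use, so hard-cut it.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the first chunk would go to the output text field and each later chunk to the next overflow field. Joining the chunks reproduces the original text, which is what "preserves all content" implies. If there are more chunks than configured fields, some content would still not fit, so it may be worth configuring enough overflow fields for your longest documents.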
Common questions
Can it extract text from password-protected PDFs?
No. Password-protected PDFs are skipped with a warning in the execution log.
What if a PDF is corrupted?
Corrupted or invalid PDF files are skipped with an error in the execution log.
How accurate is the OCR?
Accuracy depends on scan quality, resolution, and language. Clean, high-resolution scans of printed text generally give the best results. Low-quality scans and handwriting are recognized less reliably.
Can I extract from PDFs and then search the text in Airtable?
Yes. Once the text is in a text field, you can search and filter by it in Airtable’s grid view.