The challenges of PDF text extraction

MonkeybreadSoftware · July 11, 2025, 7:43am

Since FileMaker 2025, you have a quick and easy function to extract text from PDF files in containers:

GetTextFromPDF ( container )

You can call it to get some text quickly and easily. Great, but what runs under the hood?

On macOS and iOS, we expect Claris to use the PDFKit functions built into the system. So it should return the same text as our PDFKit.GetPDFText function.

Windows and Linux don't come with a PDF library, so FileMaker brings the pdfium library. This is the one used by Chromium, and it can do text extraction.

Since Claris uses different libraries, you may not get exactly the same output on all platforms. There will be differences in the text, like where spaces are added and the order of text fragments.

If you need more control over text extraction, please check out our DynaPDF and PDFKit functions in MBS FileMaker Plugin.

A few things make PDF text extraction difficult:

Spaces: In many PDF pages, there are no codes for spaces. They simply don't exist in the file. Whenever there is a gap between two words or two letters, the library needs to measure the distance and figure out how many spaces to insert in the extracted text. See our DynaPDF.SetSpaceWidthFactor function.
Order: The text fragments in a PDF don't need to follow any particular order. The software creating the PDF may usually put text on the page from top to bottom (or bottom to top) and then later add more text such as page numbers. Every PDF library doing text extraction must apply an algorithm to sort the text fragments into lines and then into running text.
Font encoding: Fonts may come with encodings that the PDF library can't decode. Then we get either no text or gibberish. DynaPDF can load CMap files to provide encodings for Chinese, Japanese or Korean text and decode these characters properly.
Scanned PDFs: Some PDFs are scanned and only contain images. DynaPDF can detect if a page contains no text but a single image, and extract that image so you can use our OCR functions on it.
Text as vector graphics: With DynaPDF Pro, we can convert text in a PDF to vector drawings. This was originally made to preserve exotic fonts when sending a PDF to a printer that can't draw the font directly. If the PDF contains text as vector graphics, you need to render the page and do OCR.
Text on a curve: Any non-horizontal text can be very difficult for text extraction. Text on a curve or slanted text may show up as multiple lines.
Multiple-column text: If the PDF has text layouts with multiple columns or tables, the text extraction may not pick up the right order for the text fragments.
Layers: A PDF document may have multiple layers, where some are visible while others are invisible and maybe only activated with JavaScript. If text from multiple layers is mixed in the text extraction, you may get odd results.
Broken PDFs: A lot of PDF documents contain structural errors. DynaPDF has a couple of repair features that automatically fix common issues we see.
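
To make the first two points (spaces and order) more concrete, here is a minimal Python sketch of what an extractor has to do with positioned text fragments. This is only an illustration under simple assumptions, not the algorithm used by DynaPDF, PDFKit or pdfium. The Fragment type, the function name and the parameters line_tolerance, space_width and space_factor are all made up for this example; space_factor is only loosely comparable to a space width factor setting.

```python
from dataclasses import dataclass

@dataclass
class Fragment:           # hypothetical type, not part of any PDF library
    text: str
    x: float              # left edge in points
    y: float              # baseline in points (PDF origin is bottom-left)
    width: float          # horizontal extent of the fragment

def fragments_to_text(fragments, line_tolerance=2.0,
                      space_width=3.0, space_factor=0.5):
    """Group fragments into lines, then insert spaces where gaps are wide."""
    # Top of the page first (larger y), then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    lines = []
    for frag in ordered:
        # Same line if the baseline is close to the line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Assumption: a gap wider than a fraction of a space is a space.
            if gap > space_width * space_factor:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return "\n".join(result)

# The page number was written first, the word "Hello" is split in two pieces.
sample = [
    Fragment("Page 1", 280, 40, 30),
    Fragment("World", 131, 700, 28),
    Fragment("Hel", 100, 700, 15),
    Fragment("lo", 115.2, 700.4, 10),
    Fragment("Second line", 100, 686, 60),
]
print(fragments_to_text(sample))
# Hello World
# Second line
# Page 1
```

Changing line_tolerance or space_factor changes the output for the same PDF, which is one reason why two libraries rarely agree on every space and line break. Multiple columns, rotated text and tables need much more logic than this.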

Please try the GetTextFromPDF function and compare it to DynaPDF's capabilities. We can do some more things:

Extract text of the whole document.
Extract text of an individual page.
Extract text of a rectangle area on a page.
Find text in the PDF and get the position on the page.
Load CMap files to cover CJK characters.
Detect and remove overlapping text.
Decide the sorting algorithm with various flags.
Pass a password to open an encrypted PDF file.
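
As a rough idea of what two items from this list mean (text in a rectangle and overlapping text), here is a small Python sketch. It does not show how DynaPDF works internally. The Fragment type, both function names and the tolerance value are hypothetical, and we assume here that overlapping text means the same text drawn twice at nearly the same position, for example to fake a bold or shadow effect.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:           # hypothetical type for this illustration
    text: str
    x: float              # left edge in points
    y: float              # bottom edge in points
    width: float
    height: float

def in_rect(frag, left, bottom, right, top):
    """True if the fragment's box lies fully inside the rectangle."""
    return (frag.x >= left and frag.x + frag.width <= right
            and frag.y >= bottom and frag.y + frag.height <= top)

def drop_overlapping_duplicates(fragments, tolerance=1.0):
    """Keep only the first of several equal texts at nearly the same place."""
    kept = []
    for frag in fragments:
        duplicate = any(k.text == frag.text
                        and abs(k.x - frag.x) <= tolerance
                        and abs(k.y - frag.y) <= tolerance
                        for k in kept)
        if not duplicate:
            kept.append(frag)
    return kept

page = [
    Fragment("Total", 100, 100, 30, 10),
    Fragment("Total", 100.4, 100.2, 30, 10),   # drawn twice for fake bold
    Fragment("Footer", 100, 20, 40, 10),
]
unique = drop_overlapping_duplicates(page)
body = [f.text for f in unique if in_rect(f, 50, 80, 300, 200)]
print(body)
```

Without the duplicate check, the extracted text would contain "Total" twice, although a reader sees it only once on the page.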

We tried a couple of PDFs and found differences between FileMaker on macOS and Windows as well as between DynaPDF and FileMaker. One problem was that for some reason the "-" in a PDF was changed to Char(2)!? We also saw that the decision whether to put a space between two words can vary between all three libraries.
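
If you want to run such a comparison yourself, you could export the extracted text from each library and check it with a small script. The following Python sketch is just one possible way to do it. The function names and the sample strings are invented for this example and are not actual output from FileMaker or DynaPDF.

```python
import re
import unicodedata

def find_control_chars(text):
    """Return (index, code point) of control characters except tab, LF, CR."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"]

def same_ignoring_whitespace(a, b):
    """Compare two extractions with all whitespace removed."""
    def strip(s):
        return re.sub(r"\s+", "", s)
    return strip(a) == strip(b)

# Invented sample data: the second text has different spacing
# and a Char(2) where the first one has a hyphen.
text_a = "Invoice 2025-07\nTotal: 100"
text_b = "Invoice  2025\x0207 Total:100"

print(find_control_chars(text_b))                    # [(13, 2)]
print(same_ignoring_whitespace(text_a, text_b))      # False
repaired = text_b.replace("\x02", "-")
print(same_ignoring_whitespace(text_a, repaired))    # True
```

Ignoring whitespace hides the space differences, so whatever remains is a real difference in characters, like the Char(2) case above.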

If you have questions, please let us know. Feel free to try DynaPDF with MBS FileMaker Plugin without a license.

FabriceN · July 11, 2025, 4:30pm

Great post. Thanks!
