PDFJS losing check marks on pdf forms that are converted to text


I have been using an adaptation of code from these posts:

PDF to Text extractor in nodejs without OS dependencies

pdfjs: get raw text from pdf with correct newline/withespace

to convert pdfs to text:

import pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';

import {
    TextItem,
    DocumentInitParameters,
} from 'pdfjs-dist/types/src/display/api';

const getPageText = async (pdf: pdfjsLib.PDFDocumentProxy, pageNo: number) => {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const textItems = tokenizedText.items;
    let finalString = '';
    let line = 0;

    // Append each item's string, starting a new line whenever the
    // y coordinate (transform[5]) changes
    for (const rawItem of textItems) {
        const item = rawItem as TextItem;
        if (line !== item.transform[5]) {
            if (line !== 0) {
                finalString += '\r\n';
            }

            line = item.transform[5];
        }

        finalString += item.str;
    }
    return finalString;
};

export const getPDFText = async (
    data: string,
    password: string | undefined = undefined
) => {
    const initParams: DocumentInitParameters = {
        data: Buffer.from(data, 'base64'),
        //useSystemFonts: true,
        //disableFontFace: false,
        standardFontDataUrl: 'standard_fonts/'
    };

    if (password !== undefined) {
        initParams.password = password;
    }

    const pdf = await pdfjsLib.getDocument(initParams).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
        pageTextPromises.push(getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    const joined = pageTexts.join(' ');
    return joined;
};

With version 3.1.81 of pdfjs-dist this looks pretty good, but the check marks on form checkboxes are lost, and text fields' values show up at the end of each page instead of remaining in context. I suspect this page: https://pdftotext.com/ uses pdfjs, based on similarities with my output, but they get the checks on the boxes and their text field "answers" appear next to the question.

Run with:

import { join } from 'path';
import { readFileSync } from 'fs';

const rawContents = readFileSync(join('directory', 'file.pdf'), 'base64');

const pdfText = await getPDFText(rawContents as string);

Anyone have an idea why I am losing the checks (the boxes are there)?

Sample of what I get:

22. when something something?
☐ 0-3 months ago
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don't know

here is what that webpage gets:

22. when something something?

✔ 0-3 months ago
☐
☐ 4-6 months ago

☐ 7-12 months ago

☐ 13-18 months ago

☐ 19-24 months ago

☐ 25-60 months ago

☐ I don’t know

Again, my output looks like theirs but has lost these checks. I don't know for sure that they use pdfjs, but I think they do.

Note that I have downloaded and put a couple of fonts in the standard_fonts directory. Should I copy them all even if I see no warning message?


Solution

  • In forms, check boxes are field boundaries, not part of any nearby text. This is true of all fields: they are not directly connected to their description; they simply have a name and a value. In the example screenshot, Check Box1 & Box2 are placed and Box3 is awaiting its surface appearance.

    NOTE especially that they have no fixed appearance: they morph when displayed. They are chimeras that only look like they are present.

    In these AcroForm cases they have no native plain-text equivalent. There is nothing to detect: the index simply points to page coordinates.

    PDF.js is a PDF-to-HTML converter, so it can easily display those indexed areas as HTML fields. Note that in this viewer the check is rendered as an X.

    In terms of the PDF's extractable surface there is no text. For the boxes above and below, there is only the description seen alongside those radio boxes.

    In another viewer the same check is rendered as a tick. Nothing differs except the displayer (viewer).

    If we try to extract text using PDF.js (here in the browser), we get just the text.

    In some cases, where the native Symbol or ZapfDingbats fonts (or another TTF with those code points) have been embedded and adapted for state, it may be possible to get a fonted check mark symbol, but that is rare unless the form was specially designed for it.

    In your case you see ☐. Replacing it with ☑ or ☒ means picking the correct glyph from the font and adding it as the replacement. It's not very easy, but doable.

    So the above symbols, printed via HTML to PDF, can be extracted again using simple pdftotext or Python.

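
Since the field state lives in the form fields rather than in the page text, one possible approach is to read the fields separately and place a symbol at each field's page coordinates. The sketch below is only an illustration under assumptions, not what pdftotext.com or PDF.js itself does. It assumes the widget objects come from pdf.js `page.getAnnotations()` and carry properties such as `subtype`, `fieldType`, `checkBox`, `radioButton`, `fieldValue`, `exportValue`, `buttonValue` and `rect`; verify those names against your pdfjs-dist version. `TextPiece`, `WidgetLike`, `widgetToPiece` and `mergeIntoLines` are hypothetical names introduced here, and the tolerance of 4 units is a guess.

```typescript
// Local, simplified shapes. With pdf.js the text pieces would come from
// page.getTextContent().items and the widgets from page.getAnnotations()
// (assumed property names; check them against your pdfjs-dist version).
interface TextPiece {
    str: string;
    transform: number[]; // [a, b, c, d, x, y]
}

interface WidgetLike {
    subtype?: string; // 'Widget' for form fields
    fieldType?: string; // 'Btn' for buttons, 'Tx' for text fields
    checkBox?: boolean;
    radioButton?: boolean;
    fieldValue?: unknown;
    exportValue?: string; // "on" value of a checkbox
    buttonValue?: string; // "on" value of a radio button
    rect: number[]; // [x1, y1, x2, y2] in PDF user space
}

// Turn a form widget into a pseudo text item positioned at the widget's
// lower-left corner, or null if it has nothing to contribute.
const widgetToPiece = (w: WidgetLike): TextPiece | null => {
    if (w.subtype !== 'Widget') return null;
    const [x, y] = w.rect;
    const transform = [1, 0, 0, 1, x, y];
    if (w.fieldType === 'Btn' && (w.checkBox || w.radioButton)) {
        const onValue = w.checkBox ? w.exportValue : w.buttonValue;
        const checked =
            typeof w.fieldValue === 'string' &&
            w.fieldValue !== 'Off' &&
            (onValue === undefined || w.fieldValue === onValue);
        return { str: checked ? '☑ ' : '☐ ', transform };
    }
    if (w.fieldType === 'Tx' && typeof w.fieldValue === 'string' && w.fieldValue !== '') {
        return { str: w.fieldValue, transform };
    }
    return null;
};

// Merge page text and field values, grouping pieces whose y coordinates
// lie within `tolerance` units into one line, ordered left to right.
const mergeIntoLines = (
    items: TextPiece[],
    widgets: WidgetLike[],
    tolerance = 4
): string => {
    const pieces: TextPiece[] = [...items];
    for (const w of widgets) {
        const piece = widgetToPiece(w);
        if (piece) pieces.push(piece);
    }
    // PDF y grows upwards, so sort top of page first.
    pieces.sort((a, b) => b.transform[5] - a.transform[5]);

    const lines: TextPiece[][] = [];
    for (const piece of pieces) {
        const last = lines[lines.length - 1];
        if (last && Math.abs(last[0].transform[5] - piece.transform[5]) <= tolerance) {
            last.push(piece);
        } else {
            lines.push([piece]);
        }
    }
    return lines
        .map((line) =>
            line
                .sort((a, b) => a.transform[4] - b.transform[4])
                .map((p) => p.str)
                .join('')
        )
        .join('\r\n');
};
```

If the page text already contains its own ☐ glyph for each box, as in the question's output, the merged line would show both symbols, much like the `✔` followed by `☐` in the web page's result. Filtering out the original glyph is left out of this sketch.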