
Empty whitespace conversion in PDFClown


I'm having an issue with the TextExtractor class in PDFClown: occurrences of an empty whitespace character, also known as a "discretionary newline". These characters are embedded seemingly at random and are ignored by Acrobat Reader. As a result, a line containing them is displayed as a single line in Acrobat, but is broken into many lines when the text is extracted if I specify '\n' as the newline character in TextExtractor.ToString(...).

It appears that PDFClown simply converts any whitespace character into a single space (' '). Is there a way to bypass this conversion, so that the original character is extracted instead?


Solution

  • After more research, it appears that the PDFClown library is very buggy. There are several issues:

    • Converts most forms of space character to a single normal space character.
    • Inserts spaces instead of newlines.
    • If you attempt to use the provided overrides to insert your own character for spaces or newlines, the internal mappings from characters in the extracted array to the boxes for each individual character get destroyed.
    • Cannot properly decode all embedded fonts.
    • Because it cannot properly decode embedded fonts, it silently omits characters from the extracted text.
    • Cannot reliably handle ligatures or the decomposition of ligatures. They are often silently dropped from the extracted text altogether.

    To come directly to the issue I had: you can detect and remove these "false" whitespace characters by checking whether their bounding rectangle overlaps that of other, non-whitespace characters. Given all the other issues with the library, though, my advice is to use PDFBox instead.
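
    The overlap check described above can be sketched independently of any PDF library. The following is a minimal Python illustration, not PDFClown's API: the CharBox type, the strip_false_whitespace function, and the 0.5 overlap threshold are all hypothetical, and it assumes you have already obtained each extracted character together with its bounding rectangle. Treating zero-area whitespace boxes as "false" is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CharBox:
    """One extracted character and its bounding box (hypothetical type)."""
    char: str
    x: float
    y: float
    width: float
    height: float


def overlap_area(a: CharBox, b: CharBox) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not intersect."""
    w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return w * h if w > 0 and h > 0 else 0.0


def strip_false_whitespace(chars: List[CharBox], min_overlap: float = 0.5) -> str:
    """Drop whitespace whose box is mostly covered by a non-whitespace glyph.

    min_overlap is an assumed threshold: the fraction of the whitespace
    box that must be covered before the character is considered "false".
    """
    glyphs = [c for c in chars if not c.char.isspace()]
    kept = []
    for c in chars:
        if c.char.isspace():
            area = c.width * c.height
            # Assumption: a whitespace box with no area takes up no room
            # on the page, so it is treated as a false whitespace.
            if area <= 0:
                continue
            if any(overlap_area(c, g) / area >= min_overlap for g in glyphs):
                continue
        kept.append(c.char)
    return "".join(kept)
```

    Whitespace that sits in a real gap between glyphs overlaps nothing and is kept, while whitespace drawn on top of a neighbouring glyph is removed. The threshold would likely need tuning against real documents.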

    If you're using .NET and you'd like to use PDFBox, you can use Tika On Dot Net, which is the Apache Tika project brought over to .NET via IKVM.

    Apache Tika is a collection of other libraries, including PDFBox. Tika On Dot Net currently ships PDFBox 1.8.10 and has a NuGet package that makes it easy to add to your project.

    I had a project go 1.5 weeks over deadline because all of these issues were discovered halfway through, which required a full rewrite. Just a heads up.