c# asp.net pdf itext syncfusion

C# (ASP.NET): Convert a PDF to a .txt file while keeping the PDF alignment (spaces and padding in the .txt file should match the PDF)?


With both iTextSharp and PdfBox I am able to extract the text characters, but their alignment is not the same as in the PDF file (left margin, top margin, etc.).

How can I keep the PDF alignment in the .txt file as well?


Solution

  • As you've experienced when experimenting with both iText and PdfBox, you are asking for something that is impossible, because there is a mismatch between the way the Portable Document Format defines a layout and the way layout is established in the plain text format.

    • In .txt files, alignment, indentation, spacing, and so on are achieved using white space characters, such as spaces (" "), newline characters (\n), and tabs (\t).
    • In .pdf files, single space characters are often used in-between words, but when more than one space is needed, or in cases when word-spacing is optimized for a better reading experience, you'll see that absolute positioning is preferred over using space characters. The \n in a content stream isn't perceived as a new line for the content, but the concept of a new line exists through new line operators. The concept of a tab doesn't exist at all in PDF; absolute positioning using (x, y) coordinates is used instead.

    Your expectation that a conversion process from PDF to TXT could somehow solve this syntactical mismatch is endearing, but it starts from an assumption that is totally wrong: you'd need absolute positioning functionality in the plain text format, and that functionality simply isn't there. The answer to your question is that there is no answer.
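To make the mismatch concrete, here is a minimal sketch of the best a converter can do: snap each positioned text run onto a grid of character cells. `TextChunk` and `GridLayoutSketch` are hypothetical names, not types from iText or PdfBox, and the sketch assumes your extraction library can report an (x, y) start position in points for each run. The `charWidth` and `lineHeight` values are pure assumptions, since PDF defines no such cell size; that is exactly why the result is only an approximation and drifts as soon as the page uses a proportional font.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical input type: one run of text plus the PDF coordinates of its
// start point, in points, with the origin at the bottom-left of the page.
public sealed class TextChunk
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }

    public TextChunk(double x, double y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class GridLayoutSketch
{
    // charWidth and lineHeight are guesses (assumptions, not PDF properties).
    public static string Render(IEnumerable<TextChunk> chunks,
                                double charWidth = 6.0,
                                double lineHeight = 12.0)
    {
        // Group chunks into text rows by rounding y to a line index.
        // A larger y is higher on the page, so sort in descending order.
        var rows = chunks
            .GroupBy(c => (int)Math.Round(c.Y / lineHeight))
            .OrderByDescending(g => g.Key);

        var output = new StringBuilder();
        int? previousRow = null;

        foreach (var row in rows)
        {
            // Emit empty lines for vertical gaps between rows.
            if (previousRow.HasValue)
            {
                output.Append('\n', previousRow.Value - row.Key - 1);
            }

            var line = new StringBuilder();
            foreach (var chunk in row.OrderBy(c => c.X))
            {
                int column = (int)Math.Round(chunk.X / charWidth);

                if (line.Length < column)
                {
                    // Pad with spaces up to the target column.
                    line.Append(' ', column - line.Length);
                }
                else if (line.Length > 0)
                {
                    // The previous text already passed this column:
                    // alignment is lost, keep one space so words stay apart.
                    line.Append(' ');
                }

                line.Append(chunk.Text);
            }

            output.Append(line.ToString()).Append('\n');
            previousRow = row.Key;
        }

        return output.ToString();
    }
}
```

With the default assumptions, a chunk starting at x = 72 is placed in column 12 (72 / 6). Change `charWidth` to 5 and the same chunk lands somewhere else, which shows that the "alignment" in the .txt file comes from the guess, not from the PDF.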