Tags: c#, python, unicode, utf-8, ligature

C# / Python Encoding difference


Basically, I am converting PDFs into text, then analyzing and clipping parts of that text using a library in Python. The Python "clipping" doesn't actually cut the text into separate files; it just records a start and end character position for string extraction. For example:

the quick brown fox jumped over the lazy dog

My Python code might cut out "quick" by specifying 4, 9. I then use C# for a GUI program that takes these values assigned by Python, and it works... for the most part. It appears the optical character recognition program that turned the PDF into a text file included some odd Unicode characters, which change the counts on the C# side.

The odd characters from the PDF-to-text conversion include a "ﬁ" ligature character instead of separate "f" and "i" characters (possibly other characters too; they are large files). This wouldn't be a problem, except that C# says this is one character, while Python (as well as Notepad++) considers it 3 character positions.

C#: "ﬁ" length = 1 character.

Python/Notepad++: "ﬁ" length = 3 characters.
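
To make the two counts concrete, here is a minimal Python 3 sketch. It assumes the character in question is U+FB01 (the single-code-point "fi" ligature), which the original text file is not confirmed to contain; the variable names are purely illustrative.

```python
# Assumption: the odd character is U+FB01, the "fi" ligature.
ligature = "\ufb01"

# Number of Unicode code points (what Python 3 len() reports for a str).
char_count = len(ligature)

# Number of bytes once encoded as UTF-8 (what a byte-oriented tool counts).
byte_count = len(ligature.encode("utf-8"))

# Number of UTF-16 code units (what C# string.Length reports).
utf16_units = len(ligature.encode("utf-16-le")) // 2

print(char_count, byte_count, utf16_units)
```

Under that assumption, a count of 3 is a byte count and a count of 1 is a character count.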

The result is an offset clip caused by the difference in character counts. As I said, when I run it in Python (on Linux) and output the clipping, it's perfect, and after I transferred the text file to Windows, Notepad++ confirmed the positions are correct. C# simply counts the "ﬁ" as one character, while Notepad++ and Python count it as 3 characters for some reason.

I need a way to bridge this discrepancy from the Python side OR the C# side.


Solution

  • You have to distinguish between characters and bytes. UTF-8 is a character encoding in which one character can take up to 4 bytes. The "ﬁ" ligature (U+FB01) is one character that takes 3 bytes in UTF-8, which matches the counts you are seeing. Notepad++ is probably displaying byte positions, whereas Python can work with both byte strings and character strings. In C#, you have probably read the file as a text file, which also produces character strings.

    To read character strings in Python, use:

    import codecs
    # Decode the file as UTF-8 so that offsets count characters, not bytes.
    with codecs.open(filename, encoding="utf-8") as inp:
        text = inp.read()
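
If the byte-based positions have already been stored, one way to bridge the gap is to convert them rather than regenerate them. The following is a minimal sketch, not the original library's method: `byte_span_to_utf16_span` is a hypothetical helper name, the sample sentence is made up, and it assumes the stored offsets are UTF-8 byte positions that fall on character boundaries. It converts to UTF-16 code units because that is what C# `string.Length` and `Substring` count.

```python
def byte_span_to_utf16_span(data, start, end):
    """Map a [start, end) UTF-8 byte span to UTF-16 code-unit offsets.

    Hypothetical helper: raises UnicodeDecodeError if an offset
    falls in the middle of a multi-byte character.
    """
    def units(prefix):
        # Decode the bytes before the offset, then count UTF-16 code units.
        return len(prefix.decode("utf-8").encode("utf-16-le")) // 2

    return units(data[:start]), units(data[:end])


# Made-up sample containing the U+FB01 ligature before the target word.
text = "the \ufb01rst quick brown fox"
data = text.encode("utf-8")

# "quick" occupies bytes 11..16 because the ligature takes 3 bytes.
span = byte_span_to_utf16_span(data, 11, 16)
clipped = text[span[0]:span[1]]
print(span, clipped)
```

For text limited to the Basic Multilingual Plane, these offsets are the same as Python 3 character offsets; they differ only for characters that need surrogate pairs in UTF-16.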