Tags: python, json, encoding, fonts, hindi

How to parse Word files with Hindi text in the KrutiDev font using Python and save it as JSON


I'm trying to parse Word (.docx) files that contain Hindi text written in the KrutiDev 010 font. When I process the text with Python and write it to JSON, I get what looks like random gibberish instead of the expected Hindi text.

Here’s what I’ve done so far:

  1. I passed ensure_ascii=False to json.dump() so that non-ASCII characters are written as-is, since Python's JSON encoder escapes them by default.

  2. Despite this, the output is still incorrect and doesn't display the Hindi text properly.

I suspect the issue is related to the KrutiDev font. When I paste the gibberish into this converter, it gives me the correct Hindi text:

krutidev to unicode converter

How can I correctly parse and encode the Hindi text in KrutiDev font into Unicode and save in JSON files?

Are there any Python libraries or methods to handle such font-specific encodings effectively?


Solution

  • Specifically for KrutiDev, have a look at Unicode_KrutiDev_converter.py. I am not sure which version of KrutiDev it supports. The mapping seems to change a bit between versions of the font.

    A more generic, but more challenging, approach is to use palaso-python. It has to be installed from GitHub, as it's not available on PyPI. Documentation is limited: other than a requirements file, there is no information on external dependencies. I am assuming that you need both icu4c and teckit available on your system. The teckit application and libraries are available for Windows, Linux and macOS: https://software.sil.org/teckit/#downloads.

    You will need to download the KrutiDev 010 [mapping files](https://github.com/silnrsi/wsresources/tree/master/scripts/Deva/legacy/kruti-dev-010/mappings). There are two files:

    1. KrutiDev010.map - the source file for the mapping
    2. KrutiDev010.tec - the compiled version of the mapping. This is the one you need.

    To install palaso-python:

    pip install -U git+https://github.com/silnrsi/palaso-python.git
    

    then:

    # import the teckit Python wrapper (only the two names used below)
    from palaso.teckit.engine import Mapping, Converter
    
    # create the mapping from the compiled teckit mapping file.
    # It is in my cwd here; give the full path to the file if yours is elsewhere.
    m = Mapping('KrutiDev010.tec')
    
    # create converters in required directions
    # dec = forward direction (KrutiDev to Unicode)
    # enc = reverse direction (Unicode string to KrutiDev)
    dec = Converter(m)
    enc = Converter(m, forward=False)
    

    Think of the two directions of the converter as decoding and encoding. The forward direction of the KrutiDev010 mapping converts the legacy encoding to Unicode; the reverse direction converts Unicode to KrutiDev.

    So dec is similar to a bytes.decode() operation, and enc is similar to a str.encode() operation. This means that dec requires a sequence of bytes in the legacy encoding and outputs a Python 3 string, while enc takes a Python 3 string and outputs a sequence of bytes in the legacy encoding.

    If you already have the KrutiDev text as a byte sequence, all is fine. If not, you need to convert the string to bytes:

    kds = 'fgUnh'
    kdb = kds.encode('latin1')
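
The latin1 step works because that codec maps each code point from U+0000 to U+00FF to the single byte with the same value, so the string's characters become the legacy byte values unchanged. A minimal sketch of that step with an explicit error for characters outside that range follows. The helper name legacy_str_to_bytes is hypothetical, and whether latin1 or another code page is right for a given document is an assumption to check against your own data.

```python
def legacy_str_to_bytes(text, codec="latin1"):
    """Turn font-encoded text (ordinary characters with a legacy meaning) into bytes.

    Hypothetical helper: it only wraps str.encode() so that a character the
    codec cannot represent is reported clearly instead of failing deep in a pipeline.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError as err:
        bad = text[err.start:err.end]
        raise ValueError(
            f"{bad!r} cannot be encoded as {codec}; "
            "check which code page the mapping expects"
        ) from err


kdb = legacy_str_to_bytes("fgUnh")
print(kdb)        # b'fgUnh'
print(list(kdb))  # [102, 103, 85, 110, 104]
```

If the helper raises, the text contains characters above U+00FF (curly quotes are a common case), and the byte values the mapping expects for them need to be worked out before conversion.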
    

    Then decode the bytes to a Unicode string:

    uni = dec.convert(kdb, finished=True)
    print(uni)
    # हिन्‍दी
    

    A quick note at this point: the encoding conversion errs on the side of caution and introduces a character that may or may not be required. It doesn't affect the display of the string. The string uni is composed of the code points U+0939 U+093F U+0928 U+094D U+200D U+0926 U+0940. U+200D (ZERO WIDTH JOINER) isn't needed in this context, but there are times when it is necessary.

    In this example it can be stripped out. In data processing it can be systematically replaced based on requirements or context.

    So to globally replace:

    # uni = uni.replace('\u200d', '')
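
To see what the converter actually produced before deciding whether to strip anything, it can help to list the code points. The sketch below uses only the standard library; the helper names describe and strip_zwj are hypothetical, and the sample string is built from the code points listed above rather than from a real conversion.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def describe(text):
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def strip_zwj(text):
    """Remove every ZERO WIDTH JOINER; only safe when none of them are meaningful."""
    return text.replace(ZWJ, "")


# Same code points as the converted string in the answer.
uni = "\u0939\u093f\u0928\u094d\u200d\u0926\u0940"
for entry in describe(uni):
    print(entry)

cleaned = strip_zwj(uni)
print(len(uni), len(cleaned))  # 7 6
```

A global strip like this is the bluntest option. If some joiners in your data are intentional, the replacement would need to look at the surrounding characters instead.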
    

    Use enc to convert to the legacy encoding:

    print(enc.convert(uni, finished=True))
    # b'fgUnh'
    print(kdb == enc.convert(uni, finished=True))
    # True
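
To tie this back to the question, here is a rough sketch of how the pieces might fit together: read the runs of a .docx file, convert only the text set in the legacy font, and write the result with ensure_ascii=False. Several things here are assumptions and not something the answer above confirms: the use of python-docx for reading, the font name "Kruti Dev 010" as it appears in the document, and the helper names convert_runs and read_docx_runs. The demo at the bottom uses a stand-in converter so that it runs without palaso-python or teckit installed.

```python
import io
import json
from itertools import groupby

# Assumption: the font name exactly as Word stores it; check your own files.
LEGACY_FONTS = {"Kruti Dev 010"}


def convert_runs(runs, convert):
    """Convert one paragraph given as (font_name, text) pairs.

    Adjacent runs with the same font are joined before conversion, so a word
    split across runs is converted as a whole. `convert` is any callable
    taking legacy bytes and returning a Unicode str.
    """
    parts = []
    for font_name, group in groupby(runs, key=lambda run: run[0]):
        text = "".join(run_text for _, run_text in group)
        if font_name in LEGACY_FONTS:
            text = convert(text.encode("latin1"))
        parts.append(text)
    return "".join(parts)


def read_docx_runs(path):
    """Yield one list of (font_name, text) pairs per paragraph.

    Assumes the third-party python-docx package. run.font.name is None when
    the font is inherited from a style, which this sketch does not resolve.
    """
    from docx import Document  # imported lazily so the demo below runs without it

    for para in Document(path).paragraphs:
        yield [(run.font.name, run.text) for run in para.runs]


# Demo with a stand-in converter (a lookup table, NOT the real mapping).
def fake_convert(data):
    return {b"fgUnh": "\u0939\u093f\u0928\u094d\u0926\u0940"}[data]


runs = [("Kruti Dev 010", "fgU"), ("Kruti Dev 010", "nh"), ("Arial", " (Hindi)")]
record = {"paragraphs": [convert_runs(runs, fake_convert)]}

buf = io.StringIO()
json.dump(record, buf, ensure_ascii=False)
print(buf.getvalue())
```

With the real converter, the callable would wrap the call shown earlier, for example lambda data: dec.convert(data, finished=True), and the output file would be opened with encoding='utf-8'. Note that JSON itself was never the problem: it stores whatever string it is given, so the text has to be converted to Unicode before it is dumped.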
    

    Alternatively, if you are on Windows, an easier approach is to convert the Word document to Unicode in situ. The developers of teckit also have a Word add-in, SIL Converters, which can convert the Word document itself to Unicode.