python, hex, ascii, mainframe, ebcdic

Convert EBCDIC file to ASCII using Python 2


I need to convert EBCDIC files to ASCII using Python 2.

An extract from a sample file looks like this in Notepad++:

[image: extract of the sample file opened in Notepad++]

I tried decoding it with 'cp500' and then encoding it as 'utf8' in Python, like this:

with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('cp500').encode('utf8').strip()
    print line

And this:

import io

# io.open rejects an encoding argument in binary mode, so open in text mode
with io.open(path, 'r', encoding="cp500") as input_file:
    line = input_file.read()
    print line

I also tried with codecs:

import codecs

with codecs.open(path, 'rb') as input_file:
    line = input_file.read()
    line = codecs.decode(line, 'cp500').encode('utf8')
    print line

I also tried installing and importing the ebcdic module, but it doesn't seem to work properly. Here is the sample output for the first 58 characters:

[image: sample output for the first 58 characters]

It does turn some bytes into human-readable values, but the output doesn't seem to be 100 percent ASCII. For example, the 4th character in the input file is 'P' (after the first three NUL bytes), and if I open the file in hex mode, the hex code for that 'P' is 0x50, which maps to the character 'P' in ASCII. But the code above gives me the character '&' for it in the output, which is the EBCDIC character for hex value 0x50.

I also tried the code below:
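The mismatch can be reproduced with a single byte. The snippet below is only a small illustrative sketch (it is not one of the original attempts, and `decode_both` is a hypothetical helper name): it decodes the same byte once as ASCII and once as cp500. It also shows that 0xf0, which is not valid ASCII at all, is the digit '0' in cp500.

```python
from __future__ import print_function


def decode_both(raw):
    """Return (ascii_text, cp500_text) for the same raw bytes."""
    return raw.decode('ascii'), raw.decode('cp500')


# 0x50 is 'P' when read as ASCII but '&' when read as EBCDIC (cp500).
as_ascii, as_ebcdic = decode_both(b'\x50')
print(as_ascii, as_ebcdic)

# 0xf0 has no ASCII meaning, but it is the digit '0' in cp500.
digit = b'\xf0'.decode('cp500')
print(digit)
```

So a byte-for-byte decode of the whole file can only be right if every byte really is EBCDIC text.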

with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line

It gives me the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 4: invalid continuation byte

And if I change 'utf8' to 'latin1' in the above code, it generates the same output as the input file shown above, as opened in Notepad++.

Can anyone please help me with how to transform the EBCDIC to ASCII correctly?

Should I build my own mapping table/dictionary to transform EBCDIC to ASCII, i.e. read the file data as hex codes and look up the corresponding ASCII character for each one? If I do so, hex 0x40 is a space and 0xe2 is 'S' in EBCDIC, whereas in ASCII 0x40 is '@' and 0xe2 has no mapping at all. Judging by the input data, it looks like I need the EBCDIC characters in those cases. So should I look at the input data, decide for each hex value whether I want the EBCDIC or the ASCII character, and build the lookup map accordingly?

Or do I need to follow some other approach to parse the data correctly?

Note: the non-alphanumeric data is needed as well. There are images encoded at particular places in the input file, in those non-alphanumeric/alphanumeric characters, which we can extract, so I am not sure whether I need to convert that data to ASCII or leave it as is.

Thanks in advance


Solution

  • Posting how I was able to convert the EBCDIC data to ASCII, for others who run into this.

    I learned that only the non-binary, alphanumeric data needed to be converted from EBCDIC to ASCII. To know which data that is, one has to understand the format/structure of the input file. Since I knew the structure of the input file, I knew which fields/bytes needed conversion, and I converted only those bytes, leaving the binary data exactly as it was in the input file.

    Earlier I was trying to convert the whole file to ASCII, which also converted the binary data and so distorted it. Once I converted only the required alphanumeric fields, based on the structure of the file, and processed those, it worked.
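
The approach described above can be sketched roughly as follows. This is a minimal illustration and not the actual code or file format: the `LAYOUT` table, its field names, offsets, and lengths are all hypothetical and would have to come from the real specification of the input file. It assumes fixed-width fields and the cp500 code page.

```python
from __future__ import print_function

# Hypothetical fixed-width layout: (field name, offset, length, is_text).
# The real offsets and lengths depend on the documented file format.
LAYOUT = [
    ('header', 0, 4, False),   # binary, left untouched
    ('name', 4, 5, True),      # EBCDIC text, decoded with cp500
    ('payload', 9, 3, False),  # binary (e.g. image bytes), left untouched
]


def convert_record(record, layout=LAYOUT, codec='cp500'):
    """Return a dict of fields, decoding only those flagged as text."""
    fields = {}
    for name, offset, length, is_text in layout:
        chunk = record[offset:offset + length]
        fields[name] = chunk.decode(codec) if is_text else chunk
    return fields


# Made-up sample record: 4 binary bytes, 'HELLO' in EBCDIC, 3 binary bytes.
sample = b'\x00\x00\x00\x50' + b'\xc8\xc5\xd3\xd3\xd6' + b'\xff\x00\x10'
result = convert_record(sample)
print(result['name'])
```

Text fields come back as decoded text and can be written out with `.encode('ascii')` if they contain only ASCII characters; the binary fields are returned byte for byte, so embedded data such as images is not distorted.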