
Unable to read a mainframe generated file in Python


I am trying to read a mainframe-generated file using the following code. The end goal is to load it into a DataFrame.

import codecs

with open(r'C:\Users\743622\Downloads\UC005\USA.CKN.D050920', "rb") as ebcdic:
    ascii_txt = codecs.decode(ebcdic, "cp500")
    print(ascii_txt)

I get the following error in doing so:

----> 5     ascii_txt = codecs.decode(ebcdic, "cp500")
      6     print(ascii_txt)
      7 

TypeError: decoding with 'cp500' codec failed (TypeError: a bytes-like object is required, not '_io.BufferedReader')

I am also adding a glimpse of what the input file looks like in Notepad++: [Snapshot of data in Notepad++]


Solution

  • The data shown in the Notepad++ screenshot contains a lot of NUL characters, i.e. x'00' values. This indicates that the mainframe file does not consist of printable characters only, but also contains binary data. This is quite normal with mainframe files.

    I have taken the first few characters from the screenshot and read them as EBCDIC:

    ÃÒÕ.....€..".ÆÅÙÙÅÙÖ@ÙÁÆÆÁÅÓÓÖ
    

    The sequence '.....' is NUL NUL NUL EOT NUL, and the sequence '..".' is NUL NUL " SI.

    The result is

    'CKN' '0000000400800000220F'x 'FERRERO RAFFAELLO'
    

    That is three characters, followed by a 10-byte binary sequence, followed by another character string.

    My point is, you cannot simply convert this whole file between EBCDIC and ASCII. You need to understand which bytes are EBCDIC characters and which bytes are binary data. For the latter, you also need to understand the binary data in more detail. The provider of this mainframe file should be able to tell you how the records are built.
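
To make the point above concrete, here is a minimal sketch that splits the sample bytes from the answer into fields and decodes only the text parts with cp500. The names LAYOUT and split_record and the field names are hypothetical, and the field lengths (3, 10, 17) are only read off the 30 bytes shown above; the real record layout has to come from the provider of the file. The TypeError in the question is a separate issue: codecs.decode needs the bytes themselves (for example the result of ebcdic.read()), not the open file object.

```python
# The first 30 bytes of the file, reconstructed from the answer above.
sample = bytes.fromhex(
    "C3D2D5"                              # 'CKN' in EBCDIC
    "0000000400800000220F"                # 10 bytes of binary data
    "C6C5D9D9C5D9D640D9C1C6C6C1C5D3D3D6"  # 'FERRERO RAFFAELLO' in EBCDIC
)

# Hypothetical layout: (field name, length in bytes, is it EBCDIC text?).
# ASSUMPTION: guessed from the sample only, not a documented record format.
LAYOUT = [
    ("code", 3, True),
    ("binary", 10, False),
    ("name", 17, True),
]


def split_record(record, layout=LAYOUT, encoding="cp500"):
    """Cut one record into fields; decode text fields, keep binary ones as hex."""
    fields = {}
    offset = 0
    for name, length, is_text in layout:
        chunk = record[offset:offset + length]
        if is_text:
            fields[name] = chunk.decode(encoding)
        else:
            # The meaning of the binary bytes is unknown here, so only show them as hex.
            fields[name] = chunk.hex().upper()
        offset += length
    return fields


fields = split_record(sample)
print(fields)
```

With a real file, the bytes would come from reading the file opened in "rb" mode, cut into records according to the documented record length. A list of such dictionaries, one per record, can then be passed to pandas.DataFrame.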