Tags: ascii, google-cloud-dataflow, apache-beam, ebcdic, jrecord

Convert EBCDIC to ASCII in Apache Beam


I am trying to convert an EBCDIC file to ASCII using the CobolIoProvider class from JRecord in Apache Beam.

Code that I'm using:

CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(
        Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME,
        CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);

The code reads and converts the file as required, but only when cobolfilename and copybookname are local paths (to the EBCDIC file and the copybook, respectively). When I try to read the files from GCS, it fails with a FileNotFoundException: “The filename, directory name, or volume label syntax is incorrect”.

Is there a way to read a COBOL (EBCDIC) file from GCS using the CobolIoProvider class?

If not, is there another class that can convert a COBOL (EBCDIC) file to ASCII and that allows the files to be read from GCS?

Using ICobolIOBuilder:

Code that I’m using:

ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl")
        .setFileOrganization(Constants.IO_FIXED_LENGTH)
        .setSplitCopybook(CopybookLoader.SPLIT_NONE);

AbstractLineReader reader = iob.newReader(bs); // bs is an InputStream over my COBOL file

However, I have a few concerns:

1) I have to keep copybook.cbl locally. Is there any way to read the copybook file from GCS? I tried the code below, which reads the copybook from GCS into a stream and passes that stream to loadCopyBook(), but it didn't work.

Sample code below:

InputStream bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
        .loadCopyBook(bs, " copybook.cbl",
                CopybookLoader.SPLIT_NONE, 0, "",
                Constants.USE_STANDARD_COLUMNS,
                Convert.FMT_INTEL, 0, new TextLog())
        .asLayoutDetail();

AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);

reader.open(inputStream, schema);

2) Reading the EBCDIC file from a stream using newReader didn't convert my file to ASCII.

Thanks.


Solution

  • I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the JRecord code to use JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.

    String encoding = "cp037"; // cp037/IBM037: US EBCDIC; cp273: German EBCDIC
    ICobolIOBuilder iob = JRecordInterface1.COBOL
            .newIOBuilder("CopybookFile.cbl")
            .setFileOrganization(Constants.IO_FIXED_LENGTH)
            .setFont(encoding); // you should set the encoding if you can

    AbstractLineReader reader = iob.newReader(datastream);
    

    With the IO-Builder interface you can use streams. The question "Stream file from Google Cloud Storage" is about creating a stream from GCS and may be useful. Hopefully someone with more knowledge of GCS can help.
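To illustrate why a stream plus an explicit encoding matters, here is a minimal sketch that uses only the JDK, not JRecord. It reads fixed-length records from any InputStream and decodes them as cp037 text. The class name EbcdicStreamSketch and the method readRecords are hypothetical names made up for this example. It assumes the JDK's extended charsets (the jdk.charsets module, which provides Cp037) are installed, and it only handles plain text (PIC X) fields; binary and packed-decimal (COMP-3) fields need a copybook-aware library such as JRecord.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is not part of JRecord.
final class EbcdicStreamSketch {

    /** Reads fixed-length records from any stream and decodes each one as text. */
    static List<String> readRecords(InputStream in, int recordLength, String encoding)
            throws IOException {
        Charset charset = Charset.forName(encoding); // e.g. "Cp037" for US EBCDIC
        List<String> records = new ArrayList<>();
        byte[] buffer = new byte[recordLength];
        int filled = 0;
        while (true) {
            // A single read() may return fewer bytes than requested, so keep filling.
            int n = in.read(buffer, filled, recordLength - filled);
            if (n < 0) {
                break; // end of stream
            }
            filled += n;
            if (filled == recordLength) {
                records.add(new String(buffer, charset)); // EBCDIC bytes -> Java String
                filled = 0;
            }
        }
        if (filled != 0) {
            throw new EOFException("Trailing partial record of " + filled + " bytes");
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // "HELLO12345" encoded in cp037: letters are 0xC1.., digits are 0xF0..0xF9.
        byte[] ebcdic = {
            (byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6,
            (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5
        };
        for (String record : readRecords(new ByteArrayInputStream(ebcdic), 5, "Cp037")) {
            System.out.println(record);
        }
    }
}
```

The method only depends on InputStream, so the source of the bytes (a local file, a GCS object, a test array) does not matter to the decoding step. The same separation is what makes iob.newReader(datastream) usable with a stream obtained from GCS.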

    Alternatively, you could read from GCS directly and create data lines (data records) using the newLine method of a JRecord IO-Builder:

         AbstractLine l = iob.newLine(byteArray);
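For a fixed-length file, the byte arrays passed to newLine would each hold exactly one record. The following is a rough sketch of that splitting step using only the JDK. RecordSplitSketch and splitFixedLength are hypothetical names for this example, and the record length is assumed to be known in advance (normally it is derived from the copybook).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; this is not part of JRecord.
final class RecordSplitSketch {

    /** Splits a buffer of raw file bytes into one byte[] per fixed-length record. */
    static List<byte[]> splitFixedLength(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (data.length % recordLength != 0) {
            throw new IllegalArgumentException(
                    "data length " + data.length + " is not a multiple of " + recordLength);
        }
        List<byte[]> records = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            // Copy the slice so each record owns its bytes.
            records.add(Arrays.copyOfRange(data, offset, offset + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6};
        List<byte[]> records = splitFixedLength(data, 3);
        System.out.println(records.size() + " records");
        // Each element could then be handed to iob.newLine(record), as in the answer.
    }
}
```

The bytes are left undecoded on purpose: with this approach the EBCDIC interpretation is done by JRecord, using the copybook and the encoding set on the builder.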
    

    I will look at creating a basic read/write interface to JRecord so that JRecord users can write their own interface to GCS, IBM's mainframe access (ZFile), etc. But this will take time.