I have a requirement to read a Hive table and write its contents as a text file in EBCDIC encoding, since the file will be used as input to a mainframe process. We are using the cp037 encoding. To do that, I read the Hive table into a DataFrame, call a custom UDF to convert each numeric value into COMP-3 format (the packed-decimal format mainframes use to store numbers), and then decode those bytes using cp037 for the mainframe EBCDIC charset. The problem is that while writing to a text file or displaying the values, extra junk characters are getting added.
First I made sure that the custom UDF works correctly. To check, I encoded the output of the UDF back using cp037, and the output below came out perfect and as expected. Here newcol1 is the mainframe COMP-3 equivalent of field1, and newcol2 is the COMP-3 equivalent of field2.
root
|-- field1: long (nullable = true)
|-- newcol1: binary (nullable = true)
|-- field2: long (nullable = true)
|-- newcol2: binary (nullable = true)
+---------------+----------------------+---------------------------+----------------+
|field1 |newcol1 |field2 |newcol2 |
+---------------+----------------------+---------------------------+----------------+
|2023097000 |[00 02 02 30 97 00 0C]|320264 |[00 03 20 26 4F]|
|2023097000 |[00 02 02 30 97 00 0C]|343012 |[00 03 43 01 2F]|
|2023100000 |[00 02 02 31 00 00 0C]|343012 |[00 03 43 01 2F]|
|2023100000 |[00 02 02 31 00 00 0C]|320264 |[00 03 20 26 4F]|
+---------------+----------------------+---------------------------+----------------+
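For reference, since the custom UDF itself isn't shown here: a minimal sketch of COMP-3 packing that reproduces the bytes above could look like the following (pack_comp3, the digit counts, and comp3_udf are illustrative names and assumptions, not the original code).

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

def pack_comp3(value, digits, signed=True):
    # COMP-3 (packed decimal): two digits per byte, sign in the low
    # nibble of the last byte (C = positive, D = negative, F = unsigned)
    sign = 0x0D if value < 0 else (0x0C if signed else 0x0F)
    nibbles = [int(d) for d in str(abs(value)).rjust(digits, "0")] + [sign]
    if len(nibbles) % 2:       # pad a leading zero to fill whole bytes
        nibbles.insert(0, 0)
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

# pack_comp3(2023097000, 12)          -> 00 02 02 30 97 00 0C (newcol1)
# pack_comp3(320264, 8, signed=False) -> 00 03 20 26 4F       (newcol2)
comp3_udf = udf(lambda v: pack_comp3(v, 12), BinaryType())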
But when I try to write or display the same data after decoding, junk characters come in:
root
|-- field1: long (nullable = true)
|-- newcol11: string (nullable = true)
|-- field2: long (nullable = true)
|-- newcol21: string (nullable = true)
+---------------+--------+---------------------------+--------+
|field1 |newcol11|field2 |newcol21|
+---------------+--------+---------------------------+--------+
|2023097000 |^@^B^B~Pp^@^L |320264 |^@^C~@^W| |
|2023097000 |^@^B^B~Pp^@^L |343012 |^@^C?^A^G |
|2023100000 |^@^B^B~Q^@^@^L |343012 |^@^C?^A^G |
|2023100000 |^@^B^B~Q^@^@^L |320264 |^@^C~@^W| |
+---------------+--------+---------------------------+--------+
The mainframe file looks like the one below (note that I have turned HEX ON to show the actual nibbles):
********************************
-------------------------------
2023097000...Ì.p..320264..Ì..|
FFFFFFFFFF00073900FFFFFF00722444
20230970000228070C32026403806F00
-------------------------------
If you look closely, the byte X'78' is the junk character that is getting added.
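A plausible explanation of where the junk comes from (inferred from the symptoms, so treat it as a hypothesis): the text writer emits UTF-8 no matter what, and COMP-3 bytes that cp037 decodes to code points above U+007F turn into multi-byte UTF-8 sequences on the way out. This is easy to reproduce in plain Python:

raw = bytes([0x00, 0x03, 0x20, 0x26, 0x4F])  # newcol2 for 320264

s = raw.decode("cp037")       # 0x20 decodes to U+0080, a C1 control char
written = s.encode("utf-8")   # what a UTF-8 text writer actually emits

print(written.hex())          # 0003c280177c -- six bytes, not the original five
print(written == raw)         # False: the extra byte shows up as "junk"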
The PySpark file write command I am using is:

df.coalesce(1).write.format('text').option("encoding","cp037").mode('overwrite').save('/some_location/test/comp3_outputdata/')
It seems that the text format does not support the encoding option while writing. According to the link below, PySpark does not support an encoding option when calling the text method:
Pass encoding option in Pyspark text method
I can't use any other format because my output file will be picked up directly by a mainframe process. Can somebody please help me with this problem?
In solving this particular problem statement we encountered multiple issues, so I am writing down the detailed process we followed to resolve them, for future reference:
Problem 1 -> The first issue concerned the MBCS conversion performed during the file transfer; the relevant IBM documentation is here: https://www.ibm.com/docs/en/cpl?topic=zos-mbcs-conversion-during-microsoft-windows-copy
Problem 2 -> In code page cp037, X'15' and X'25' are treated as interchangeable, both indicating NewLine. The issue we faced was that whenever the hex byte X'15' occurred inside a COMP-3 field, its value was changed to X'25'. We did a lot of research, and there seems to be no straightforward way around it; it is effectively a quirk of this particular code page. We discussed it with IBM and they suggested we use code page IBM1047 instead. Note that @Hogstrom had already mentioned this in his response, but somehow I didn't notice it. The file encoding was still UTF-8; only the destination encoding changed from cp037 to cp1047.
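To make the difference concrete, here is the mapping swap as Python sees it. Note that Python's standard library does not ship a cp1047 codec; the ebcdic package from PyPI registers one, which is an assumption about how the conversion was run:

import ebcdic  # pip install ebcdic; registers cp1047 among others

# cp037 and cp1047 swap the byte values used for NEL and LF:
assert b"\x15".decode("cp037")  == "\x85"  # NEL
assert b"\x25".decode("cp037")  == "\n"    # LF
assert b"\x15".decode("cp1047") == "\n"    # LF
assert b"\x25".decode("cp1047") == "\x85"  # NEL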
Problem 3 -> This time we noticed extra white space being added to some lines, with no obvious pattern; for example, no extra space was added for 2.7, but for 2.5 an extra character appeared. This caused a layout issue for us even though all the data issues were resolved. To solve it we forcibly re-encoded the entire PySpark output file from UTF-8 to cp1047. Since I was writing in text mode in PySpark, and that API does not support an encoding option, we wrote a separate Python program for the conversion, as sketched below.
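A minimal sketch of that conversion script (the paths are hypothetical, and it again assumes the ebcdic package for the cp1047 codec):

import ebcdic  # registers the cp1047 codec

SRC = "/some_location/test/comp3_outputdata/part-00000"   # hypothetical
DST = "/some_location/test/comp3_outputdata_ebcdic/out.dat"

with open(SRC, encoding="utf-8", newline="") as fin, open(DST, "wb") as fout:
    # newline="" stops Python from translating line endings, so every
    # code point is re-encoded one-for-one into its cp1047 byte
    fout.write(fin.read().encode("cp1047"))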
Problem 4 -> At this point we noticed an extra character, X'15', being added at the end of each record. This was to be expected, since the PySpark text write API uses \n as the default line separator. To resolve it we used a custom line separator and, in the follow-up Python code, used rstrip to remove it; see the sketch below. That fixed the issue.
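A sketch of how the two pieces could fit together, with df the same DataFrame as before. The '#' sentinel and the paths are stand-ins (pick a byte that can never occur in the data); lineSep is a documented write option of the text source in Spark 2.4+, and splitting on the sentinel achieves the same effect as the per-record rstrip described above:

# 1) write with a sentinel line separator instead of \n
df.coalesce(1).write.format('text') \
    .option('lineSep', '#') \
    .mode('overwrite') \
    .save('/some_location/test/comp3_outputdata/')

# 2) strip the sentinel in the conversion script before encoding
import ebcdic  # registers cp1047

SEP = "#"
SRC = "/some_location/test/comp3_outputdata/part-00000"   # hypothetical
DST = "/some_location/test/comp3_outputdata_ebcdic/out.dat"

with open(SRC, encoding="utf-8", newline="") as fin, open(DST, "wb") as fout:
    for rec in fin.read().split(SEP):
        if rec:                               # skip the empty trailing piece
            fout.write(rec.encode("cp1047"))  # records back-to-back, no X'15'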
Problem 5 -> Finally, with the MBCS SFT profile setting, the output file is always transferred in VB mode, whereas we needed to support both VB and FB. So we removed the MBCS SFT profile and used a simple SFT profile with mode=binary. That satisfied the requirement and produced a perfect file with a perfect layout.
I hope the steps above help anybody facing any type of issue while working on a similar requirement.