Tags: hadoop, mapreduce, hadoop2

Record Reader Split to convert Fixed Length to Delimited ASCII file


I have a 128 MB file, so it is split into 2 blocks (block size = 64 MB). I am trying to convert a fixed-length file to a delimited ASCII file using a custom RecordReader class.

Problem:

The first split of the file is processed correctly: when I query the output through a Hive table built on top of the data, the records look right, and the reader also goes to data node 2 to fetch the remaining characters of the last record. The second split, however, starts with a \n character, and the number of records is doubled.

Ex: 
First Split: 456   2348324534         34953489543      349583534
Second Split:
456         23           48324534             34953489543      349583534

To skip the characters that were already read as part of the first input split, the following code was added to the record reader:

FixedAsciiRecordReader(FileSplit genericSplit, JobConf job) throws IOException {
    // Round start up to the next multiple of recordByteLength
    if ((start % recordByteLength) > 0) {
        pos = start - (start % recordByteLength) + recordByteLength;
    } else {
        pos = start;
    }

    fileIn.skip(pos);
}
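One detail worth checking in the snippet above: java.io.InputStream.skip(n) is allowed to skip fewer than n bytes, and the code ignores its return value. The following is a minimal sketch of a loop that keeps skipping until the requested count is reached. SkipHelper and skipFully are hypothetical names used only for illustration, and the sketch assumes fileIn is a plain InputStream. If fileIn is a Hadoop FSDataInputStream, calling seek(pos) is the more direct option.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original record reader.
final class SkipHelper {
    /** Skips exactly count bytes, or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms end of stream
                throw new EOFException("Stream ended with " + remaining + " bytes left to skip");
            } else {
                remaining--; // read() consumed one byte
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("0123456789".getBytes(StandardCharsets.US_ASCII));
        skipFully(in, 4);
        System.out.println((char) in.read()); // next byte after skipping four
    }
}
```

In the reader, this would stand in for the bare fileIn.skip(pos) call, so a short skip cannot silently leave the stream in the middle of a record.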

Each record in the fixed-length input file ends with a \n character.

Should a value also be assigned to the start variable?


Solution

  • I found the solution to this problem. My fixed-length input file has a variable-length header that was not being skipped, so the position did not land exactly on the beginning of a record; it landed at (StartOfRecord - HeaderLength). As a result, each record began with a few characters (as many as the header length) from the previous record.

    Updated Code:

    if ((start % recordByteLength) > 0) {
        // Shift the aligned position by the header so it lands on a record start
        pos = start - (start % recordByteLength) + recordByteLength + headerLength;
    } else {
        pos = start;
    }

    fileIn.skip(pos);
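The alignment rule described in this answer can also be written as "find the first offset at or after start that has the form headerLength + k * recordByteLength". The sketch below illustrates that arithmetic only; it is not the original reader. RecordBoundary and firstRecordStart are hypothetical names, and it assumes that recordByteLength includes the trailing \n and that headerLength is already known for the file. Written this way, the first split (start = 0) skips the header, and a start that happens to be an exact multiple of recordByteLength is still moved onto a record boundary.

```java
// Hypothetical helper illustrating header-aware record alignment.
final class RecordBoundary {
    /**
     * Returns the offset of the first record that begins at or after splitStart,
     * for a file laid out as [header][record][record]...
     * Assumes recordByteLength counts the trailing newline of each record.
     */
    static long firstRecordStart(long splitStart, long headerLength, long recordByteLength) {
        if (splitStart <= headerLength) {
            return headerLength; // the first split only needs to skip the header
        }
        long intoRecords = splitStart - headerLength;    // bytes past the header
        long remainder = intoRecords % recordByteLength; // bytes into the current record
        return remainder == 0
                ? splitStart
                : splitStart + (recordByteLength - remainder);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 67108864, start of the second split
        // Illustrative values: a 10-byte header and 100-byte records
        System.out.println(firstRecordStart(0, 10, 100));         // 10
        System.out.println(firstRecordStart(blockSize, 10, 100)); // 67108910
    }
}
```

The record that straddles the split boundary is left to the previous split, which matches the behaviour described in the question, where the first split reads into the second block to finish its last record.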