I'd like to read a data file in Pig that uses a multi-character delimiter to separate fields (I've no requirement to write files this way). So my Pig Script will look something like:
myData = LOAD 'myFile' USING PigStorage('~|~') AS (col1:chararray, col2:chararray);
My issue is that PigStorage doesn't support multi-character delimiters.
Possible solutions are:
With respect to the second point, I've seen the much-copied pig.apache.org example, but the trouble is that this code won't compile (aside from the obvious syntax error, all the import statements are missing, so I don't know which versions of the classes need to be imported!)
If you know how many fields to expect, you could use org.apache.pig.piggybank.storage.MyRegExLoader. But you need to write a regex that can parse the entire line, so it's not as convenient as PigStorage.
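To give a feel for what "a regex that can parse the entire line" might look like for the two-column `~|~` case in the question, here is a minimal Java sketch. It only exercises the pattern with plain java.util.regex; it does not call Pig. The class name, the pattern itself, and the assumption that each capture group becomes one field are mine for illustration, so check them against the piggybank version you actually use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: tests a candidate line-parsing regex outside of Pig.
class DelimiterRegexDemo {
    // Assumed pattern for exactly two fields separated by the literal "~|~".
    // The pipe is escaped because "|" means alternation in a regex.
    static final Pattern TWO_FIELDS = Pattern.compile("(.*?)~\\|~(.*)");

    // Returns one string per capture group, or null if the whole line does not match.
    static String[] parse(String line) {
        Matcher m = TWO_FIELDS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // groups are 1-indexed
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = parse("foo~|~bar");
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```

Note that the first group is lazy, so a line with more delimiters than expected ends up with the extra text in the last field. That is why this approach only really works when the number of fields is known up front: each additional column needs its own capture group in the pattern.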