I'd like to load a lot of small files from HDFS with Pig and process them as tuples (filename, filecontent).
a=LOAD 'mydir' USING PigStorage('','-tagPath') AS (filepath:chararray, filecontents:chararray);
However, it seems I cannot omit the delimiter. Is there some sort of "NULL" delimiter in Pig, or is there another way to make sure the content of a file is not split?
You will have to write your own custom loader by extending LoadFunc.
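On the Pig side, using such a loader might look roughly like the sketch below. The jar name myloaders.jar and the class com.example.WholeFileLoader are hypothetical placeholders for whatever you build yourself; they are not part of Pig. The sketch also assumes your loader emits two chararray fields, the path and the full file content.

```pig
-- Hypothetical jar containing your own LoadFunc subclass (not shipped with Pig).
REGISTER myloaders.jar;

-- com.example.WholeFileLoader is an assumed class name for a loader that
-- returns one tuple per file: (path, entire content).
a = LOAD 'mydir' USING com.example.WholeFileLoader()
    AS (filepath:chararray, filecontents:chararray);
```

Since the loader itself decides where a record begins and ends, no delimiter needs to be passed in this variant.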
The short answer to your question is no. To make sure the content is not split, use a delimiter that does not occur in the content. That way, each line is loaded whole into the field filecontents:chararray. Note that PigStorage still reads its input line by line, so a file with several lines yields one tuple per line. So, assuming your input files do not contain the special character '~':
a=LOAD 'mydir' USING PigStorage('~','-tagPath') AS (filepath:chararray, filecontents:chararray);
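One way to check whether each file really ended up in a single tuple is to count the records per path. This is a minimal sketch that assumes the same 'mydir' input and the '~' delimiter from above; the aliases b and c are only illustrative.

```pig
a = LOAD 'mydir' USING PigStorage('~', '-tagPath')
    AS (filepath:chararray, filecontents:chararray);

-- Group the loaded records by the path that -tagPath prepended.
b = GROUP a BY filepath;

-- Count how many records each file produced.
c = FOREACH b GENERATE group AS filepath, COUNT(a) AS records;

DUMP c;
```

A count above 1 for a file suggests that it contains newlines and was loaded as several tuples. In that case the delimiter trick alone is not enough and a custom loader is the safer route.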