Tags: java, hadoop, apache-pig, glob

How to load a specific range of input files in Pig


I have a set of input files to process using Pig, with the following naming structure:

/user/hdp/input/custom/Fold1/train0.txt
/user/hdp/input/custom/Fold1/train1.txt
/user/hdp/input/custom/Fold1/train2.txt
/user/hdp/input/custom/Fold1/train3.txt
...
/user/hdp/input/custom/Fold1/train9.txt
/user/hdp/input/custom/Fold1/train10.txt
/user/hdp/input/custom/Fold1/train11.txt
/user/hdp/input/custom/Fold1/train12.txt
...

up to train file 99. I build my Pig script dynamically as a Java String, which I then submit to my cluster. I am looking for a general solution to load the range of train files from 0 up to some number x, where I can set this x to any Java int up to 99.

In a previous version of my solution, which supported values of x up to 9, I used Pig's support for globs in the following way:

pigString += "TRAIN = LOAD '/user/hdp/input/custom/Fold1/train[0-"+x+"].txt' USING PigStorage(' ');";

This approach does not scale to values greater than 9, since from 10 onward the file number takes up two characters instead of one. One potential solution would be splitting x into its tens and units digits and using these to build the Pig string:

int tens   = x/10;
int single = x%10;
if(tens>0)
    pigString += "TRAIN = LOAD '/user/hdp/input/custom/Fold1/train[0-"+tens+"][0-"+single+"].txt' USING PigStorage(' ');";
else
    pigString += "TRAIN = LOAD '/user/hdp/input/custom/Fold1/train[0-"+single+"].txt' USING PigStorage(' ');";

This solution, however, has two problems.

  1. When x>9, files train0 to train9 are not loaded, because the glob matches the two-digit numbers 00, 01, and 02 instead of the single-digit versions 0, 1, and 2. I did not find any support in Hadoop globs for matching the first [0-"+tens+"] part zero or one times (like ? in regular expressions).
  2. When single is smaller than 9, for every value of the tens digit below tens the files are only loaded up to that units digit. Let's say x = 24: then the code above only loads train10 to train14, but not train15 to train19. I did not find anything in the Hadoop glob documentation to make the second matched digit depend on the first matched digit.

Does anyone know a generic solution to load my range of data files up to any value of x? I don't know if I'm on the right track using globs, so any non-glob solution would also be very much appreciated.

Many thanks in advance!


Solution

  • I looked at the Hadoop glob syntax, and it seems this is easier to do than we initially thought.

    Create a comma-separated string of all the numbers you are interested in and call it expectedNumbers, e.g. expectedNumbers = "0,1,2,3,4,5", and then use it inside a {} glob group as below:

    pigString += "TRAIN = LOAD '/user/hdp/input/custom/Fold1/train{" + expectedNumbers + "}.txt' USING PigStorage(' ');";
    

    Hope this helps.
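
    To make this concrete, here is a minimal Java sketch of the approach (the class and method names are mine, not from the question): it builds the comma-separated list for any x up to 99 and assembles the LOAD statement around a `{...}` glob group.

    ```java
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class PigGlobBuilder {

        // Builds "0,1,2,...,x" for use inside a Hadoop {a,b,c} glob group.
        static String expectedNumbers(int x) {
            return IntStream.rangeClosed(0, x)
                    .mapToObj(Integer::toString)
                    .collect(Collectors.joining(","));
        }

        // Assembles the full LOAD statement for the asker's input path.
        static String loadStatement(int x) {
            return "TRAIN = LOAD '/user/hdp/input/custom/Fold1/train{"
                    + expectedNumbers(x)
                    + "}.txt' USING PigStorage(' ');";
        }

        public static void main(String[] args) {
            // For x = 3 this prints:
            // TRAIN = LOAD '/user/hdp/input/custom/Fold1/train{0,1,2,3}.txt' USING PigStorage(' ');
            System.out.println(loadStatement(3));
        }
    }
    ```

    Because each number is listed explicitly, the glob matches train0 through trainX exactly, so neither of the two digit-range problems from the question arises.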