Pig: parsing a line with blank delimiters

I'm using Hadoop Pig (0.10.0) to process log files. A log line looks like this:

2012-08-01  INFO   (User:irim)   getListedStocksByMarkets completed in 7041 ms

I would like to get a relation whose tokens are split on blanks, that is:

(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)

Loading the data with this statement:

records = LOAD 'myapp.log' using PigStorage(' ');

did not achieve that, because my tokens can be separated by several whitespace characters, which leads to several empty tokens. PigStorage does not seem to support a regexp delimiter (or at least I haven't succeeded in configuring it that way).

So my question is: what would be the best way to get those tokens?

If I could remove the empty elements from a relation I would be happy. Is it possible to do that with Pig?

For example, starting from:

(2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms)

to get:

(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)

I'm trying another approach with TextLoader followed by TOKENIZE, but I'm not sure it's the best strategy. Maybe a user-defined load function is a more natural choice...

Regards,

Joel

Solution

You can use the built-in function STRSPLIT with a regular expression to break a line into a tuple. Here is a script for your particular example, with a comma as the separator:

inpt = load '~/data/regex.txt' as (line : chararray);
dump inpt;
-- 2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms

splt = foreach inpt generate flatten(STRSPLIT(line, ',+'));
dump splt;
-- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
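
Since the original log is separated by runs of blanks rather than commas, the same idea can probably be applied directly to the raw line. Below is a minimal sketch, assuming the file is 'myapp.log' as in the question and that tokens are separated by spaces only (not tabs). The relation names raw and tokens are illustrative, not from the original script.

```pig
-- Load each log line whole, as a single chararray field.
raw = LOAD 'myapp.log' USING TextLoader() AS (line:chararray);

-- TRIM removes leading/trailing blanks so the split yields no empty edge token;
-- ' +' is a regular expression matching one or more consecutive spaces.
tokens = FOREACH raw GENERATE FLATTEN(STRSPLIT(TRIM(line), ' +'));
DUMP tokens;
-- expected for the sample line from the question:
-- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
```

If the log may also contain tabs, a broader whitespace pattern would be needed in place of ' +'.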