
Pig: parsing a line with a blank delimiter


I'm using Hadoop Pig (0.10.0) to process log files. A log line looks like this:

2012-08-01  INFO   (User:irim)   getListedStocksByMarkets completed in 7041 ms

I would like to get a relation with the tokens split on blanks, that is:

(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)

Loading that data with the statement:

records = LOAD 'myapp.log' using PigStorage(' ');

did not achieve that, because my tokens can be separated by several whitespace characters, which leads to several empty tokens. PigStorage does not seem to support a regexp delimiter (or at least I haven't succeeded in configuring it that way).

So my question: what would be the best way to get those tokens?

If I could remove empty elements from a relation I would be happy. Is it possible to do that with Pig?

For example, starting from:

(2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms)

To get:

(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)

I'm trying another approach with TextLoader followed by TOKENIZE, but I'm not sure it's the best strategy. Maybe a user-defined load function is a more natural choice...

Regards,

Joel


Solution

  • You can use the built-in function STRSPLIT with a regular expression to break a line into a tuple. Here is a script for your particular example, using a comma as the separator (the regex ',+' treats a run of consecutive commas as a single delimiter):

    inpt = load '~/data/regex.txt' as (line : chararray);
    dump inpt;
    -- 2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms
    
    splt = foreach inpt generate flatten(STRSPLIT(line, ',+'));
    dump splt;
    -- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
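The same idea should carry over to the original log format, where the separator is one or more blanks rather than commas. The following is only a sketch: it assumes the file name 'myapp.log' from the question, loads each line whole with TextLoader, and uses the pattern ' +' (one or more spaces). The alias names raw and tokens are illustrative.

```pig
-- Load each log line as a single chararray field (no field splitting on load).
raw = LOAD 'myapp.log' USING TextLoader() AS (line:chararray);

-- Split on runs of one or more spaces, so repeated blanks do not produce
-- empty tokens. If the logs may also contain tabs, a whitespace class
-- such as '\\s+' is the usual alternative (untested here).
tokens = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, ' +'));

DUMP tokens;
```

Note that a line starting with blanks would still yield an empty first token, so such lines may need trimming first, for example with TRIM(line).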