I'm using Hadoop Pig (0.10.0) to process log files. A log line looks like this:
2012-08-01 INFO (User:irim) getListedStocksByMarkets completed in 7041 ms
I would like to get a relation with the tokens split on blanks, that is:
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
Loading the data with the statement:
records = LOAD 'myapp.log' using PigStorage(' ');
did not achieve that, because my tokens can be separated by several whitespace characters, which leads to several empty tokens. PigStorage does not seem to support a regexp delimiter (or at least I haven't succeeded in configuring it that way).
So my question is: what would be the best way to get those tokens?
If I could remove empty elements from a relation I would be happy. Is it possible to do that with Pig?
For example, starting from:
(2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms)
to get:
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
I'm trying another approach with TextLoader followed by TOKENIZE, but I'm not sure it's the best strategy.
Maybe a user-defined load function is a more natural choice...
Regards,
Joel
You can use the built-in function STRSPLIT with a regular expression to break a line into a tuple. Here is a script for your particular example, with a comma as the separator:
inpt = load '~/data/regex.txt' as (line : chararray);
dump inpt;
-- 2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms
splt = foreach inpt generate flatten(STRSPLIT(line, ',+'));
dump splt;
-- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
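The same idea should carry over to the raw log file, where the separator is a run of whitespace rather than commas. Below is a minimal sketch, not a tested solution: it assumes the file is called 'myapp.log' as in the question, holds one log entry per line, and that the relation names (records, tokens) are only illustrative. TextLoader reads each line as a single chararray, and the regex '\\s+' treats any run of blanks or tabs as one separator.

```pig
-- Sketch only: 'myapp.log' is the file name from the question;
-- the relation names are placeholders.
-- TextLoader loads every line as one chararray field, with no field splitting.
records = LOAD 'myapp.log' USING TextLoader() AS (line:chararray);

-- '\\s+' matches one or more whitespace characters, so several
-- consecutive blanks between two tokens do not produce empty fields.
tokens = FOREACH records GENERATE FLATTEN(STRSPLIT(line, '\\s+'));

DUMP tokens;
```

Two caveats with this sketch. A line that starts with whitespace will probably still yield an empty first field, so you may want to TRIM the line before splitting. And as far as I know, TOKENIZE returns a bag rather than a tuple and by default also treats commas and parentheses as separators, so a token like (User:irim) would likely not survive intact with that approach.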