I have the following directory structure in HDFS:
logs_folder
|---2021-03-01
|   |---log1
|   |---log2
|   |---log3
|---2021-03-02
|   |---log1
|   |---log2
|---2021-03-03
|   |---log1
|   |---log2
...
Logs are made up of text data. There is no date in the data because it is already in the folder name. I want to read all the logs and save them in the following format:
date id
where id is a field from the log and the date is taken from the folder name. Expected output:
2021-03-01 id1
2021-03-01 id2
...
2021-03-02 id234
2021-03-02 id456
...
How can I add the date from the folder name to the output?
I found a related question, "How can I incorporate the current input filename into my Pig Latin script?", which shows how to add the full path to each record on load:
A = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath');
DUMP A;
This is very close, but how do I get only the parent folder name instead of the full path?
In the end I used this approach:
hadoop_data = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath')
    AS (filepath:chararray, id:chararray, feature:chararray, value:chararray);
-- Capture the last directory in the path, i.e. the date folder holding the file.
hadoop_data = FOREACH hadoop_data GENERATE id,
    (chararray) REGEX_EXTRACT(filepath, '.*\\/(.*)\\/', 1) AS path,
    feature, value;
My data consists of three fields (id, feature, value), but the schema above lists four: with -tagPath, PigStorage prepends the file path as an extra first field, declared here as filepath.
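To sanity-check what the pattern captures without launching a Pig job, the same expression can be tried in Python. This is only a sketch: it assumes REGEX_EXTRACT looks for the pattern anywhere in the string (as re.search does), and the sample paths and the parent_folder helper are made up for illustration.

```python
import re

# Same pattern as in the Pig script; '\\/' there is just an escaped slash.
PARENT_DIR = re.compile(r".*/(.*)/")


def parent_folder(filepath):
    """Return the name of the directory that directly contains the file, or None."""
    match = PARENT_DIR.search(filepath)
    return match.group(1) if match else None


# Hypothetical paths in the shape that -tagPath is expected to produce.
samples = [
    "hdfs://namenode:8020/logs_folder/2021-03-01/log1",
    "/logs_folder/2021-03-02/log2",
]
for path in samples:
    print(parent_folder(path), path)
```

The leading greedy `.*` swallows everything up to the second-to-last slash, so the capture group is left with only the segment between the last two slashes, which is the date folder.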