I have a large set of data with the fields id and details. details is either a map or a bag of maps. My end goal is one (id, key, value) entry for every entry in all the maps in the data.
In Pig 0.16 I could use FLATTEN freely to ensure I had just one map per line, then use a UDF to flatten the maps. But since 0.17, FLATTEN works on maps as well. As a result, after one use some of the data is exactly the way I want it, but the rest is still inside a map.
Essentially I need to use FLATTEN once for half the data and twice for the other half. Is there a way to detect the data type within a GENERATE statement, so that the data is only flattened if it's a map?
To illustrate, given
(ID1, [key1#val1,key2#val2])
(ID2, {[key3#val3, key4#val4]})
I want to generate
(ID1, key1, val1)
(ID1, key2, val2)
(ID2, key3, val3)
(ID2, key4, val4)
You basically need a UDF that tells you whether its input is a valid map. With such a UDF you could set up a ternary operation to FLATTEN only if a particular field is a valid map. Mozilla's Akela has, among many other things, the exact UDF you are looking for. You can find the Akela open source repository at https://github.com/mozilla-metrics/akela and the UDF of interest at https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/filter/map/IsMap.java
Usage would look similar to the following:
REGISTER <path_to_jar>/akela.jar;
DEFINE IsMap com.mozilla.pig.filter.map.IsMap();
data = LOAD '<path_to_data>';
dataFlattened = FOREACH data GENERATE
    $0,
    IsMap($1) ? FLATTEN($1) : $1;
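If the type dispatch ends up living inside your own UDF rather than in the GENERATE clause, the logic might look roughly like the plain-Java sketch below. This is not Akela's code and it does not use the Pig API: the class DetailsFlattener and its methods are hypothetical names, and a bag is assumed to arrive as an ordinary Java collection of maps (in a real Pig UDF it would be a DataBag of tuples). It only shows the "is it a map?" test and the resulting (id, key, value) rows.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Pig or Akela.
class DetailsFlattener {

    // The type test at the heart of an IsMap-style check.
    static boolean isMap(Object value) {
        return value instanceof Map;
    }

    // Returns one {id, key, value} row per map entry, whether details
    // is a single map or a collection ("bag") of maps.
    // Assumption: a bag is modelled as a plain java.util.Collection.
    static List<String[]> toRows(String id, Object details) {
        List<String[]> rows = new ArrayList<>();
        if (isMap(details)) {
            // Already a map: one level of flattening is enough.
            addEntries(id, (Map<?, ?>) details, rows);
        } else if (details instanceof Collection) {
            // A bag of maps: unwrap the bag first, then each map.
            for (Object element : (Collection<?>) details) {
                if (isMap(element)) {
                    addEntries(id, (Map<?, ?>) element, rows);
                }
            }
        }
        return rows;
    }

    // Appends one row per entry of the given map.
    private static void addEntries(String id, Map<?, ?> map, List<String[]> rows) {
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            rows.add(new String[] {
                id,
                String.valueOf(entry.getKey()),
                String.valueOf(entry.getValue())
            });
        }
    }
}
```

Calling toRows("ID1", map) on the first sample record and toRows("ID2", bagOfMaps) on the second would give the four rows listed in the question, under the assumptions above.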