apache-pig amazon-emr

How can you make Apache Pig FLATTEN based on data type?


I have a large set of data with a field of id and details. details is either a map, or a bag of maps. My end goal is an entry of id, key, value for every entry in all the maps in the data.

In Pig 0.16 I could use FLATTEN freely to ensure I had just one map per line, then use a UDF to flatten the maps. But since 0.17, FLATTEN works on maps as well. As a result, after one use some of the data is exactly the way I want it, but the rest is still inside a map.

Essentially I need to use FLATTEN once for half the data, and twice for the other half. Is there a way to detect data type within a GENERATE statement to only flatten the data if it's a map?

To illustrate, given

(ID1, [key1#val1,key2#val2])
(ID2, {[key3#val3, key4#val4]})

I want to generate

(ID1, key1, val1)
(ID1, key2, val2)
(ID2, key3, val3)
(ID2, key4, val4)

Solution

  • You basically need a UDF that tells you whether its input is a valid map. With such a UDF you can set up a ternary operation that applies FLATTEN only when a particular field is a valid map. Mozilla's Akela has, among many other things, the exact UDF you are looking for. The Akela open source repository is at https://github.com/mozilla-metrics/akela and the UDF of interest is at https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/filter/map/IsMap.java

    Usage would look similar to the following:

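    REGISTER <path_to_jar>/akela.jar;
    DEFINE IsMap com.mozilla.pig.filter.map.IsMap();
    
    data = LOAD '<path_to_data>';
    
    -- Flatten $1 only when it is a map. If your Pig version rejects
    -- FLATTEN inside a ternary, SPLIT on IsMap($1) and flatten each branch.
    dataFlattened = FOREACH data GENERATE
            $0,
            IsMap($1) ? FLATTEN($1) : $1;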
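
If pulling in Akela is not an option, the same type check could live in a small UDF of your own that normalizes both shapes into one list of pairs. The sketch below is plain Python that only models the logic; `to_pairs` and `explode` are hypothetical names, not part of Pig or Akela. It assumes a map arrives as a dict and a bag of maps as a list of one-field tuples that each wrap a dict, which is roughly how a Python UDF would see them, but verify that for your setup.

```python
def to_pairs(details):
    """Return a list of (key, value) tuples for a map or a bag of maps."""
    if details is None:
        return []
    if isinstance(details, dict):
        # A single map: one pair per entry.
        return list(details.items())
    pairs = []
    for item in details:
        # Bag elements are assumed to be tuples wrapping a map;
        # bare maps are accepted too, anything else is skipped.
        candidate = item[0] if isinstance(item, tuple) else item
        if isinstance(candidate, dict):
            pairs.extend(candidate.items())
    return pairs


def explode(rows):
    """Expand (id, details) rows into (id, key, value) rows."""
    return [
        (row_id, key, value)
        for row_id, details in rows
        for key, value in to_pairs(details)
    ]


# The two shapes from the question: a map, and a bag holding one map.
rows = [
    ("ID1", {"key1": "val1", "key2": "val2"}),
    ("ID2", [({"key3": "val3", "key4": "val4"},)]),
]

for out in explode(rows):
    print(out)
```

With a function like `to_pairs` wrapped as a UDF, every row would need exactly one FLATTEN regardless of its original shape, so no per-row type test would be needed in the GENERATE statement.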