hadoop, merge, apache-pig

How to Merge Maps in Pig


I am new to Pig, so bear with me. I have two data sources that have the same schema: a map of attributes. I know that some records will share a single identifiable overlapping attribute. For example:

Record A: {"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}

Record B: {"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}

I want to merge the records on Name such that:

Merged: {"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}

UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?

Something like:

A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];

merged = MERGE_ON infoMap#Name, A, B;

Solution

  • Pig by itself is quite limited when it comes to even slightly complex data transformations. I think you will need two UDFs to achieve this. The first accepts a map and returns a unique string representation of it, for example a hash of the map's contents (let's call it getHashFromMap()). This string is used as the key to join the two relations. The second accepts two maps and returns a merged map (let's call it mergeMaps()). Your script would then look as follows:

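    -- getHashFromMap and mergeMaps are the two UDFs described above;
    -- they must be registered (REGISTER/DEFINE) before this script runs.
    A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
    B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];

    -- Derive a join key from the Name attribute of each record.
    A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
    B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;

    -- Join on the derived key, then merge the two maps of each matched pair.
    AB = JOIN A2 BY joinKey, B2 BY joinKey;
    merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;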
    

    Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose is to produce a unique string representation of the attribute so that the datasets can be joined on it.
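
    To make the two UDFs a bit more concrete, here is a minimal sketch of their logic as plain Python functions. The names get_hash_from_map and merge_maps are hypothetical stand-ins for getHashFromMap() and mergeMaps(), and the sketch assumes the maps arrive as Python dicts whose list-like values are plain lists. The merge rule (concatenate lists without duplicates, otherwise keep the left value) is also an assumption; adjust it to whatever "merged" should mean for your data.

```python
import hashlib
import json


def get_hash_from_map(value):
    """Return a stable string key for a map (dict) or other JSON-like value."""
    # sort_keys makes the serialization independent of key order,
    # so two maps with equal contents always yield the same string.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def merge_maps(left, right):
    """Merge two maps; list values are concatenated without duplicates."""
    merged = dict(left)  # shallow copy so the inputs are not modified
    for key, value in right.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], list) and isinstance(value, list):
            # Build a new list: left items first, then unseen right items.
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        # Otherwise keep the left value (e.g. the shared "Name" map).
    return merged


record_a = {"Name": {"First": "Foo", "Last": "Bar"},
            "FavoriteFoods": ["Oranges", "Pizza"]}
record_b = {"Name": {"Last": "Bar", "First": "Foo"},
            "FavoriteFoods": ["Buffalo Wings"]}

# Same Name contents in a different key order still give the same join key.
same_key = get_hash_from_map(record_a["Name"]) == get_hash_from_map(record_b["Name"])
merged_record = merge_maps(record_a, record_b)
```

    If you go the Jython route, functions like these would still need Pig's @outputSchema decorator and a REGISTER ... USING jython statement before the script can call them, and the way Pig hands maps and bags to the script may differ from the plain dicts and lists assumed here. The same logic can also be written as Java UDFs.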