How can I generate schema from text file? (Hadoop-Pig)


I have a file, filename.log, whose lines look like this (tab-separated):

Name:Peter Age:18

Name:Tom Age:25

Name:Jason Age:35

Because the keys in each line may differ, I cannot declare a fixed schema when I load the file, as in

a = load 'filename.log' as (Name:chararray,Age:int);

Nor do I want to refer to columns by position, as in

b = foreach a generate $0,$1;

What I want is to be able to refer to each value by its key, using only filename.log, for example

a = load 'filename.log' using PigStorage('\t');

b = group a by Name;

c = foreach b generate group, COUNT(a);

dump c;

For that purpose, I wrote a Java UDF that separates each key:value pair and returns the value of every field in the tuple, as below

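import java.util.ArrayList;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SPLITALLGETCOL2 extends EvalFunc<Tuple>{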
    @Override
    public Tuple exec(Tuple input){
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        Tuple output;
        String target=input.toString().substring(1, input.toString().length()-1);
        String[] tokenized=target.split(",");
        try{
            for(int i=0;i<tokenized.length;i++){
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            output =  mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }catch(Exception e){
            output =  mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }
    }
}

How should I alter this method to get what I want? Or how should I write a different UDF to get there?


Solution

  • Whatever you do, don't use a tuple to store the output. Tuples are intended to hold a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or that they will exist at all, or that there won't be more of them), you should use a bag. A bag is an unordered collection of tuples. It can hold any number of tuples, as long as they share the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:

    {(Name,Foo),(Age,Bar)}
    {(Age,25),(Name,Jim)}
    {(Name,Bob)}
    {(Age,30),(Name,Roger),(Hair Color,Brown)}
    {(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
    

    However, it sounds like you really want a map:

    myudf.py

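    @outputSchema('M:map[]')
    def mapize(the_input):
        # the_input is one whole line, e.g. 'Name:Peter<TAB>Age:18'
        out = {}
        for kv in the_input.split('\t'):
            # Split on the first ':' only, so values may contain colons.
            k, v = kv.split(':', 1)
            out[k] = v
        return out
    

    myscript.pig

    register '../myudf.py' using jython as myudf ;
    
    -- TextLoader reads each line as a single chararray; the default
    -- PigStorage would split on tabs and keep only the first field.
    A = LOAD 'filename.log' USING TextLoader() AS (total:chararray) ;
    B = FOREACH A GENERATE myudf.mapize(total) ;
    
    -- Sample usage, grouping by the name key.
    C = GROUP B BY M#'Name' ;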
    

    Using the # operator, you can pull the value for a given key out of the map; a key that is missing from a row yields null. You can read more about maps in the Pig documentation.
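
For comparison, the bag-based alternative described at the start of this answer could be sketched as a second Jython UDF. This is only a sketch under a few assumptions: the name bagize is hypothetical, the tab delimiter is taken from the question, and the try/except shim is there only so the file can also be run outside Pig, where the outputSchema decorator is presumably not predefined.

```python
# Pig's Jython engine is expected to predefine outputSchema. This no-op
# fallback is an assumption that only matters when running outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema('B:{T:(key:chararray, value:chararray)}')
def bagize(the_input):
    """Turn one tab-separated line of key:value pairs into a list of
    (key, value) tuples, which Pig treats as a bag."""
    out = []
    for kv in the_input.split('\t'):
        if ':' not in kv:
            continue  # skip fields that are not key:value pairs
        # Split on the first ':' only, so values may contain colons.
        k, v = kv.split(':', 1)
        out.append((k, v))
    return out
```

In a script, applying FLATTEN to the result of myudf.bagize(total) should give one (key, value) row per pair, which suits counting or filtering by key. The map version remains the simpler choice when you only need to look up a few known keys.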