apache-spark, apache-pig, apache-spark-sql

Pig nested foreach in Spark 2.0


I am trying to convert a Pig script into a Spark 2 routine.

Within a groupBy I want to compute the number of elements that match particular states. The Pig code looks like this:

A = foreach (group payment by customer) {
    done = filter payment by state == 'done';
    doing = filter payment by state == 'doing';
    cancelled = filter payment by state == 'cancelled';
    generate group as customer, COUNT(done) as nb_done, COUNT(doing) as nb_doing, COUNT(cancelled) as nb_cancelled;
};

I would like to adapt this to a DataFrame, starting with payment.groupBy("customer").

Thanks!


Solution

  • Try something along these lines:

    Assume the customer table is registered in the Spark session with the schema below:

     customer.registerTempTable("customer");
     sparkSession.sql("describe customer").show();
    
    +--------+---------+-------+
    |col_name|data_type|comment|
    +--------+---------+-------+
    |      id|   string|   null|
    |   state|   string|   null|
    +--------+---------+-------+
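
    Note: in Spark 2.x, registerTempTable is deprecated; the same view can be created with:

     customer.createOrReplaceTempView("customer");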
    

    -- Group by id, using map(state, 1): a lookup on a missing key yields null, which count() ignores

    sparkSession.sql("select id, count(state['done']) as done, " +
                    "count(state['doing']) as doing, " +
                    "count(state['cancelled']) as cancelled " +
                    "from (select id, map(state,1) as state from customer) t group by id").show();
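
    If you prefer to stay in the DataFrame API, as in the question's payment.groupBy("customer"), the same counts can be computed with conditional aggregation. This is a minimal sketch, assuming a payment Dataset with customer and state columns (the nb_* column names are taken from the question):

     import org.apache.spark.sql.Dataset;
     import org.apache.spark.sql.Row;
     import static org.apache.spark.sql.functions.col;
     import static org.apache.spark.sql.functions.count;
     import static org.apache.spark.sql.functions.when;

     // count(when(cond, 1)) counts only the matching rows:
     // when(...) without otherwise(...) returns null for non-matching rows,
     // and count() ignores nulls
     Dataset<Row> counts = payment.groupBy("customer")
         .agg(
             count(when(col("state").equalTo("done"), 1)).alias("nb_done"),
             count(when(col("state").equalTo("doing"), 1)).alias("nb_doing"),
             count(when(col("state").equalTo("cancelled"), 1)).alias("nb_cancelled"));
     counts.show();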