I am trying to convert a Pig script into a Spark 2 routine.
Within a groupBy, I want to count the number of elements that match particular states. The Pig code looks like this:
A = foreach (group payment by customer) {
done = filter payment by state == 'done';
doing = filter payment by state == 'doing';
cancelled = filter payment by state == 'cancelled';
generate group as customer, COUNT(done) as nb_done, COUNT(doing) as nb_doing, COUNT(cancelled) as nb_cancelled;
};
I would like to adapt this to a DataFrame, starting from payment.groupBy("customer").
Thanks!
Try something similar:
Assume the customer table is registered in the Spark session with the schema below:
customer.createOrReplaceTempView("customer");
sparkSession.sql("describe customer").show();
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| string| null|
| state| string| null|
+--------+---------+-------+
// Group using map: map(state, 1) builds a one-entry map keyed by the row's state,
// so state['done'] is 1 for matching rows and null otherwise; count() skips nulls.
sparkSession.sql("select id, count(state['done']) as done, " +
    "count(state['doing']) as doing, " +
    "count(state['cancelled']) as cancelled " +
    "from (select id, map(state, 1) as state from customer) t group by id").show();
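If you would rather stay in the DataFrame API, as the question's payment.groupBy("customer") suggests, here is a minimal sketch using conditional aggregation (assuming payment has customer and state columns; this is an alternative to the SQL above, not the only way):

import static org.apache.spark.sql.functions.*;

// when() without otherwise() yields null for non-matching rows, and count()
// ignores nulls, so each aggregate counts only the rows in the requested state.
payment.groupBy("customer")
    .agg(count(when(col("state").equalTo("done"), 1)).alias("nb_done"),
         count(when(col("state").equalTo("doing"), 1)).alias("nb_doing"),
         count(when(col("state").equalTo("cancelled"), 1)).alias("nb_cancelled"))
    .show();

If the set of states is small and you do not need fixed column names, payment.groupBy("customer").pivot("state").count() also works, producing one count column per distinct state value.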