I am an R programmer moving into the PySpark world. I have a lot of the basic tricks down, but one thing I still struggle with is the kind of task I would normally handle with apply functions or basic for loops.
In this case I am trying to calculate the "anti-groupby" for an ID: look at the population for that ID and the population for every other ID, and put both values on the same row. Getting the population for the ID itself is easy with a groupby followed by a join onto a dataset with new_id as the only column.
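Roughly, that easy half looks like this for me (a sketch only; df1, new_id and total_1 are the names from my data):

from pyspark.sql import functions as F

# per-id population, joined back onto the list of ids
grouped = df1.groupBy("new_id").agg(F.sum("total_1").alias("pop"))
ids_with_pop = df1.select("new_id").distinct().join(grouped, "new_id", "left")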
This is how I would do it in R:
anti_group <- function(id){
  tr <- sum(subset(df1, new_id != id)$total_1)
  to <- sum(subset(df1, new_id != id)$total_2)
  54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group)
How would I do this in PySpark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
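To make this reproducible, the sample data can be built like so (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, 40), (1, 30), (2, 10), (2, 90), (3, 20), (3, 10), (4, 2), (4, 5)],
    ["id", "value"],
)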
Then I want some function that creates a final DataFrame that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
So there's no built-in function that replicates that kind of "anti" grouping, but you can easily do it by creating a new column with a case expression (when/otherwise) to mark your group and anti-group, and then running groupBy on that new column:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F

# tag each row as id 1 vs. everything else, then sum the values on each side
df.withColumn("anti_id_1", F.when(F.col("id") == 1, F.lit('1')).otherwise(F.lit('Not_1')))\
  .groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE: to get every id and its anti-grouped total in one DataFrame, you can use window functions:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w1 = Window().partitionBy("id")  # per-id window
w = Window().partitionBy()       # window over the whole DataFrame

df.withColumn("grouped_total", F.sum("value").over(w1))\
  .withColumn("anti_grouped_total", F.sum("value").over(w) - F.col("grouped_total"))\
  .groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),\
                     F.first("anti_grouped_total").alias("anti_grouped_total"))\
  .drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
A less verbose (more concise) way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy()

df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
  .withColumn("anti_grouped_total", F.sum("grouped_total").over(w) - F.col("grouped_total"))\
  .orderBy("id").show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window().partitionBy()

df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"), F.sum("value2").alias("grouped_total_2"))\
  .withColumn("anti_grouped_total_1", F.sum("grouped_total_1").over(w) - F.col("grouped_total_1"))\
  .withColumn("anti_grouped_total_2", F.sum("grouped_total_2").over(w) - F.col("grouped_total_2"))\
  .orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+
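And if you also need the ratio from your original R function (54 * tr / to over the anti-group totals), it is just one more column on top of that result. A sketch, assuming value1/value2 stand in for total_1/total_2, with the new column named other_RP54 to mirror other.RP54 from the R code:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window().partitionBy()

# grouped and anti-grouped totals as above, plus the 54 * tr / to ratio
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"), F.sum("value2").alias("grouped_total_2"))\
  .withColumn("anti_grouped_total_1", F.sum("grouped_total_1").over(w) - F.col("grouped_total_1"))\
  .withColumn("anti_grouped_total_2", F.sum("grouped_total_2").over(w) - F.col("grouped_total_2"))\
  .withColumn("other_RP54", F.lit(54) * F.col("anti_grouped_total_1") / F.col("anti_grouped_total_2"))\
  .orderBy("id").show()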