I'm having a little difficulty further manipulating a PySpark pivot table to get a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to do:
Let's say I have a table as follows:
stock | day | time | hour | price | units | price_x_units |
---|---|---|---|---|---|---|
A | 1 | 1:10 | 1 | 2.1 | 3 | 6.3 |
A | 1 | 1:15 | 1 | 2.0 | 4 | 8.0 |
A | 2 | 1:03 | 1 | 2.2 | 2 | 4.4 |
A | 2 | 2:45 | 2 | 1.0 | 4 | 4.0 |
B | 1 | 2:12 | 2 | 1.2 | 5 | 6.0 |
... | ... | ... | ... | ... | ... | ... |
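In case anyone wants to reproduce this, here is a minimal sketch that builds the sample rows above as a DataFrame (the session setup is just boilerplate to make it self-contained):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The five sample rows from the table above.
df = spark.createDataFrame(
    [("A", 1, "1:10", 1, 2.1, 3, 6.3),
     ("A", 1, "1:15", 1, 2.0, 4, 8.0),
     ("A", 2, "1:03", 1, 2.2, 2, 4.4),
     ("A", 2, "2:45", 2, 1.0, 4, 4.0),
     ("B", 1, "2:12", 2, 1.2, 5, 6.0)],
    ["stock", "day", "time", "hour", "price", "units", "price_x_units"],
)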
So the data above represents stock prices that fluctuate in price during the day, and the number of units bought at specific times during that day. I can pivot this as follows:
pivotdf = df.groupby("stock", "hour").pivot("day").sum("price_x_units", "units")
To give me
(stock,hour) \ day | 1 | 2 | 3 | 4 ... |
---|---|---|---|---|
(A,1) | (14.3, 7) | (4.4, 2) | ... | ... |
(A,2) | | (4.0, 4) | ... | ... |
(B,2) | (6.0, 5) | | | |
... | ... | ... | ... | ... |
However, I want to compute "price_x_units" / "units" for each resulting cell:
(stock,hour) \ day | 1 | 2 | 3 | 4 ... |
---|---|---|---|---|
(A,1) | 2.04 | 2.2 | ... | ... |
(A,2) | | 1.0 | ... | ... |
(B,2) | 1.2 | | | |
... | ... | ... | ... | ... |
Further to this, I then want to aggregate (sum) away the stock to give me:
hour\day | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 2.04 | 2.2 | | |
2 | 1.2 | 1.0 | | |
How exactly do I do this? Thanks.
After the pivot, you could compute the division for each day like this:
from pyspark.sql import functions as F

# After a pivot with multiple aggregations, Spark names the columns
# "<day>_sum(price_x_units)" and "<day>_sum(units)", so the day can be
# recovered from each name's prefix; columns[2:] skips "stock" and "hour".
days = sorted({c.split('_')[0] for c in pivotdf.columns[2:]}, key=int)
# You could also just use range(1, 32) if you are sure all days are present.

pivotdf.select(
    'stock', 'hour',
    *[(F.col(d + '_sum(price_x_units)') / F.col(d + '_sum(units)')).alias(d)
      for d in days]
).show()
+-----+----+-----------------+----+
|stock|hour| 1| 2|
+-----+----+-----------------+----+
| A| 1|2.042857142857143| 2.2|
| A| 2| null| 1.0|
| B| 2| 1.2|null|
+-----+----+-----------------+----+
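For the last step in the question (summing away the stock), you can group this result by hour and sum each day column. A minimal sketch, assuming the select above was saved as per_unit_df (a hypothetical name) and reusing the days list from before; F.sum ignores nulls, so missing (stock, hour) cells simply don't contribute:

# per_unit_df: the result of the select above (hypothetical name).
per_unit_df.groupBy('hour').agg(
    *[F.sum(F.col(d)).alias(d) for d in days]
).show()

On the sample data, hour 1 should come out to (2.042857142857143, 2.2) and hour 2 to (1.2, 1.0) across days 1 and 2, matching the desired output in the question.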