I am a bit new to Python and I have the following requirement.
Value Date Ticket Source Result
0.939531371 3/1/2018 T1 Source1 1
0.951619897 3/1/2018 T2 Source1 1
0.952681914 3/1/2018 T3 Source1 0
0.957009407 3/1/2018 T4 Source2 1
0.962669466 3/1/2018 T5 Source2 0
0.963068552 3/1/2018 T6 Source3 1
0.963480195 3/1/2018 T7 Source4 1
0.951296258 3/2/2018 T11 Source5 1
0.962434762 3/2/2018 T12 Source5 1
0.950224149 3/2/2018 T13 Source5 1
0.961191873 3/2/2018 T14 Source5 1
0.952584896 3/2/2018 T15 Source5 0
0.962093595 3/2/2018 T16 Source5 1
0.975999737 3/2/2018 T17 Source5 1
From this data frame I wanted to generate the below dataframe-
Date Source1 Source2 Source3 Source4 Source5 Overall_Result
3/1/2018 0.93 0.95 0.96 0.96 0 0.7
3/2/2018 0 0 0 0 0.95 0.85
The calculation involved here is of weighted average-
Source1 - For a given Date and for a given Source, we need to obtain the total number of tickets. For Source1, there are 3 records (tickets) on a given date(3/1/2018). The "Value" column for these tickets have to be sorted in ascending order. Then based on the count of tickets, the highest weightage has to be given to the least "Value"
0.945=(0.939531370722655*3) +(0.951619897215127*2)+(1*0.952681914218488)/3+2+1
Overall Result column is calculated as for a given date,
how many were 1s divided by overall count of tickets for that date Date - 3/1/2018 => 1+1+0+1+0+1+1+0+1+1/10=0.66
I have huge amount of data for which these calculations has to be done. The number of Source column values can also be immense. In modified dataframe, I want it as a column. One way to do is to write the logic in a function and call upon every record. Any suggestions or help are welcome. Thanks in advance.
You can use pivot_table
with a custom aggregation function to get the first columns. And groupby
to add the 'Result' column.
import numpy as np
import pandas as pd
df2 = df.sort_values('Value').pivot_table(
index='Date',
columns='Source',
values='Value',
aggfunc = lambda x: (x*np.arange(len(x), 0, -1)).sum()/np.arange(len(x), 0, -1).cumsum()[-1]).fillna(0)
df2['Result'] = df.groupby('Date').Result.apply(lambda x: x.sum()/np.size(x))
Source Source1 Source2 Source3 Source4 Source5 Result
Date
3/1/2018 0.945753 0.958896 0.963069 0.96348 0.000000 0.714286
3/2/2018 0.000000 0.000000 0.000000 0.00000 0.955507 0.857143