Search code examples
pythonpython-3.xpandasdataframeweighted-average

Obtain data in the desired format


I am a bit new to Python and I have the following requirement.

Value         Date  Ticket  Source  Result
0.939531371 3/1/2018    T1  Source1  1
0.951619897 3/1/2018    T2  Source1  1
0.952681914 3/1/2018    T3  Source1  0
0.957009407 3/1/2018    T4  Source2  1
0.962669466 3/1/2018    T5  Source2  0 
0.963068552 3/1/2018    T6  Source3  1
0.963480195 3/1/2018    T7  Source4  1
0.951296258 3/2/2018    T11 Source5  1
0.962434762 3/2/2018    T12 Source5  1
0.950224149 3/2/2018    T13 Source5  1
0.961191873 3/2/2018    T14 Source5  1
0.952584896 3/2/2018    T15 Source5  0
0.962093595 3/2/2018    T16 Source5  1
0.975999737 3/2/2018    T17 Source5  1

From this data frame I wanted to generate the below dataframe-

Date      Source1 Source2 Source3 Source4 Source5 Overall_Result
3/1/2018   0.93     0.95   0.96     0.96      0        0.7
3/2/2018     0        0      0        0      0.95      0.85

The calculation involved here is of weighted average-

Source1 - For a given Date and for a given Source, we need to obtain the total number of tickets. For Source1, there are 3 records (tickets) on a given date(3/1/2018). The "Value" column for these tickets have to be sorted in ascending order. Then based on the count of tickets, the highest weightage has to be given to the least "Value"

0.945=(0.939531370722655*3) +(0.951619897215127*2)+(1*0.952681914218488)/3+2+1

Overall Result column is calculated as for a given date,

how many were 1s divided by overall count of tickets for that date Date - 3/1/2018 => 1+1+0+1+0+1+1+0+1+1/10=0.66

I have huge amount of data for which these calculations has to be done. The number of Source column values can also be immense. In modified dataframe, I want it as a column. One way to do is to write the logic in a function and call upon every record. Any suggestions or help are welcome. Thanks in advance.


Solution

  • You can use pivot_table with a custom aggregation function to get the first columns. And groupby to add the 'Result' column.

    import numpy as np
    import pandas as pd
    
    df2 = df.sort_values('Value').pivot_table(
            index='Date', 
            columns='Source',
            values='Value',
            aggfunc = lambda x: (x*np.arange(len(x), 0, -1)).sum()/np.arange(len(x), 0, -1).cumsum()[-1]).fillna(0)
    
    df2['Result'] = df.groupby('Date').Result.apply(lambda x: x.sum()/np.size(x))
    

    Output:

    Source     Source1   Source2   Source3  Source4   Source5    Result
    Date                                                               
    3/1/2018  0.945753  0.958896  0.963069  0.96348  0.000000  0.714286
    3/2/2018  0.000000  0.000000  0.000000  0.00000  0.955507  0.857143