I have a dataframe (df) with 10 columns. The index has many different dates, however there are multiple identical dates (and it is sorted by date). Additionally the important columns for this problem are df['Weight'] and df['Price'].
Here is an example of the data with 2 columns for only 1 index value (1/21/2017), in reality there are multiple dates with multiple weights etc.
Weight Price
1/21/2017 0.1 12
1/21/2017 0.04 21
1/21/2017 0.03 13
1/21/2017 0.02 22
1/21/2017 0.2 27
1/21/2017 0.001 30
1/21/2017 0.1 34
1/21/2017 0.21 21
1/21/2017 0.003 12
1/21/2017 0.01 32
1/21/2017 0.04 21
1/21/2017 0.005 12
1/21/2017 0.05 10
1/21/2017 0.1 3
1/21/2017 0.091 24
The weights add up to 1 for a particular index, and this is true for every unique date of the index.
Now I am trying to create a column df['adjusted weight'] which will be based off the percent rank of each value of Price for a particular date.
The price which is in the middle 80% relative to other prices for a particular date, the df['adjusted weight'] would be equal to df['Weight'].
For the bottom 10% of Prices for a particular date, df['Adjusted weight'], would be df['Weight'] / 2 .
For the top 10% of Prices for any date, We need to adjust the Adjusted Weight so that the new df[adjusted weight'] for that date would now equal 1.
This would be done by dividing all the weights for where Price is in top 10% for a date by the current sum of weight of the top 10% Prices, and then multiplying by this by that number itself + the sum of the weight which was chopped off from the bottom 10%, to make it 1 again.
Here is the output I desire:
Weight Price Percent rank Adjusted Weight
1/21/2017 0.1 12 0.142 0.1
1/21/2017 0.04 21 0.428 0.04
1/21/2017 0.03 13 0.357 0.03
1/21/2017 0.02 22 0.642 0.02
1/21/2017 0.2 27 0.785 0.2
1/21/2017 0.001 30 0.857 0.001
1/21/2017 0.1 34 1 **0.168181818**
1/21/2017 0.21 21 0.428 0.21
1/21/2017 0.003 12 0.142 0.003
1/21/2017 0.01 32 0.928 **0.016818182**
1/21/2017 0.04 21 0.428 0.04
1/21/2017 0.005 12 0.142 0.005
1/21/2017 0.05 10 0.071 **0.025**
1/21/2017 0.1 3 0 **0.05**
1/21/2017 0.091 24 0.714 0.091
I have put the ** around the values which have changed, and coincedentally there are 2 values at bottom 10% pct rank and 2 values at top 10% rank. For the bottom 2 values, its just weight/2 , and for top 2 I wrote this formula in excel, =E8/(E11+E8)*(E11+E8+E15+E14-(H15+H14)) where tha table spans from D1 to H16.
Its a pretty tricky problem, but I hope to have asked in reasonable manner. If somebody could advise and help I would very much appreciate it. Thank you for reading and love to hear the ideas which I get. Additionally the dataset is large, so I am not sure how long it would take as it would need to do the same work for each date.
I would like for the adjusted weight to be a new column which exist alongside the original weights please. Additionally something which would work with other columns as well as price would be great.
Thanks all so much for the help and support.
Best wishes.
def adjust(df):
df = df.copy()
b, t = df.Price.quantile([.1, .9])
ltb = df.Price.lt(b)
get = df.Price.ge(t)
bsum = df.Weight[ltb].sum()
tsum = df.Weight[get].sum()
df.loc[ltb, 'Weight'] /= 2
df.loc[get, 'Weight'] *= (bsum / 2 + tsum) / tsum
return df
df.groupby(level=0).apply(adjust).reset_index(0, drop=True)
Weight Price
2017-01-21 0.100000 12
2017-01-21 0.040000 21
2017-01-21 0.030000 13
2017-01-21 0.020000 22
2017-01-21 0.200000 27
2017-01-21 0.001000 30
2017-01-21 0.168182 34
2017-01-21 0.210000 21
2017-01-21 0.003000 12
2017-01-21 0.016818 32
2017-01-21 0.040000 21
2017-01-21 0.005000 12
2017-01-21 0.025000 10
2017-01-21 0.050000 3
2017-01-21 0.091000 24