I have a sample dataset which contains id and costs in diff years as the one below:
Id | 2015-04 | 2015-05 | 2015-06 | 2015-07 | 2016-04 | 2016-05 | 2016-06 | 2016-07 | 2017-04 | 2017-05 | 2017-06 | 2017-07 | 2018-04 | 2018-05 | 2018-06 | 2018-07 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 58500 | 58500 | 58300 | 57800 | 57500 | 57700 | 57800 | 57800 | 57800 | 57900 | 58400 | 59000 | 59500 | 59500 | 59000 | 58500 |
11 | 104600 | 104600 | 105700 | 106100 | 106300 | 107300 | 108000 | 107600 | 107800 | 108300 | 109200 | 109600 | 109300 | 108700 | 109000 | 110700 |
12 | 104900 | 106700 | 107900 | 107500 | 106100 | 105200 | 105700 | 106400 | 106700 | 107100 | 107200 | 107100 | 107500 | 108300 | 109200 | 110500 |
13 | 50500 | 49600 | 48900 | 48400 | 48100 | 48000 | 47700 | 47500 | 47400 | 47600 | 47800 | 47800 | 47600 | 47600 | 48100 | 48400 |
14 | 49800 | 49900 | 50300 | 50800 | 51100 | 51200 | 51200 | 51400 | 51600 | 51900 | 52400 | 52600 | 52300 | 51800 | 51100 | 50900 |
How can I create a function in Python to find the median cost of each year belonging to their respective id? I want the function to be dynamic in terms of the start and end year so that if new data comes for different years, the code will calculate the changes accordingly. For example, if new data comes for 2019, the end date would automatically be considered as 2019 instead of 2018 and calculate its median respectively.
With the current data sample given above, the result should look something like one below:
Id | 2015 | 2016 | 2017 | 2018 |
---|---|---|---|---|
10 | 58400 | 57750 | 58150 | 59250 |
11 | 105150 | 107450 | 108750 | 109150 |
12 | 107100 | 105900 | 107100 | 108750 |
13 | 49250 | 47850 | 47700 | 47850 |
14 | 50100 | 51200 | 52150 | 51450 |
First we split the column names on -
and get only the year. Then we groupby over axis=1
based on these years and take the median:
df = df.set_index("Id")
df = df.groupby(df.columns.str.split("-").str[0], axis=1).median().reset_index()
# or get first 4 characters
# df = df.groupby(df.columns.str[:4], axis=1).median().reset_index()
Id 2015 2016 2017 2018
0 10 58400 57750 58150 59250
1 11 105150 107450 108750 109150
2 12 107100 105900 107100 108750
3 13 49250 47850 47700 47850
4 14 50100 51200 52150 51450