I have the following data
import numpy as np
import pandas as pd

data = {
'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823, 25.7506, 25.8287, np.nan, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
}
which becomes this dataframe after pd.DataFrame(data):
Subject Day Value
0 3 1 19.0959
1 3 2 19.2321
2 3 3 19.3088
3 3 4 19.2589
4 3 5 19.3085
5 3 6 19.0455
6 3 7 19.3491
7 3 8 NaN
8 3 9 19.1823
9 10 1 25.7506
10 10 2 25.8287
11 10 3 NaN
12 10 4 26.2913
13 10 5 NaN
14 10 6 26.1501
15 10 7 25.9447
16 10 8 25.9493
17 10 9 25.9629
I have attempted to interpolate the missing data in the 'Value' column, but when a much larger dataframe with many more subjects is involved, the data is not interpolated properly within the groups.
For example:
df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())
works for small datasets like the one shown, but when I apply the same code to a dataframe with 1,000 different subjects, each with 9 days of data values, there are instances (for example subject 10) where day 5 still remains NaN after interpolation. Any advice on this? Thanks!
The default for interpolate's limit_direction parameter is 'forward'. That means a NaN is only filled when a valid value precedes it, so if the first element in a series is NaN, it won't be interpolated.
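A minimal Series makes this behaviour easy to see (a small illustration, not the question's data):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Default limit_direction='forward': only NaNs that have a valid value
# before them are filled, so the leading NaN at index 0 survives.
print(s.interpolate())  # index 0 stays NaN, index 2 becomes 3.0

# limit_direction='both' also fills the leading NaN (backwards from
# the first valid value).
print(s.interpolate(limit_direction='both'))  # index 0 becomes 2.0
```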
Given a slightly modified dataframe:
>>> df
Subject Day Value
0 3 1 19.0959
1 3 2 19.2321
2 3 3 19.3088
3 3 4 19.2589
4 3 5 19.3085
5 3 6 19.0455
6 3 7 19.3491
7 3 8 NaN
8 3 9 19.1823
9 10 1 NaN
10 10 2 25.8287
11 10 3 NaN
12 10 4 26.2913
13 10 5 NaN
14 10 6 26.1501
15 10 7 25.9447
16 10 8 25.9493
17 10 9 25.9629
>>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())
>>> df
Subject Day Value
0 3 1 19.0959
1 3 2 19.2321
2 3 3 19.3088
3 3 4 19.2589
4 3 5 19.3085
5 3 6 19.0455
6 3 7 19.3491
7 3 8 19.2657
8 3 9 19.1823
9 10 1 NaN
10 10 2 25.8287
11 10 3 26.0600
12 10 4 26.2913
13 10 5 26.2207
14 10 6 26.1501
15 10 7 25.9447
16 10 8 25.9493
17 10 9 25.9629
If we repeat the exercise with the above dataframe and use limit_direction='both', we get:
[...dataframe omitted...]
>>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
>>> df
Subject Day Value
0 3 1 19.0959
1 3 2 19.2321
2 3 3 19.3088
3 3 4 19.2589
4 3 5 19.3085
5 3 6 19.0455
6 3 7 19.3491
7 3 8 19.2657
8 3 9 19.1823
9 10 1 25.8287
10 10 2 25.8287
11 10 3 26.0600
12 10 4 26.2913
13 10 5 26.2207
14 10 6 26.1501
15 10 7 25.9447
16 10 8 25.9493
17 10 9 25.9629
See the Pandas interpolate documentation for details.
---- Addendum ----
There has been a follow-up question hinting that the data may not always be sorted by day. In that case, just sort it before grouping, as shown in the following sample (note the reversed day order for Subject 10):
>>> import numpy as np
>>> data2 = {
...     'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
...     'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 1, 3, 4, 5, 6, 7, 8, 9],
...     'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823, np.nan,
...               25.8287, 26.0, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
... }
>>> df2 = pd.DataFrame(data2)
>>> df2['Value'] = df2.sort_values(by=['Subject', 'Day']).groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
>>> df2
>>> df2
Subject Day Value
0 3 1 19.09590
1 3 2 19.23210
2 3 3 19.30880
3 3 4 19.25890
4 3 5 19.30850
5 3 6 19.04550
6 3 7 19.34910
7 3 8 19.26570
8 3 9 19.18230
9 10 2 25.91435
10 10 1 25.82870
11 10 3 26.00000
12 10 4 26.29130
13 10 5 26.22070
14 10 6 26.15010
15 10 7 25.94470
16 10 8 25.94930
17 10 9 25.96290
Why does this work? Let's have a look at the following sample:
>>> list(df2.sort_values(by=['Subject', 'Day']).groupby('Subject')['Value'])
[('10', 10 25.82870
9 25.91435
11 26.00000
12 26.29130
13 26.22070
14 26.15010
15 25.94470
16 25.94930
17 25.96290
Name: Value, dtype: float64), ('3', 0 19.0959
1 19.2321
2 19.3088
3 19.2589
4 19.3085
5 19.0455
6 19.3491
7 19.2657
8 19.1823
Name: Value, dtype: float64)]
Here we have the above dataframe sorted by Subject and Day. You can see the groups ('10' and '3'), and next to the values you can see the index of the original dataframe (25.82870 corresponds to index 10 of the original dataframe, and so on). That index is what aligns the interpolated values back to the original dataframe in the df2['Value'] = assignment above.
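This index alignment can be checked directly (a small sketch with toy values, not from the answer above):

```python
import pandas as pd

# A Series whose index labels are deliberately out of order.
s = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

df = pd.DataFrame({'v': [0.0, 0.0, 0.0]})  # index 0, 1, 2

# Column assignment aligns on index labels, not on position.
df['v'] = s
print(df)  # row 0 gets 20.0, row 1 gets 30.0, row 2 gets 10.0
```

This is why the sorted, group-interpolated series lands in the right rows of the unsorted dataframe.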
A different question is whether, since you are sorting the large table anyway, you want to keep the newly sorted order and implement this slightly differently, but that depends on your exact use case.
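If keeping the sorted order is acceptable, one way to do that (a sketch using the sample data above; resetting the index after the sort is my own choice, not from the answer) is:

```python
import numpy as np
import pandas as pd

data2 = {
    'Subject': ['3'] * 9 + ['10'] * 9,
    'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 1, 3, 4, 5, 6, 7, 8, 9],
    'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455,
              19.3491, np.nan, 19.1823, np.nan, 25.8287, 26.0,
              26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629],
}
df2 = pd.DataFrame(data2)

# Sort once and keep that order; reset_index gives a clean 0..n-1 index.
df2 = df2.sort_values(by=['Subject', 'Day']).reset_index(drop=True)
df2['Value'] = (df2.groupby('Subject')['Value']
                   .transform(lambda g: g.interpolate(limit_direction='both')))
print(df2)
```

The result is the same interpolation, but the dataframe stays sorted by Subject and Day afterwards.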