
Pandas interpolation groupby with 3 columns


I have the following data

import numpy as np
import pandas as pd

data = {
    'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
    'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823,
              25.7506, 25.8287, np.nan, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
}

which becomes this dataframe after pd.DataFrame(data):

   Subject   Day    Value
0        3     1  19.0959
1        3     2  19.2321
2        3     3  19.3088
3        3     4  19.2589
4        3     5  19.3085
5        3     6  19.0455
6        3     7  19.3491
7        3     8      NaN
8        3     9  19.1823
9       10     1  25.7506
10      10     2  25.8287
11      10     3      NaN
12      10     4  26.2913
13      10     5      NaN
14      10     6  26.1501
15      10     7  25.9447
16      10     8  25.9493
17      10     9  25.9629

I have attempted to interpolate the missing data in the 'Value' column, but with this three-column dataframe the data does not seem to be interpolated properly within the groups once a much larger dataframe with many more subjects is involved.

For example:

df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())

works for small datasets like the one shown, but when I apply the same code to a dataframe with 1,000 different subjects, each with 9 days of data values, there are instances for subject 10 where day 5 still remains NaN after interpolation. Any advice on this? Thanks!
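
A quick sanity check to list which subjects and days are still NaN after the line above (just a sketch; 'remaining' is an arbitrary name):

# (Subject, list of Days) pairs whose Value is still NaN after interpolation
remaining = df[df['Value'].isna()].groupby('Subject')['Day'].apply(list)
print(remaining)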


Solution

  • The default for the limit_direction parameter is 'forward'. That means that if the first element in a series is NaN, it won't be interpolated: forward interpolation can only fill a NaN that comes after at least one known value.

    Given a slightly modified dataframe:

    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8      NaN
    8        3    9  19.1823
    9       10    1      NaN
    10      10    2  25.8287
    11      10    3      NaN
    12      10    4  26.2913
    13      10    5      NaN
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())
    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8  19.2657
    8        3    9  19.1823
    9       10    1      NaN
    10      10    2  25.8287
    11      10    3  26.0600
    12      10    4  26.2913
    13      10    5  26.2207
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    

    If we repeat the exercise with the above dataframe and use limit_direction='both', we get:

    [...dataframe omitted...]
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8  19.2657
    8        3    9  19.1823
    9       10    1  25.8287
    10      10    2  25.8287
    11      10    3  26.0600
    12      10    4  26.2913
    13      10    5  26.2207
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    

    See the Pandas interpolate documentation for details.
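
    As a self-contained sanity check (my own sketch, not part of the original answer), the following rebuilds the modified dataframe above with explicit np.nan placeholders and confirms that limit_direction='both' leaves no gaps in either group:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     'Subject': ['3'] * 9 + ['10'] * 9,
    ...     'Day': list(range(1, 10)) * 2,
    ...     'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823,
    ...               np.nan, 25.8287, np.nan, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629],
    ... })
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(
    ...     lambda group: group.interpolate(limit_direction='both'))
    >>> print(df['Value'].isna().sum())   # no NaN left in any group
    0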

    ---- Addendum ----

    A follow-up question hinted that the data may not always be sorted by day. In that case, just sort it before grouping, as shown in the following sample (note the reversed day order for Subject 10):

    >>> data2 = {
    ...     'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
    ...     'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 1, 3, 4, 5, 6, 7, 8, 9],
    ...     'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823, np.nan,
    ...               25.8287, 26.0, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
    ... }
    >>> df2 = pd.DataFrame(data2)
    >>> df2['Value'] = df2.sort_values(axis=0, by=['Subject', 'Day']).groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
    >>> df2
       Subject  Day     Value
    0        3    1  19.09590
    1        3    2  19.23210
    2        3    3  19.30880
    3        3    4  19.25890
    4        3    5  19.30850
    5        3    6  19.04550
    6        3    7  19.34910
    7        3    8  19.26570
    8        3    9  19.18230
    9       10    2  25.91435
    10      10    1  25.82870
    11      10    3  26.00000
    12      10    4  26.29130
    13      10    5  26.22070
    14      10    6  26.15010
    15      10    7  25.94470
    16      10    8  25.94930
    17      10    9  25.96290
    

    Why does this work? Let's have a look at the following sample:

    >>> list(df2.sort_values(axis=0, by=['Subject', 'Day']).groupby('Subject')['Value'])
    [('10', 10    25.82870
    9     25.91435
    11    26.00000
    12    26.29130
    13    26.22070
    14    26.15010
    15    25.94470
    16    25.94930
    17    25.96290
    Name: Value, dtype: float64), ('3', 0    19.0959
    1    19.2321
    2    19.3088
    3    19.2589
    4    19.3085
    5    19.0455
    6    19.3491
    7    19.2657
    8    19.1823
    Name: Value, dtype: float64)]
    

    Here we have the above dataframe sorted by Subject and Day. You can see the groups ('10' and '3') and, to the left of the values, the index of the original dataframe (25.82870 corresponds to index 10 of the original dataframe, and so on). Pandas uses this index to align the interpolated values back to the original dataframe in the df2['Value'] = ... assignment above.
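
    The alignment itself is ordinary pandas behavior when a Series is assigned to a column: values are matched by index label, not by position. A tiny, unrelated illustration:

    >>> s = pd.Series([1.0, 2.0, 3.0], index=[2, 0, 1])
    >>> out = pd.DataFrame({'x': [10.0, 20.0, 30.0]})   # default index 0, 1, 2
    >>> out['x'] = s                                    # aligned by index label
    >>> out['x'].tolist()
    [2.0, 3.0, 1.0]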

    A separate question is whether, since you are sorting the large table anyway, you want to keep it in the newly sorted order and hence implement this slightly differently, but that depends on your exact use case...
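
    If you do want to keep the table in the sorted order, one option (a sketch, not from the original answer) is to sort and reset the index first and then interpolate in place, instead of the df2['Value'] = ... assignment above:

    >>> df2 = df2.sort_values(by=['Subject', 'Day']).reset_index(drop=True)
    >>> df2['Value'] = df2.groupby('Subject')['Value'].transform(
    ...     lambda group: group.interpolate(limit_direction='both'))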