
Pandas interpolation groupby with 3 columns


I have the following data

import numpy as np
import pandas as pd

data = {
    'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
    'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823,
              25.7506, 25.8287, np.nan, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
}

which becomes this dataframe after pd.DataFrame(data):

   Subject   Day    Value
0        3     1  19.0959
1        3     2  19.2321
2        3     3  19.3088
3        3     4  19.2589
4        3     5  19.3085
5        3     6  19.0455
6        3     7  19.3491
7        3     8      NaN
8        3     9  19.1823
9       10     1  25.7506
10      10     2  25.8287
11      10     3      NaN
12      10     4  26.2913
13      10     5      NaN
14      10     6  26.1501
15      10     7  25.9447
16      10     8  25.9493
17      10     9  25.9629

I have attempted to interpolate the missing data in the 'Value' column, but with this three-column dataframe the data does not seem to be interpolated properly within the groups once a much larger dataframe with many more subjects is involved.

For example:

df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())

works for small datasets like the one shown, but when I apply the same code to a dataframe with 1,000 different subjects, each with 9 days of data values, there are instances for subject 10 where day 5 still remains NaN after interpolation. Any advice on this? Thanks!
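
A quick sanity check to list which subjects and days are still NaN after the line above (just a sketch; 'remaining' is an arbitrary name):

# (Subject, list of Days) pairs whose Value is still NaN after interpolation
remaining = df[df['Value'].isna()].groupby('Subject')['Day'].apply(list)
print(remaining)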


Solution

  • The default for the limit_direction parameter is 'forward'. That means that if the first element in a series is NaN, it won't be interpolated: forward interpolation can only fill a NaN that comes after at least one known value.

    Given a slightly modified dataframe:

    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8      NaN
    8        3    9  19.1823
    9       10    1      NaN
    10      10    2  25.8287
    11      10    3      NaN
    12      10    4  26.2913
    13      10    5      NaN
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate())
    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8  19.2657
    8        3    9  19.1823
    9       10    1      NaN
    10      10    2  25.8287
    11      10    3  26.0600
    12      10    4  26.2913
    13      10    5  26.2207
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    

    If we repeat the exercise with the above dataframe and use limit_direction='both', we get:

    [...dataframe omitted...]
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
    >>> df
       Subject  Day    Value
    0        3    1  19.0959
    1        3    2  19.2321
    2        3    3  19.3088
    3        3    4  19.2589
    4        3    5  19.3085
    5        3    6  19.0455
    6        3    7  19.3491
    7        3    8  19.2657
    8        3    9  19.1823
    9       10    1  25.8287
    10      10    2  25.8287
    11      10    3  26.0600
    12      10    4  26.2913
    13      10    5  26.2207
    14      10    6  26.1501
    15      10    7  25.9447
    16      10    8  25.9493
    17      10    9  25.9629
    

    See the Pandas interpolate documentation for details.
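
    As a self-contained sanity check (my own sketch, not part of the original answer), the following rebuilds the modified dataframe above with explicit np.nan placeholders and confirms that limit_direction='both' leaves no gaps in either group:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     'Subject': ['3'] * 9 + ['10'] * 9,
    ...     'Day': list(range(1, 10)) * 2,
    ...     'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823,
    ...               np.nan, 25.8287, np.nan, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629],
    ... })
    >>> df['Value'] = df.groupby('Subject')['Value'].transform(
    ...     lambda group: group.interpolate(limit_direction='both'))
    >>> print(df['Value'].isna().sum())   # no NaN left in any group
    0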

    ---- Addendum ----

    A follow-up question hinted that the data may not always be sorted by day. In that case, just sort it before grouping, as shown in the following sample (note the reversed day order for Subject 10):

    >>> data2 = {
    ...     'Subject': ['3','3','3','3','3','3','3','3','3','10','10','10','10','10','10','10','10','10'],
    ...     'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 1, 3, 4, 5, 6, 7, 8, 9],
    ...     'Value': [19.0959, 19.2321, 19.3088, 19.2589, 19.3085, 19.0455, 19.3491, np.nan, 19.1823, np.nan,
    ...               25.8287, 26.0, 26.2913, np.nan, 26.1501, 25.9447, 25.9493, 25.9629]
    ... }
    >>> df2 = pd.DataFrame(data2)
    >>> df2['Value'] = df2.sort_values(axis=0, by=['Subject', 'Day']).groupby('Subject')['Value'].transform(lambda group: group.interpolate(limit_direction='both'))
    >>> df2
       Subject  Day     Value
    0        3    1  19.09590
    1        3    2  19.23210
    2        3    3  19.30880
    3        3    4  19.25890
    4        3    5  19.30850
    5        3    6  19.04550
    6        3    7  19.34910
    7        3    8  19.26570
    8        3    9  19.18230
    9       10    2  25.91435
    10      10    1  25.82870
    11      10    3  26.00000
    12      10    4  26.29130
    13      10    5  26.22070
    14      10    6  26.15010
    15      10    7  25.94470
    16      10    8  25.94930
    17      10    9  25.96290
    

    Why does this work? Let's have a look at the following sample:

    >>> list(df2.sort_values(axis=0, by=['Subject', 'Day']).groupby('Subject')['Value'])
    [('10', 10    25.82870
    9     25.91435
    11    26.00000
    12    26.29130
    13    26.22070
    14    26.15010
    15    25.94470
    16    25.94930
    17    25.96290
    Name: Value, dtype: float64), ('3', 0    19.0959
    1    19.2321
    2    19.3088
    3    19.2589
    4    19.3085
    5    19.0455
    6    19.3491
    7    19.2657
    8    19.1823
    Name: Value, dtype: float64)]
    

    Here we have the above dataframe sorted by Subject and Day. You can see the groups ('10' and '3') and, to the left of the values, the index of the original dataframe (25.82870 corresponds to index 10 of the original dataframe, and so on). Pandas uses this index to align the interpolated values back to the original dataframe in the df2['Value'] = ... assignment above.
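
    The alignment itself is ordinary pandas behavior when a Series is assigned to a column: values are matched by index label, not by position. A tiny, unrelated illustration:

    >>> s = pd.Series([1.0, 2.0, 3.0], index=[2, 0, 1])
    >>> out = pd.DataFrame({'x': [10.0, 20.0, 30.0]})   # default index 0, 1, 2
    >>> out['x'] = s                                    # aligned by index label
    >>> out['x'].tolist()
    [2.0, 3.0, 1.0]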

    A separate question is whether, since you are sorting the large table anyway, you want to keep it in the newly sorted order and hence implement this slightly differently, but that depends on your exact use case...
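
    If you do want to keep the table in the sorted order, one option (a sketch, not from the original answer) is to sort and reset the index first and then interpolate in place, instead of the df2['Value'] = ... assignment above:

    >>> df2 = df2.sort_values(by=['Subject', 'Day']).reset_index(drop=True)
    >>> df2['Value'] = df2.groupby('Subject')['Value'].transform(
    ...     lambda group: group.interpolate(limit_direction='both'))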