Search code examples
pythonpandasdataframeindexingsubstitution

How to substitute values in a column in a dataframe based on its column name, values in another column and index range?


I have a dataframe with these characteristics (the indexes are float values):

import pandas as pd

d = {'A': [1,2,3,4,5,6,7,8,9,10],
     'B': [1,2,3,4,5,6,7,8,9,10],
     'C': [1,2,3,4,5,6,7,8,9,10],
     'D': ['one','one','one','one','one','two','two','two','two','two']}

df = pd.DataFrame(data=d)
df
        A   B   C    D
50.0    1   1   1  one
50.2    2   2   2  one
50.4    3   3   3  one
50.6    4   4   4  one
50.8    5   5   5  one
51.0    6   6   6  two
51.2    7   7   7  two
51.4    8   8   8  two
51.6    9   9   9  two
51.8   10  10  10  two

And a list of offsets with these values (they are also floats):

offsets = [[0.4, 0.6, 0.8], [0.2, 0.4, 0.6]]

I need to iterate through my dataframe over columns A, B and C, choosing the categorical values from column D, replacing the last values from columns A, B and C by nan according their indexes in relation to the offsets in my list, resulting in a dataframe like this:

        A   B   C    D
50.0    1   1   1  one
50.2    2   2  nan one
50.4    3  nan nan one
50.6   nan nan nan one
50.8   nan nan nan one
51.0    6   6   6  two
51.2    7   7   7  two
51.4    8   8  nan two
51.6    9  nan nan two
51.8   nan nan nan two

The value of the offset means what values must be set to nan from the bottom up. For example: offsets[0][0]=0.4, so for column A when D == 'one', the two values from the bottom up must be set to nan (rows 4 and 3, 50.8-0.4 = 50.4 - 50.4 doesn't change). For A when D == 'two', the offsets[1][0]=0.2, so one value from the bottom up must be set to nan (row 9, 51.8-0.2 = 51.6 - 51.6 doesn't change). Offsets[1][0]=0.6, so for column B when D == 'one', the three values from the bottom up must be set to nan (rows 4, 3 and 2, 50.8-0.6 = 50.2 - 50.2 doesn't change). For B when D == 'two', the offsets[1][1]=0.4, so two values from the bottom up must be set to nan (rows 9 and 8, 51.8-0.4 = 51.4 - 51.4 doesn't change). For column C is the same.

Any idea how to do this? A quick comment - I want to replace these values in the dataframe itself, without creating a new one.


Solution

  • One approach is to use apply to set the last values of each column to NaN:

    import pandas as pd
    
    # toy data
    df = pd.DataFrame(data={'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                            'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                            'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                            'D': ['one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two']})
    offsets = [2, 3, 4]
    offset_lookup = dict(zip(df.columns[:3], offsets))
    
    
    def funny_shift(x, ofs=None):
        """This function shift each column by the given offset in the ofs parameter"""
        for column, offset in ofs.items():
            x.loc[x.index[-1 * offset:], column] = None
        return x
    
    
    df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
    print(df)
    

    Output

         A    B    C    D
    0  1.0  1.0  1.0  one
    1  2.0  2.0  NaN  one
    2  3.0  NaN  NaN  one
    3  NaN  NaN  NaN  one
    4  NaN  NaN  NaN  one
    5  6.0  6.0  6.0  two
    6  7.0  7.0  NaN  two
    7  8.0  NaN  NaN  two
    8  NaN  NaN  NaN  two
    9  NaN  NaN  NaN  two
    

    UPDATE

    If you have multiple updates per group, you could do:

    offsets = [[2, 3, 4], [1, 2, 3]]
    offset_lookup = (dict(zip(df.columns[:3], offset)) for offset in offsets)
    
    
    def funny_shift(x, ofs=None):
        """This function shift each column by the given offset in the ofs parameter"""
        current = next(ofs)
        for column, offset in current.items():
            x.loc[x.index[-1 * offset:], column] = None
        return x
    
    
    df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
    print(df)