Search code examples
pythonpandasdataframeindexingconditional-formatting

Change value in a Pandas dataframe column based on several conditions


What I have is a long Pandas dataframe in Python that contains three columns named 'file', 'comment', and 'number'. A simple example is:

import pandas as pd
df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1', 'file_1_v2', 'file_2', 'file_2', 'file_3', 'file_4', 'file_4_v2', 'file_5']
df_test.comment = ['none: 2', 'old', 'Replacing: file_1', 'v1', 'v2', 'none', 'old', 'Replacing: file_4', 'none']
df_test.number = [12, 12, 12, 13, 13, 13, 14, 14, 15]

Each file should have a unique number associated with it, but it currently has numerous errors where many unique files have been given the same number. There are also files which has the same name but are different versions that should have the number and files which have different names but the comment will show that they are supposed to have the same number as well.

In the example, files that have the same name or has a comment that starts with the string 'Replacing: ' should not have the number changed, but if the file has a different name but the same number as a previous file, I want the number of that file and every subsequent number to increase by one, meaning the end result here should be:

[12, 12, 12, 13, 13, 14, 15, 15, 16]

My idea was to check if each file has the same number as the previous in the list, and if it does, and the name of the file is not the same, and the comment does not start with the string 'Replacing: ', the value of the number and all following numbers will increase by one, but I am not sure how to write this code. Any help is really appreciated, thanks!


Solution

  • You can extract the file name, and fillna, then factorize and add the min:

    df_test['number'] = pd.factorize(df_test['comment']
                                     .str.extract('Replacing: (.*)', expand=False)
                                     .fillna(df_test['file'])
                                    )[0]+df_test['number'].min()
    

    Output:

            file            comment  number
    0     file_1            none: 2      12
    1     file_1                old      12
    2  file_1_v2  Replacing: file_1      12
    3     file_2                 v1      13
    4     file_2                 v2      13
    5     file_3               none      14
    6     file_4                old      15
    7  file_4_v2  Replacing: file_4      15
    8     file_5               none      16