What I have is a long Pandas dataframe in Python that contains three columns named 'file', 'comment', and 'number'. A simple example is:
import pandas as pd
df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1', 'file_1_v2', 'file_2', 'file_2', 'file_3', 'file_4', 'file_4_v2', 'file_5']
df_test.comment = ['none: 2', 'old', 'Replacing: file_1', 'v1', 'v2', 'none', 'old', 'Replacing: file_4', 'none']
df_test.number = [12, 12, 12, 13, 13, 13, 14, 14, 15]
Each file should have a unique number associated with it, but it currently has numerous errors where many unique files have been given the same number. There are also files which has the same name but are different versions that should have the number and files which have different names but the comment will show that they are supposed to have the same number as well.
In the example, files that have the same name or has a comment that starts with the string 'Replacing: ' should not have the number changed, but if the file has a different name but the same number as a previous file, I want the number of that file and every subsequent number to increase by one, meaning the end result here should be:
[12, 12, 12, 13, 13, 14, 15, 15, 16]
My idea was to check if each file has the same number as the previous in the list, and if it does, and the name of the file is not the same, and the comment does not start with the string 'Replacing: ', the value of the number and all following numbers will increase by one, but I am not sure how to write this code. Any help is really appreciated, thanks!
You can extract
the file name, and fillna
, then factorize
and add the min
:
df_test['number'] = pd.factorize(df_test['comment']
.str.extract('Replacing: (.*)', expand=False)
.fillna(df_test['file'])
)[0]+df_test['number'].min()
Output:
file comment number
0 file_1 none: 2 12
1 file_1 old 12
2 file_1_v2 Replacing: file_1 12
3 file_2 v1 13
4 file_2 v2 13
5 file_3 none 14
6 file_4 old 15
7 file_4_v2 Replacing: file_4 15
8 file_5 none 16