Search code examples
pythonpandasloopscomparisonrow

Pandas row by row value comparison to find matching rate between the 2 rows which are string similarity rate is high


I'm struggling for calculating attribute matching rate between the 2 string similarity high item that I matched.

I've tried to 2 variable loops but there were err like this 'IndexError: single positional indexer is out-of-bounds'

The code I was tried was:

nuomlist = pd.DataFrame(dfn.columns, columns=['Col'])
nuomN = nuomlist[nuomlist['Col'].str.contains('-')].index.tolist()

 for i in range(int(nuomN[-1]+1),int(dfn.columns.get_loc("sim_1"))) :
 for j in dfn.index:

  sum(dfn.iloc[j,i]==dfn.iloc[j+dfn.iloc[j,dfn.columns.get_loc('Max_row')],i])/ 
  int(dfn.columns.get_loc("sim_1") - (nuomN[-1] + 1))

This is sample Data set

data = {'S_ITEMCODE':['', '81527800', '', '81527900'],
        'N':['N', '','N', ''],
        'ITEMCODE':['81527800', '81320323', '81527900', '81267337'],
        'DESC':['Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1.5lit', 'Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1.5lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1lit'],
        'ATTR1':['1A', '1A', '1B', '1B'],
        'ATTR2':['1A', '1C', '1B', '1B'],
        'ATTR3':['1A', '1A', '1B', '1B'],
        'ROW_INDEX_SIMILAR_ITEM':[1, -1, 1, 1]}

df = pd.DataFrame(data)

Column 'N' stands for the new item.

I would like to calculate the attribute matching rate for rows where 'N'=='N' between the new item and Jaccard string similarity high item(S_itemcode)

(i.g. 81527800(New item)-81320323, 81527900(New item)-81267337)

This is my desired results.

data1 = {'S_ITEMCODE':['', '81527800', '', '81527900'],
        'N':['N', '','N', ''],
        'ITEMCODE':['81527800', '81320323', '81527900', '81267337'],
        'DESC':['Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1.5lit', 'Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1.5lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1lit'],
        'ATTR1':['1A', '1A', '1B', '1B'],
        'ATTR2':['1A', '1C', '1B', '1B'],
        'ATTR3':['1A', '1A', '1B', '1B'],
        'ROW_INDEX_SIMILAR_ITEM':[1, -1, 1, 1]}
        'ATTR_MATCHING_RATE':[2/3, '', 1, '']}

df = pd.DataFrame(data1)

Please help me... I'm stuck...


Solution

  • This will give you the desired output:

    tested_cols = ['ATTR1', 'ATTR2', 'ATTR3']
    df['matches'] = 0
    for col in tested_cols:
        df.loc[(df['N'] == 'N') & (df[col] == df[col].shift(-1)), 'matches'] += 1
    df['ATTR_MATCHING_RATE'] = df['matches'] / len(tested_cols)
    df.drop('matches', axis=1, inplace=True)