Search code examples
python-3.xdataframeduplicatesuser-defined-functionsdata-cleaning

changing values in data frame based on duplicates - python


I have a quite large data set of over 100k rows with many duplicates and some missing or faulty values. Trying to simplify the problem in the snippet below.

sampleData = {
    'BI Business Name' : ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
    'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
    'BI Telephone' : ['999', '999', '666', np.nan, '12345']    
}
df = pd.DataFrame(sampleData)

I'm trying to change the values based on duplicate rows so if any three fields are matching then the forth one should match as well. I should get outcome like this:

result = {
    'BI Business Name' : ['AAA', 'AAA', 'CCC', 'DDD','DDD'],
    'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', 'www@3'],
    'BI Telephone' : ['999', '999', '666', '12345', '12345']    
}
df = pd.DataFrame(result)

I have found extremely long winded method - here showing just the part for changing the name.

df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone'] 
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode','BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = reference_name['BId Postcode'] + reference_name['BI Website'] + 
reference_name['BI Telephone'] 
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name,duplicate_ref]).drop_duplicates(keep=False)
reference_name

def replace_name(row):
    try:
        old_name = row['BI Business Name']
        reference = row['Phone_code_web']     
        new_name = reference_name[reference_name['Phone_code_web']==reference].iloc[0,0]  
        print(new_name)

        return new_name
    except Exception as e:
        return old_name
df['BI Business Name']=df.apply(replace_name, axis=1)
df

Is there easier way of doing this?


Solution

  • You can try this:

    import pandas as pd
    
    sampleData = {
        'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
        'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
        'BI Website': ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
        'BI Telephone': ['999', '999', '666', np.nan, '12345']
    }
    df = pd.DataFrame(sampleData)
    print(df)
    
    def fill_gaps(_df, _x):  # _df and _x are local variables that represent the dataframe and one of its rows, respectively
        # pd.isnull(_x) = list of Booleans indicating which columns have NaNs
        # df.columns[pd.isnull(_x)] = list of columns whose value is a NaN
        for col in df.columns[pd.isnull(_x)]:
            # len(set(y) & set(_x)) = length of the intersection of the row being considered (_x) and each of the other rows in turn (y)
            # the mask is a list of Booleans which are True if:
            # 1) y[col] is not Null (e.g. for row 3 we need to replace (BI Telephone = NaN) with a non-NaN 'BI Telephone' value)
            # 2) and the length of the intersection above is at least 3 (as required)
            mask = df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
            # if the mask has at least one "True" value, select the value in the corresponding column (if there are several possible values, select the first one)
            _x[col] = df[mask][col].iloc[0] if any(mask) else _x[col]
        return _x
    
    # Apply the logic described above to each row in turn (x = each row)
    df = df.apply(lambda x: fill_gaps(df, x), axis=1)
    
    print(df)
    

    Output:

      BI Business Name BId Postcode BI Website BI Telephone
    0              AAA      NW1 8NZ      www@1          999
    1              BBB      NW1 8NZ      www@1          999
    2              CCC     WC2N 4AA      www@2          666
    3              DDD      CV7 9JY      www@3          NaN
    4              DDD      CV7 9JY        NaN        12345
      BI Business Name BId Postcode BI Website BI Telephone
    0              AAA      NW1 8NZ      www@1          999
    1              BBB      NW1 8NZ      www@1          999
    2              CCC     WC2N 4AA      www@2          666
    3              DDD      CV7 9JY      www@3        12345
    4              DDD      CV7 9JY      www@3        12345