Tags: python, pandas, dataframe, performance, duplicates

How to efficiently reinstate duplicates in pandas


So I have a big pandas dataframe, with lots of duplicate rows, that needs processing. I can remove the duplicates easily enough, but having processed the dataframe, what I then need to do is reinstate the duplicates in the same positions they occupied originally. EDIT: And with the same processing applied as was applied to the row they duplicate.

I am doing this using a dictionary that tracks the structure of the rows in the original dataframe: the keys are the row indices of the unique rows and the values are lists of all occurrences of that row (so a row with no duplicates appears as i: [i], and a row with multiple duplicates appears as i: [i, j, k, ...]).

The code below is a toy example that does the job (obviously in the real application the processing is a bit more complicated than addition and multiplication). However, in the real application the loop that builds the dictionary is very slow, because for each duplicate row, I have to find where the first occurrence of that row is.

I'm sure there must be an efficient way to do this, as DataFrame.duplicated() is really fast and it has to do a similar job. Any thoughts on how to implement this efficiently would be much appreciated. Thanks!

EDIT 2: In the toy example I am just processing the original data, but this is perhaps misleading: I need to apply machine learning code to the de-duplicated dataframe, then reinstate the duplicated rows in the ML output. So (I think) I really need to keep track of where each duplicate originated, rather than just applying some processing to the duplicates and merging them.
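One way to keep that origin information without a Python loop (a sketch, assuming all columns participate in the duplicate check, and using the toy addition/multiplication as a stand-in for the ML step) is to label every row with the group id of its first occurrence via `groupby(...).ngroup()`, then use those ids as positional indices into the processed, de-duplicated frame:

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]],
                  columns=['X', 'Y'])

# group id in order of first appearance; duplicates share the id of the
# row they duplicate, so this records where each row "came from"
origin = df.groupby(list(df.columns), sort=False).ngroup()

dedup = df.drop_duplicates()  # keeps first occurrences, in order of appearance
processed = dedup.assign(X=dedup['X'] + 1, Y=dedup['Y'] * 2)  # stand-in for the ML step

# row i of the output is the processed version of the unique row
# that row i originally duplicated
reinstated = processed.iloc[origin].set_index(df.index)
```

Because `drop_duplicates()` keeps first occurrences in order of appearance and `ngroup(sort=False)` numbers groups in that same order, position k of `processed` is exactly the processed form of group k, so the positional lookup reinstates every duplicate correctly.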

import pandas as pd

dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]], columns=['X', 'Y'])
print(dataframe)

index = dataframe.index
duplicates_mask = dataframe.duplicated()
print(duplicates_mask)
duplicates_structure = {i: [] for i in range(len(dataframe))}
deduplicated_indices = {}
deduplicated_index = 0
for i, is_duplicate in enumerate(duplicates_mask):  # way too slow :(
    if is_duplicate:
        for j in range(i):
            if duplicates_mask[j]:  # j is a duplicate, don't bother checking
                continue
            if (dataframe.iloc[i] == dataframe.iloc[j]).all():
                duplicates_structure[j].append(i)
                break
    else:
        duplicates_structure[i].append(i)
        deduplicated_indices[i] = deduplicated_index
        deduplicated_index += 1
print(duplicates_structure)

# drop the duplicates
dataframe.drop_duplicates(inplace=True)

# do some processing on the data
dataframe['X'] += 1
dataframe['Y'] *= 2
print(dataframe)

# reinstate the structure with duplicates processed
dataframe_reinstated = pd.DataFrame(index=index, columns=['X', 'Y'])
for i in duplicates_structure:
    for j in duplicates_structure[i]:
        dataframe_reinstated.iloc[j] = dataframe.iloc[deduplicated_indices[i]]
print(dataframe_reinstated)

EDIT: Example input:

   X  Y
0  0  1
1  0  0
2  1  0
3  1  1
4  0  0
5  1  1
6  1  2
7  0  0

Output (after adding 1 to column X and multiplying column Y by 2):

   X  Y
0  1  2
1  1  0
2  2  0
3  2  2
4  1  0
5  2  2
6  2  4
7  1  0
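For reference, the two bookkeeping dictionaries in the code above can be built without the quadratic scan. A sketch using `groupby` over all columns: `.groups` preserves the original order of row labels within each group, so the first element of each list is the first occurrence, and the empty-list entries for duplicate indices (which the reinstating loop never uses) are simply omitted.

```python
import pandas as pd

dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]],
                         columns=['X', 'Y'])

# one entry per unique row: first-occurrence index -> all indices of that row
groups = dataframe.groupby(list(dataframe.columns), sort=False).groups
duplicates_structure = {idxs[0]: list(idxs) for idxs in groups.values()}

# first-occurrence index -> position in the deduplicated frame
deduplicated_indices = {idx: pos for pos, idx
                        in enumerate(dataframe.drop_duplicates().index)}
```

These drop-in replacements feed straight into the original reinstating loop, turning the O(n²) dictionary build into a single grouping pass.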

Solution

  • This is the same concept as e-motta's answer,
    but with your full code so you can run and test it:

    import pandas as pd
    
    dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]], columns=['X', 'Y'])
    print(dataframe)
    
    # add a hash key per row so later we can join/merge on it
    dataframe['key'] = pd.util.hash_pandas_object(dataframe, index=False)
    full_dataframe = dataframe.copy()
    # drop the duplicates
    dataframe.drop_duplicates(inplace=True)
    
    # do some processing on the data
    dataframe['X_process'] = dataframe['X'] + 1
    dataframe['Y_process'] = dataframe['Y'] * 2
    dataframe = dataframe.drop(['X', 'Y'], axis=1)
    # join/merge the tables on the key
    full_dataframe_results = pd.merge(full_dataframe, dataframe, on='key')
    # rename the processed columns back to the original names
    full_dataframe_results = full_dataframe_results.drop(['X', 'Y', 'key'], axis=1)
    full_dataframe_results = full_dataframe_results.rename(columns={"X_process": "X", "Y_process": "Y"})
    print(full_dataframe_results)
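A note on the merge step: with the default `how='inner'`, pandas documents that the order of the left keys is preserved, which is why the rows come back in their original order. If you want that intent explicit, and a guard in case a hash collision ever left two distinct rows sharing one key, you can spell out the arguments. A self-contained sketch of the same approach:

```python
import pandas as pd

full_dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]],
                              columns=['X', 'Y'])
full_dataframe['key'] = pd.util.hash_pandas_object(full_dataframe, index=False)

# process the unique rows only
processed = full_dataframe.drop_duplicates().copy()
processed['X_process'] = processed['X'] + 1
processed['Y_process'] = processed['Y'] * 2
processed = processed[['key', 'X_process', 'Y_process']]

result = pd.merge(
    full_dataframe, processed, on='key',
    how='left',               # keep every original row, in original order
    validate='many_to_one',   # raise if two right-hand rows share a key
)
```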
    

    I want to give you a different way that may be more efficient than a join:

    import pandas as pd
    
    dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]], columns=['X', 'Y'])
    print(dataframe)
    
    # add a hash key per row so we can cache the result per unique row
    dataframe['key'] = pd.util.hash_pandas_object(dataframe, index=False)
    
    results = {}
    
    def function_process(row, result_dic):
        key = row['key']
        if key in result_dic:
            return result_dic[key]
        process_result = row['X'] + row['Y']
        result_dic[key] = process_result
        return process_result
    
    # process each row, reusing the cached result for duplicate keys
    dataframe['results'] = dataframe.apply(lambda row: function_process(row, results), axis=1)
    print(dataframe)
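If the per-row processing is vectorizable, you can skip the row-wise `apply` and the manual cache entirely: compute the result once per unique key, then broadcast it back to every duplicate with `Series.map`. A sketch using the same toy `X + Y` processing:

```python
import pandas as pd

dataframe = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1], [1, 2], [0, 0]],
                         columns=['X', 'Y'])
dataframe['key'] = pd.util.hash_pandas_object(dataframe[['X', 'Y']], index=False)

# compute once per unique row, indexed by its key...
unique = dataframe.drop_duplicates('key')
per_key = pd.Series((unique['X'] + unique['Y']).values, index=unique['key'])

# ...then broadcast the results back to every duplicate by key
dataframe['results'] = dataframe['key'].map(per_key)
```

This keeps the "process each unique row once" idea of the memoized `apply`, but both the processing and the fan-out run as vectorized pandas operations.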