Search code examples
pythoncombinations

Pairing with itertools combinations - long code execution on large volume of data


I am working on a task to get all possible combinations (pairs) of ID's (companies) which paritipated in one bid and create a new data frame with ID_1, ID_2, matching parameter (tender ID).

I have prepared the following two functions which provide me with the required result however the execution time exceeds several hours when applying to df with more than 500k lines.

import pandas as pd
from itertools import combinations
def get_pairs(dataframe, items):
   return (
       (c1, c2, item)
       for item in items
       for c1, c2 in combinations(dataframe[dataframe['tender_product'] == item]['ID_name'], 2)
   )
def get_all_pairs(df):
        pairs = get_pairs(df,df['tender_product'])
        df= pd.DataFrame(pairs, columns=['ID_name1', 'ID_name2','tender_product'])
        df=df.query('ID_name1 != ID_name2')
        df=df.drop_duplicates()
        df['pair'] = df['ID_name1'].astype(str)+'_'+df['ID_name2'].astype(str)
        df['reversed_pair'] =df['ID_name1'].astype(str)+'_'+df['ID_name2'].astype(str)
        return df

What are the options to optimize the code to make it work faster with the same result?

The sample of intial data:

ID tender_product
1 tender_1
2 tender_1
3 tender_2
4 tender_2

The sample of current output:

ID1 ID2 tender_product pair reversed_pair
1 2 tender_1 1_2 2_1
3 4 tender_2 3_4 4_3

Solution

  • I'm not sure on how efficient this would be (I don't have the 500K lines to test! :) ) But this should do what you want?

    import pandas as pd
    data = {'ID': [1,2,3,4,5,6,7,8],
            'tender_product': ['tender_1', 'tender_1', 'tender_2', 'tender_2', 'tender_3', 'tender_3', 'tender_4', 'tender_4']}
    df = pd.DataFrame(data)
    
    new_df = df.groupby('tender_product')['ID'].apply(list).reset_index() #Group by tender product and send matching IDs to a list stored in the result column
    new_df.rename(columns={'ID': 'ID1'}, inplace=True) #Rename column to ID1
    new_df['ID2'] = new_df['ID1'].apply(lambda x: x[1]) #Extract value from list at position 1 (second value in list)
    new_df['ID1'] = new_df['ID1'].apply(lambda x: x[0]) #Extract value fromlist at position 0 (first value in list) 
    

    #Note: The order is important here. If we replace the ID1 value first, the list held in that cell will cease to exist because we've replaced it

    Finally, we create the combination of IDs in two new columns

    new_df['pair'] = new_df.apply(lambda row: f"{row['ID1']}_{row['ID2']}", axis=1)
    new_df['reversed_pair'] = new_df.apply(lambda row: f"{row['ID2']}_{row['ID1']}", axis=1)
    

    Tested in Jupyter: enter image description here