I am working on a task to get all possible combinations (pairs) of ID's (companies) which paritipated in one bid and create a new data frame with ID_1, ID_2, matching parameter (tender ID).
I have prepared the following two functions which provide me with the required result however the execution time exceeds several hours when applying to df with more than 500k lines.
import pandas as pd
from itertools import combinations
def get_pairs(dataframe, items):
return (
(c1, c2, item)
for item in items
for c1, c2 in combinations(dataframe[dataframe['tender_product'] == item]['ID_name'], 2)
)
def get_all_pairs(df):
pairs = get_pairs(df,df['tender_product'])
df= pd.DataFrame(pairs, columns=['ID_name1', 'ID_name2','tender_product'])
df=df.query('ID_name1 != ID_name2')
df=df.drop_duplicates()
df['pair'] = df['ID_name1'].astype(str)+'_'+df['ID_name2'].astype(str)
df['reversed_pair'] =df['ID_name1'].astype(str)+'_'+df['ID_name2'].astype(str)
return df
What are the options to optimize the code to make it work faster with the same result?
The sample of intial data:
ID | tender_product |
---|---|
1 | tender_1 |
2 | tender_1 |
3 | tender_2 |
4 | tender_2 |
The sample of current output:
ID1 | ID2 | tender_product | pair | reversed_pair |
---|---|---|---|---|
1 | 2 | tender_1 | 1_2 | 2_1 |
3 | 4 | tender_2 | 3_4 | 4_3 |
I'm not sure on how efficient this would be (I don't have the 500K lines to test! :) ) But this should do what you want?
import pandas as pd
data = {'ID': [1,2,3,4,5,6,7,8],
'tender_product': ['tender_1', 'tender_1', 'tender_2', 'tender_2', 'tender_3', 'tender_3', 'tender_4', 'tender_4']}
df = pd.DataFrame(data)
new_df = df.groupby('tender_product')['ID'].apply(list).reset_index() #Group by tender product and send matching IDs to a list stored in the result column
new_df.rename(columns={'ID': 'ID1'}, inplace=True) #Rename column to ID1
new_df['ID2'] = new_df['ID1'].apply(lambda x: x[1]) #Extract value from list at position 1 (second value in list)
new_df['ID1'] = new_df['ID1'].apply(lambda x: x[0]) #Extract value fromlist at position 0 (first value in list)
#Note: The order is important here. If we replace the ID1 value first, the list held in that cell will cease to exist because we've replaced it
Finally, we create the combination of IDs in two new columns
new_df['pair'] = new_df.apply(lambda row: f"{row['ID1']}_{row['ID2']}", axis=1)
new_df['reversed_pair'] = new_df.apply(lambda row: f"{row['ID2']}_{row['ID1']}", axis=1)