Search code examples
pandasvectorization

Merge two pandas data frames with conservation


I'm working with two distinct data frames, AlphaDF and BetaDF, each containing unique columns. My goal is to merge AlphaDF with data from BetaDF, keeping every row from AlphaDF and adding corresponding data from BetaDF where there's a match.

Here's the script I initially wrote:

combined_rows = []
for index, alpha_row in AlphaDF.iterrows():
    match = False
    for b_index, beta_row in BetaDF.iterrows():
        if alpha_row["alpha_id"] == beta_row["beta_id"] and alpha_row['start_point'] >= beta_row['range_start'] and alpha_row['end_point'] <= beta_row['range_end']:
            match = True
            combined_rows.append(alpha_row.tolist() + beta_row.tolist())
            break

    if not match:
        combined_rows.append(alpha_row.tolist() + [np.nan] * len(BetaDF.columns))

merged_dataframe = pd.DataFrame(combined_rows, columns=AlphaDF.columns.tolist() + BetaDF.columns.tolist())

To optimize this, I tried vectorized operations in pandas:

BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')

condition = (merged_dataframe['start_point'] >= merged_dataframe['range_start']) & (merged_dataframe['end_point'] <= merged_dataframe['range_end'])
merged_dataframe = merged_dataframe[condition]

merged_dataframe.fillna(np.nan, inplace=True)

However, this method doesn't preserve every row in AlphaDF. Here's a sample of the data frames:

AlphaDF:

alpha_id, start_point, end_point, reference, score, strand, feature_count, label, value1, value2, value3
A123, 1000, 1050, ref:A123:1000-1050, 0.75, +, 2, primary, NaN, NaN, NaN
A124, 1070, 1100, ref:A124:1070-1100, 0.80, -, 1, secondary, NaN, NaN, NaN
...

BetaDF:

gene_id, alpha_id, range_start, range_end, orientation, is_partial, gene_type, data_version, data_source, gene_function, significance
101, A123, 950, 1075, forward, 0, coding, v1, SourceA, FunctionX, 0.05
102, A124, 1060, 1120, reverse, 1, non-coding, v1, SourceB, FunctionY, 0.03
...

Any suggestions on a more efficient and accurate approach using pandas operations to achieve this merging task?


Solution

  • This happens because merged_dataframe[condition] does not have all AlphaDF rows meeting the condition.

    You can first merge and , set all BetaDF columns to empty values using ~condition opposite.

    import pandas as pd
    import numpy as np
    
    pd.set_option('display.max_rows', None)  
    pd.options.display.expand_frame_repr = False
    
    BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
    merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')
    
    
    condition = ((merged_dataframe['start_point'] >= merged_dataframe['range_start']) &
                 (merged_dataframe['end_point'] <= merged_dataframe['range_end']))
    
    merged_dataframe.loc[~condition, BetaDF.columns[0]:] = np.nan
    
    print(merged_dataframe)
    

    Update: Removed the second option since 'beta_id' and 'alpha_id' may be on different rows.