Merge two pandas data frames with conservation

I'm working with two distinct data frames, AlphaDF and BetaDF, each containing unique columns. My goal is to merge AlphaDF with data from BetaDF, keeping every row from AlphaDF and adding corresponding data from BetaDF where there's a match.

Here's the script I initially wrote:

combined_rows = []
for index, alpha_row in AlphaDF.iterrows():
    match = False
    for b_index, beta_row in BetaDF.iterrows():
        if alpha_row["alpha_id"] == beta_row["beta_id"] and alpha_row['start_point'] >= beta_row['range_start'] and alpha_row['end_point'] <= beta_row['range_end']:
            match = True
            combined_rows.append(alpha_row.tolist() + beta_row.tolist())
            break

    if not match:
        combined_rows.append(alpha_row.tolist() + [np.nan] * len(BetaDF.columns))

merged_dataframe = pd.DataFrame(combined_rows, columns=AlphaDF.columns.tolist() + BetaDF.columns.tolist())

To optimize this, I tried vectorized operations in pandas:

BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')

condition = (merged_dataframe['start_point'] >= merged_dataframe['range_start']) & (merged_dataframe['end_point'] <= merged_dataframe['range_end'])
merged_dataframe = merged_dataframe[condition]

merged_dataframe.fillna(np.nan, inplace=True)

However, this method doesn't preserve every row in AlphaDF. Here's a sample of the data frames:

AlphaDF:

alpha_id, start_point, end_point, reference, score, strand, feature_count, label, value1, value2, value3
A123, 1000, 1050, ref:A123:1000-1050, 0.75, +, 2, primary, NaN, NaN, NaN
A124, 1070, 1100, ref:A124:1070-1100, 0.80, -, 1, secondary, NaN, NaN, NaN
...

BetaDF:

gene_id, alpha_id, range_start, range_end, orientation, is_partial, gene_type, data_version, data_source, gene_function, significance
101, A123, 950, 1075, forward, 0, coding, v1, SourceA, FunctionX, 0.05
102, A124, 1060, 1120, reverse, 1, non-coding, v1, SourceB, FunctionY, 0.03
...

Any suggestions on a more efficient and accurate approach using pandas operations to achieve this merging task?

Solution

This happens because merged_dataframe[condition] does not have all AlphaDF rows meeting the condition.

You can first merge and , set all BetaDF columns to empty values using ~condition opposite.

import pandas as pd
import numpy as np

pd.set_option('display.max_rows', None)  
pd.options.display.expand_frame_repr = False

BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')


condition = ((merged_dataframe['start_point'] >= merged_dataframe['range_start']) &
             (merged_dataframe['end_point'] <= merged_dataframe['range_end']))

merged_dataframe.loc[~condition, BetaDF.columns[0]:] = np.nan

print(merged_dataframe)

Update: Removed the second option since 'beta_id' and 'alpha_id' may be on different rows.