I'm working with two distinct data frames, AlphaDF and BetaDF, each containing unique columns. My goal is to merge AlphaDF with data from BetaDF, keeping every row from AlphaDF and adding corresponding data from BetaDF where there's a match.
Here's the script I initially wrote:
combined_rows = []
for index, alpha_row in AlphaDF.iterrows():
    match = False
    for b_index, beta_row in BetaDF.iterrows():
        if (alpha_row["alpha_id"] == beta_row["beta_id"]
                and alpha_row['start_point'] >= beta_row['range_start']
                and alpha_row['end_point'] <= beta_row['range_end']):
            match = True
            combined_rows.append(alpha_row.tolist() + beta_row.tolist())
            break
    if not match:
        combined_rows.append(alpha_row.tolist() + [np.nan] * len(BetaDF.columns))
merged_dataframe = pd.DataFrame(combined_rows, columns=AlphaDF.columns.tolist() + BetaDF.columns.tolist())
To optimize this, I tried vectorized operations in pandas:
BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')
condition = (merged_dataframe['start_point'] >= merged_dataframe['range_start']) & (merged_dataframe['end_point'] <= merged_dataframe['range_end'])
merged_dataframe = merged_dataframe[condition]
merged_dataframe.fillna(np.nan, inplace=True)
However, this method doesn't preserve every row in AlphaDF. Here's a sample of the data frames:
AlphaDF:
alpha_id, start_point, end_point, reference, score, strand, feature_count, label, value1, value2, value3
A123, 1000, 1050, ref:A123:1000-1050, 0.75, +, 2, primary, NaN, NaN, NaN
A124, 1070, 1100, ref:A124:1070-1100, 0.80, -, 1, secondary, NaN, NaN, NaN
...
BetaDF:
gene_id, alpha_id, range_start, range_end, orientation, is_partial, gene_type, data_version, data_source, gene_function, significance
101, A123, 950, 1075, forward, 0, coding, v1, SourceA, FunctionX, 0.05
102, A124, 1060, 1120, reverse, 1, non-coding, v1, SourceB, FunctionY, 0.03
...
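For a reproducible setup, here is a minimal construction of the two sample rows shown above. Note that the key column in BetaDF is named beta_id here to match the code (the sample header above already shows it as alpha_id), and the value1–value3 columns are just NaN placeholders.
import pandas as pd
import numpy as np

# Two sample rows of AlphaDF (value1-value3 left as NaN placeholders)
AlphaDF = pd.DataFrame({
    'alpha_id': ['A123', 'A124'],
    'start_point': [1000, 1070],
    'end_point': [1050, 1100],
    'reference': ['ref:A123:1000-1050', 'ref:A124:1070-1100'],
    'score': [0.75, 0.80],
    'strand': ['+', '-'],
    'feature_count': [2, 1],
    'label': ['primary', 'secondary'],
    'value1': [np.nan, np.nan],
    'value2': [np.nan, np.nan],
    'value3': [np.nan, np.nan],
})

# Two sample rows of BetaDF (key column named beta_id, as referenced in the code)
BetaDF = pd.DataFrame({
    'gene_id': [101, 102],
    'beta_id': ['A123', 'A124'],
    'range_start': [950, 1060],
    'range_end': [1075, 1120],
    'orientation': ['forward', 'reverse'],
    'is_partial': [0, 1],
    'gene_type': ['coding', 'non-coding'],
    'data_version': ['v1', 'v1'],
    'data_source': ['SourceA', 'SourceB'],
    'gene_function': ['FunctionX', 'FunctionY'],
    'significance': [0.05, 0.03],
})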
Any suggestions on a more efficient and accurate approach using pandas operations to achieve this merging task?
This happens because merged_dataframe[condition] keeps only the rows that satisfy the condition, so any AlphaDF row without a qualifying match is dropped. Instead, you can merge first and then set all BetaDF columns to NaN on the rows where the condition does not hold, using ~condition (the opposite of the condition).
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', None)
pd.options.display.expand_frame_repr = False
# Rename the key so both frames share the join column
BetaDF = BetaDF.rename(columns={'beta_id': 'alpha_id'})
# Left merge keeps every AlphaDF row, matched or not
merged_dataframe = pd.merge(AlphaDF, BetaDF, how='left', on='alpha_id')
# Rows where the Alpha interval falls inside the matched Beta range
condition = ((merged_dataframe['start_point'] >= merged_dataframe['range_start']) &
             (merged_dataframe['end_point'] <= merged_dataframe['range_end']))
# For non-matching rows, clear every column that came from BetaDF
# (they start at BetaDF.columns[0], i.e. 'gene_id', in the merged frame)
merged_dataframe.loc[~condition, BetaDF.columns[0]:] = np.nan
print(merged_dataframe)
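With the two sample rows, both Alpha intervals fall inside their Beta ranges (1000-1050 within 950-1075, and 1070-1100 within 1060-1120), so the BetaDF columns stay populated for both. Any AlphaDF row whose interval is not contained in the matched range keeps its own values and gets NaN in every BetaDF-derived column.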
Update: Removed the second option since 'beta_id' and 'alpha_id' may be on different rows.