So I have been searching a pandas equivalent of this SQL query
SELECT * FROM table1
LEFT JOIN table2
ON table1.columnX>=table2.columnY
*Note that I am joining by an inequality condition, >=
, not by matching columns.
But it seems panda's merge is only able to join by exact matches (like select * from table1 LEFT JOIN table2 ON table1.columnX=table2.columnY
)
It doesn't seems it supports joining by more complex conditions, like a column having a value greater than the other column. Which SQL does support
I have found many resources that say it does not supports that, and that the only way to do so is to do a cartersian product first and then filter the resulting dataframe. Or pre-filtering the dataframes before joining. However, a cartesian product is costly
But those sources I have found are from more than 5 years ago. Is it still the same case today that panda's merge can only join by matching columns exactly and it does not admit inequalities (<,>,<=,>=,between)?
Here are some old resources I have found regarding this:
Merge pandas dataframes where one value is between two others
In Pandas, the merge function is primarily intended for combining based on exact matches between columns.
You can use this code as an example :
import pandas as pd
# Sample data
data1 = {'columnX': [10, 15, 20]}
data2 = {'columnY': [5, 12, 18]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Perform a cartesian product
cartesian = pd.merge(df1.assign(key=1), df2.assign(key=1), on='key').drop('key', axis=1)
# Filter based on inequality condition
result_df = cartesian[cartesian['columnX'] >= cartesian['columnY']]
print(result_df)