Search code examples
pythonsqlpandasdataframejoin

Pandas merge on inequality


So I have been searching a pandas equivalent of this SQL query

SELECT * FROM table1
LEFT JOIN table2
ON table1.columnX>=table2.columnY

*Note that I am joining by an inequality condition, >=, not by matching columns.

But it seems panda's merge is only able to join by exact matches (like select * from table1 LEFT JOIN table2 ON table1.columnX=table2.columnY)

It doesn't seems it supports joining by more complex conditions, like a column having a value greater than the other column. Which SQL does support

I have found many resources that say it does not supports that, and that the only way to do so is to do a cartersian product first and then filter the resulting dataframe. Or pre-filtering the dataframes before joining. However, a cartesian product is costly

But those sources I have found are from more than 5 years ago. Is it still the same case today that panda's merge can only join by matching columns exactly and it does not admit inequalities (<,>,<=,>=,between)?

Here are some old resources I have found regarding this:

Inequality joins in Pandas?

Merge pandas dataframes where one value is between two others

Best way to join / merge by range in pandas


Solution

  • In Pandas, the merge function is primarily intended for combining based on exact matches between columns.

    You can use this code as an example :

    import pandas as pd
    
    # Sample data
    data1 = {'columnX': [10, 15, 20]}
    data2 = {'columnY': [5, 12, 18]}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    
    # Perform a cartesian product
    cartesian = pd.merge(df1.assign(key=1), df2.assign(key=1), on='key').drop('key', axis=1)
    
    # Filter based on inequality condition
    result_df = cartesian[cartesian['columnX'] >= cartesian['columnY']]
    
    print(result_df)