Search code examples
pythondataframepython-polars

How to join two Polars dataframes based on a condition?


Given the following 3 Polars dataframes

journeys = pl.DataFrame({'id':[1,2,3,4,5,6],'order_id':[11,12,13,14,14,15],'order_type':['restaurant','restaurant','restaurant','restaurant','grocery','grocery']})

restaurant_orders = pl.DataFrame({'id':[11,12,13,14],'item_count':[4,7,3,5]})

grocery_orders = pl.DataFrame({'id':[14,15],'item_count':[23,21]})

journeys:

enter image description here

restaurant_orders:

enter image description here

grocery_orders:

enter image description here

I want to bring the item_count to the journeys dataframe.

The naive way is to filter the journeys dataframe based on order_type column, perform the join each filtered dataframe and finally concat them back together.

Is there a Polars idiomatic way to perform a conditional (polymorphic?) join to the journeys dataframe based on the value of order_type.


Solution

  • I don't think it is possible to join on a condition with polars.

    One way to achieve the desired result is to concatenate the two dataframe to join and add them a column order_type.

    restaurant_orders = restaurant_orders.with_columns(order_type = pl.lit("restaurant"))
    grocery_orders = grocery_orders.with_columns(order_type = pl.lit("grocery"))
    orders = pl.concat([restaurant_orders, grocery_orders])
    
    journeys.join(orders, left_on=['order_id', 'order_type'], right_on=['id', 'order_type'], how='left')
    

    If you have multiple dataframe to concatenate like this, you can use a function to do it based on the name of the DataFrame:

    def concat_df(list_df: list) -> pl.DataFrame:
      list_newdf = []
    
      for df in list_df:
        name = [x for x in globals() if globals()[x] is df][0]
        name = name[:-7]
    
        list_newdf.append(df.with_columns(order_type = pl.lit(name)))
        
      return pl.concat(list_newdf)
    
    
    list_df = [restaurant_orders, grocery_orders]
    orders = concat_df(list_df)
    
    journeys.join(orders, left_on=['order_id', 'order_type'], right_on=['id', 'order_type'], how='left')
    

    It will produce the same result. Note that this second solution is dependent of the DataFrame name.