Search code examples
pythonpandasdataframemulti-index

Multi-indexed Dataframe Join (Keep updated data if not NaN and Append on new Indices)


I have two DataFrames with multiple indices named df_base and df_updates. I want to combine these DataFrames into a single DataFrame and keep the multi indices.

>>> import numpy as np
>>> import pandas as pd
>>> df_base = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): 100,
...             ("2019-01-01", "1002"): 100,
...             ("2019-01-01", "1003"): 100,
...             ("2019-01-02", "1001"): 100,
...             ("2019-01-02", "1002"): 100,
...             ("2019-01-02", "1003"): 100,
...             ("2019-01-03", "1001"): 100,
...             ("2019-01-03", "1002"): 100,
...             ("2019-01-03", "1003"): 100,
...         }
...     },
... )
>>> df_base.index.names = ["date", "id"]
>>> df_base.convert_dtypes()
                 price
date       id         
2019-01-01 1001    100
           1002    100
           1003    100
2019-01-02 1001    100
           1002    100
           1003    100
2019-01-03 1001    100
           1002    100
           1003    100
>>>
>>> df_updates = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): np.nan,
...             ("2019-01-01", "1002"): 100,
...             ("2019-01-01", "1003"): 100,
...             ("2019-01-02", "1001"): 100,
...             ("2019-01-02", "1002"): 100,
...             ("2019-01-02", "1003"): 100,
...             ("2019-01-03", "1001"): 100,
...             ("2019-01-03", "1002"): 100,
...             ("2019-01-03", "1003"): 100,
...         }
...     }
... )
>>> df_updates.index.names = ["date", "id"]
>>> df_updates.convert_dtypes()
                 price
date       id         
2019-01-01 1001   <NA>
           1002     99
           1003     99
           1004    100

I want to combine them with the following rules:

  1. Keep the old data if the new data is not specified (NaN)
  2. Append the new data if the indices doesn't exist in the base DataFrame

I already tried using .join but it raise an error

>>> df_base.join(df_updates)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[48], line 21
     ...

ValueError: columns overlap but no suffix specified: Index(['price'], dtype='object')

even if I added suffix, it just make the data more complex (need another solution)

I also already tried using .update, but the new data with different indices from base were not included in the results

>>> df_base.update(df_updates)
>>> df_base
                 price
date       id         
2019-01-01 1001  100.0
           1002   99.0
           1003   99.0
2019-01-02 1001  100.0
           1002  100.0
           1003  100.0
2019-01-03 1001  100.0
           1002  100.0
           1003  100.0

And the last, I also try a "tricky" operation

>>> df_base.update(df_updates)
>>> df_base = df_updates.combine_first(df_base)
>>> df_base
                 price
date       id         
2019-01-01 1001  100.0
           1002   99.0
           1003   99.0
           1004  100.0
2019-01-02 1001  100.0
           1002  100.0
           1003  100.0
2019-01-03 1001  100.0
           1002  100.0
           1003  100.0

It is the result that I expected, but I'm not sure if it's the best solution for this case, I try using %timeit, and the results are

>>> %timeit df_base.update(df_updates)
345 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df_updates.combine_first(df_base)
1.36 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

when using large data the results are

>>> %timeit df_base.update(df_updates)
2.38 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df_updates.combine_first(df_base)
9.65 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Is that the best solution for my case or is there any more efficient/optimized function (I expected a single liner pandas function)? Thanks!

Edit 1: Full Code

import numpy as np
import pandas as pd

df_base = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): 100,
            ("2019-01-01", "1002"): 100,
            ("2019-01-01", "1003"): 100,
            ("2019-01-02", "1001"): 100,
            ("2019-01-02", "1002"): 100,
            ("2019-01-02", "1003"): 100,
            ("2019-01-03", "1001"): 100,
            ("2019-01-03", "1002"): 100,
            ("2019-01-03", "1003"): 100,
        }
    },
)
df_base.index.names = ["date", "id"]
df_base.convert_dtypes()


df_updates = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): np.nan,
            ("2019-01-01", "1002"): 100,
            ("2019-01-01", "1003"): 100,
            ("2019-01-02", "1001"): 100,
            ("2019-01-02", "1002"): 100,
            ("2019-01-02", "1003"): 100,
            ("2019-01-03", "1001"): 100,
            ("2019-01-03", "1002"): 100,
            ("2019-01-03", "1003"): 100,
        }
    }
)
df_updates.index.names = ["date", "id"]
df_updates.convert_dtypes()

df_base.update(df_updates)
df_base = df_updates.combine_first(df_base)
df_base

Solution

  • You shouldn't need to update then combine_first, just combine_first:

    df_base = df_updates.combine_first(df_base)
    

    Output:

                     price
    date       id         
    2019-01-01 1001  100.0
               1002   99.0
               1003   99.0
               1004  100.0
    2019-01-02 1001  100.0
               1002  100.0
               1003  100.0
    2019-01-03 1001  100.0
               1002  100.0
               1003  100.0