I have two DataFrames with a MultiIndex, named df_base and df_updates. I want to combine them into a single DataFrame and keep the MultiIndex.
>>> import numpy as np
>>> import pandas as pd
>>> df_base = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): 100,
...             ("2019-01-01", "1002"): 100,
...             ("2019-01-01", "1003"): 100,
...             ("2019-01-02", "1001"): 100,
...             ("2019-01-02", "1002"): 100,
...             ("2019-01-02", "1003"): 100,
...             ("2019-01-03", "1001"): 100,
...             ("2019-01-03", "1002"): 100,
...             ("2019-01-03", "1003"): 100,
...         }
...     },
... )
>>> df_base.index.names = ["date", "id"]
>>> df_base.convert_dtypes()
price
date id
2019-01-01 1001 100
1002 100
1003 100
2019-01-02 1001 100
1002 100
1003 100
2019-01-03 1001 100
1002 100
1003 100
>>>
>>> df_updates = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): np.nan,
...             ("2019-01-01", "1002"): 99,
...             ("2019-01-01", "1003"): 99,
...             ("2019-01-01", "1004"): 100,
...         }
...     }
... )
>>> df_updates.index.names = ["date", "id"]
>>> df_updates.convert_dtypes()
price
date id
2019-01-01 1001 <NA>
1002 99
1003 99
1004 100
I want to combine them with the following rules:

- non-NaN values in df_updates overwrite the matching rows in df_base
- rows that exist only in df_updates (like ("2019-01-01", "1004")) are appended
- where df_updates has NaN, the original value from df_base is kept
I already tried using .join, but it raises an error:
>>> df_base.join(df_updates)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[48], line 21
...
ValueError: columns overlap but no suffix specified: Index(['price'], dtype='object')
Even if I add suffixes, it just puts the two price columns side by side, which makes the data more complex, so I need another solution.
I also tried using .update, but rows whose indices exist only in df_updates were not included in the result:
>>> df_base.update(df_updates)
>>> df_base
price
date id
2019-01-01 1001 100.0
1002 99.0
1003 99.0
2019-01-02 1001 100.0
1002 100.0
1003 100.0
2019-01-03 1001 100.0
1002 100.0
1003 100.0
Finally, I tried a "tricky" two-step operation:
>>> df_base.update(df_updates)
>>> df_base = df_updates.combine_first(df_base)
>>> df_base
price
date id
2019-01-01 1001 100.0
1002 99.0
1003 99.0
1004 100.0
2019-01-02 1001 100.0
1002 100.0
1003 100.0
2019-01-03 1001 100.0
1002 100.0
1003 100.0
It gives the result I expected, but I'm not sure it's the best solution for this case. I measured with %timeit, and the results are:
>>> %timeit df_base.update(df_updates)
345 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df_updates.combine_first(df_base)
1.36 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
When using larger data, the results are:
>>> %timeit df_base.update(df_updates)
2.38 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df_updates.combine_first(df_base)
9.65 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is this the best solution for my case, or is there a more efficient/optimized function (ideally a one-liner pandas function)? Thanks!
import numpy as np
import pandas as pd
df_base = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): 100,
            ("2019-01-01", "1002"): 100,
            ("2019-01-01", "1003"): 100,
            ("2019-01-02", "1001"): 100,
            ("2019-01-02", "1002"): 100,
            ("2019-01-02", "1003"): 100,
            ("2019-01-03", "1001"): 100,
            ("2019-01-03", "1002"): 100,
            ("2019-01-03", "1003"): 100,
        }
    },
)
df_base.index.names = ["date", "id"]
df_base.convert_dtypes()
df_updates = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): np.nan,
            ("2019-01-01", "1002"): 99,
            ("2019-01-01", "1003"): 99,
            ("2019-01-01", "1004"): 100,
        }
    }
)
df_updates.index.names = ["date", "id"]
df_updates.convert_dtypes()
df_base.update(df_updates)
df_base = df_updates.combine_first(df_base)
df_base
You shouldn't need to update then combine_first, just combine_first:
df_base = df_updates.combine_first(df_base)
Output:
price
date id
2019-01-01 1001 100.0
1002 99.0
1003 99.0
1004 100.0
2019-01-02 1001 100.0
1002 100.0
1003 100.0
2019-01-03 1001 100.0
1002 100.0
1003 100.0
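For reference, a self-contained sketch (the frames are rebuilt inline with the values from the question's expected output). It also shows a possible concat + groupby alternative as a sketch, not a benchmarked recommendation:

```python
import numpy as np
import pandas as pd

# Rebuild the two frames; df_updates only carries the 2019-01-01 rows,
# including the new id "1004".
df_base = pd.DataFrame(
    {
        "price": {
            (d, i): 100
            for d in ("2019-01-01", "2019-01-02", "2019-01-03")
            for i in ("1001", "1002", "1003")
        }
    }
)
df_base.index.names = ["date", "id"]

df_updates = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): np.nan,
            ("2019-01-01", "1002"): 99,
            ("2019-01-01", "1003"): 99,
            ("2019-01-01", "1004"): 100,
        }
    }
)
df_updates.index.names = ["date", "id"]

# Single call: df_updates wins wherever it has a non-NaN value,
# df_base fills the rest, and the union of both MultiIndexes is kept.
result = df_updates.combine_first(df_base)

# Alternative one-liner: last() takes the last non-NaN value per
# (date, id) group, so df_updates (stacked second) overrides df_base
# except where it is NaN.
alternative = (
    pd.concat([df_base, df_updates])
    .groupby(level=["date", "id"])
    .last()
)

print(result)
print(result.equals(alternative))
```

Note that combine_first aligns on the union of the two indexes, so new (date, id) pairs from df_updates are appended automatically; the int column becomes float because the alignment introduces NaN.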