Let's say I have 3 df
such as the ones below:
df = pd.DataFrame({'Week': ['W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7'], 'A': [34, 67, 92, 31, 90, 100, 101]})
df2 = pd.DataFrame({'Week': ['W4', 'W5', 'W6', 'W7', 'W8', 'W9', 'W10'], 'B': [75, np.nan, 53, 21, 94, 47, 88]})
df3 = pd.DataFrame({'Week': ['W12', 'W13', 'W14', 'W15', 'W16', 'W17', 'W18'], 'C': [25, 30, 40, 45, 46, 47, 48]})
Normally when I am building my database of stock prices I would use pd.merge
and use the very helpful on='Week'
(in this instance) function to merge the dataframes together using the Week column. So my code would be something like this:
df = pd.merge(df, df2, on='Week', how='left')
df = pd.merge(df, df3, on='Week', how='left')
Producing the following df
:
Week A B C
0 W1 34 NaN NaN
1 W2 67 NaN NaN
2 W3 92 NaN NaN
3 W4 31 75.0 NaN
4 W5 90 NaN NaN
5 W6 100 53.0 NaN
6 W7 101 21.0 NaN
This is perfect, I only want to see data for W1 - W7, if there's no data I just want NaNs.
I'm led to believe that using pd.concat
is much faster than pd.merge
and as I am looking at hundreds of stocks this could really help reduce the time it takes to build my df
. But it's really important of course that the dates should match exactly, hence why I have been using the on=
function in pd.merge
.
I can't so far work out how to replicate this behaviour using pd.concat
. Does anyone have any suggestions? The sort of thing I have tried so far looks like this:
df = pd.concat([df, df2], sort=True).groupby('Week').mean()
But this results in the following, which isn't even close to what I want:
A B
Week
W1 34.0 NaN
W10 NaN 88.0
W2 67.0 NaN
W3 92.0 NaN
W4 31.0 75.0
W5 90.0 NaN
W6 100.0 53.0
W7 101.0 21.0
W8 NaN 94.0
W9 NaN 47.0
Any help would be really appreciated, cheers
EDIT:
Sorry, just to clarify, my expected output is the df
halfway down, this one:
Week A B C
0 W1 34 NaN NaN
1 W2 67 NaN NaN
2 W3 92 NaN NaN
3 W4 31 75.0 NaN
4 W5 90 NaN NaN
5 W6 100 53.0 NaN
6 W7 101 21.0 NaN
You could do:
concated = pd.concat([df, df2, df3], sort=False).groupby('Week').first()
result = concated[concated.index.isin(('W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7'))]
print(result)
Output
A B C
Week
W1 34.0 NaN NaN
W2 67.0 NaN NaN
W3 92.0 NaN NaN
W4 31.0 75.0 NaN
W5 90.0 NaN NaN
W6 100.0 53.0 NaN
W7 101.0 21.0 NaN