Search code examples
pythonpandasmergeconcatenation

Using pd.concat to replicate pd.merge Pandas


Let's say I have 3 df such as the ones below:

df = pd.DataFrame({'Week': ['W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7'], 'A': [34, 67, 92, 31, 90, 100, 101]})
df2 = pd.DataFrame({'Week': ['W4', 'W5', 'W6', 'W7', 'W8', 'W9', 'W10'], 'B': [75, np.nan, 53, 21, 94, 47, 88]})
df3 = pd.DataFrame({'Week': ['W12', 'W13', 'W14', 'W15', 'W16', 'W17', 'W18'], 'C': [25, 30, 40, 45, 46, 47, 48]})

Normally when I am building my database of stock prices I would use pd.merge and use the very helpful on='Week' (in this instance) function to merge the dataframes together using the Week column. So my code would be something like this:

df = pd.merge(df, df2, on='Week', how='left')
df = pd.merge(df, df3, on='Week', how='left')

Producing the following df:

  Week    A     B   C
0   W1   34   NaN NaN
1   W2   67   NaN NaN
2   W3   92   NaN NaN
3   W4   31  75.0 NaN
4   W5   90   NaN NaN
5   W6  100  53.0 NaN
6   W7  101  21.0 NaN

This is perfect, I only want to see data for W1 - W7, if there's no data I just want NaNs.

I'm led to believe that using pd.concat is much faster than pd.merge and as I am looking at hundreds of stocks this could really help reduce the time it takes to build my df. But it's really important of course that the dates should match exactly, hence why I have been using the on= function in pd.merge.

I can't so far work out how to replicate this behaviour using pd.concat. Does anyone have any suggestions? The sort of thing I have tried so far looks like this:

df = pd.concat([df, df2], sort=True).groupby('Week').mean()

But this results in the following, which isn't even close to what I want:

          A     B
Week             
W1     34.0   NaN
W10     NaN  88.0
W2     67.0   NaN
W3     92.0   NaN
W4     31.0  75.0
W5     90.0   NaN
W6    100.0  53.0
W7    101.0  21.0
W8      NaN  94.0
W9      NaN  47.0

Any help would be really appreciated, cheers

EDIT:

Sorry, just to clarify, my expected output is the df halfway down, this one:

  Week    A     B   C
0   W1   34   NaN NaN
1   W2   67   NaN NaN
2   W3   92   NaN NaN
3   W4   31  75.0 NaN
4   W5   90   NaN NaN
5   W6  100  53.0 NaN
6   W7  101  21.0 NaN

Solution

  • You could do:

    concated = pd.concat([df, df2, df3], sort=False).groupby('Week').first()
    result = concated[concated.index.isin(('W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7'))]
    print(result)
    

    Output

              A     B   C
    Week                 
    W1     34.0   NaN NaN
    W2     67.0   NaN NaN
    W3     92.0   NaN NaN
    W4     31.0  75.0 NaN
    W5     90.0   NaN NaN
    W6    100.0  53.0 NaN
    W7    101.0  21.0 NaN