Search code examples
pythonpandasdataframeleft-joinself-join

How to apply Self Join on Pandas Data Frame


I am stuck basically joining a Pandas series with a DataFrame.

Let's generate some dummy data using code below

test = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
                              '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10',
                              '2021-01-11', '2021-01-12', '2021-01-13', '2021-01-14'],
                     'New_Date': [np.nan, '2021-01-01', '2021-01-01', '2021-01-04', '2021-01-03',
                                    '2021-01-06', '2021-01-08', '2021-01-08', '2021-01-09', '2021-01-11',
                                    '2021-01-11', '2021-01-13', '2021-01-13', np.nan],
                     'Price': [1, 1, 5, 3, 4, 3, 2, 5, 6, 4, 3, 2, 1, 7]})
test['Date'] = pd.to_datetime(test['Date'])
test['New_Date'] = pd.to_datetime(test['New_Date'])
test.set_index('Date', inplace=True)

Actual Df

+------------+------------+-------+
|    Date    |  New_Date  | Price |
+------------+------------+-------+
| 01/01/2021 | NaT        |     1 |
| 02/01/2021 | 01/01/2021 |     1 |
| 03/01/2021 | 01/01/2021 |     5 |
| 04/01/2021 | 04/01/2021 |     3 |
| 05/01/2021 | 03/01/2021 |     4 |
| 06/01/2021 | 06/01/2021 |     3 |
| 07/01/2021 | 08/01/2021 |     2 |
| 08/01/2021 | 08/01/2021 |     5 |
| 09/01/2021 | 09/01/2021 |     6 |
| 10/01/2021 | 11/01/2021 |     4 |
| 11/01/2021 | 11/01/2021 |     3 |
| 12/01/2021 | 13/01/2021 |     2 |
| 13/01/2021 | 13/01/2021 |     1 |
| 14/01/2021 | NaT        |     7 |
+------------+------------+-------+

Desired Output

+------------+------------+-------+-----------+
|    Date    |  New_Date  | Price | New_Price |
+------------+------------+-------+-----------+
| 01/01/2021 | NaT        |     1 | NaN       |
| 02/01/2021 | 01/01/2021 |     1 | 1         |
| 03/01/2021 | 01/01/2021 |     5 | 1         |
| 04/01/2021 | 04/01/2021 |     3 | 3         |
| 05/01/2021 | 03/01/2021 |     4 | 5         |
| 06/01/2021 | 06/01/2021 |     3 | 3         |
| 07/01/2021 | 08/01/2021 |     2 | 5         |
| 08/01/2021 | 08/01/2021 |     5 | 5         |
| 09/01/2021 | 09/01/2021 |     6 | 6         |
| 10/01/2021 | 11/01/2021 |     4 | 3         |
| 11/01/2021 | 11/01/2021 |     3 | 3         |
| 12/01/2021 | 13/01/2021 |     2 | 1         |
| 13/01/2021 | 13/01/2021 |     1 | 1         |
| 14/01/2021 | NaT        |     7 | NaN       |
+------------+------------+-------+-----------+

I want to create a column New_Price by using New_Date as an index and Joining with Date to get Price which will be named as New_Price.

As per this link

I've tried the solution below:

test['New_Price'] = test['Price'][test['New_Date']].values

The above solution fails due to NaT, this is not basically an actual join so I tried an another way

test.join(test.drop('New_Date', 1), on='New_Date', rsuffix='_y')

This solves the problem as I just need to rename Price_y to New_Price but if there are 20 columns in test df how can I keep columns from left df and only price column named as New_Price which would come from right df. Is there any elegant way to achieve this?


Solution

  • Let's try join only on the Price column aligned on New_Date:

    new_df = test.join(test['Price'].rename('New_Price'), on='New_Date')
    

    new_df

                 New_Date  Price  New_Price
    Date                                   
    2021-01-01        NaT      1        NaN
    2021-01-02 2021-01-01      1        1.0
    2021-01-03 2021-01-01      5        1.0
    2021-01-04 2021-01-04      3        3.0
    2021-01-05 2021-01-03      4        5.0
    2021-01-06 2021-01-06      3        3.0
    2021-01-07 2021-01-08      2        5.0
    2021-01-08 2021-01-08      5        5.0
    2021-01-09 2021-01-09      6        6.0
    2021-01-10 2021-01-11      4        3.0
    2021-01-11 2021-01-11      3        3.0
    2021-01-12 2021-01-13      2        1.0
    2021-01-13 2021-01-13      1        1.0
    2021-01-14        NaT      7        NaN