Search code examples
pythonalgorithmdata-structures

Is there a way to do an Excel INDEX/MATCH within Python using two dataframes as inputs


Have googled and only relevant answers prescribe a merge which isn't applicable in my case.

I have two data frames:

da

2023-08-14 06:30:01 B   C   D   E   F   G
2023-08-14 06:30:01 B   C   D   E   F   G
2023-08-14 06:30:02 B   C   D   E   F   G
2023-08-14 06:30:03 B   C   D   E   F   G
2023-08-14 06:30:04 B   C   D   E   F   G
2023-08-14 06:30:05 B   C   D   E   F   G
2023-08-14 06:30:06 A   B   C   E   F   G
2023-08-14 06:30:07 A   B   C   E   F   G

db

2023-08-14 06:30:01 28  26  8   -7  -17 -14
2023-08-14 06:30:01 28  26  8   -7  -17 -14
2023-08-14 06:30:02 28  26  8   -5  -17 -14
2023-08-14 06:30:03 28  26  5   -5  -17 -14
2023-08-14 06:30:04 28  26  5   -11 -17 -14
2023-08-14 06:30:05 28  26  5   -11 -17 -10
2023-08-14 06:30:06 33  28  26  -11 -17 -10
2023-08-14 06:30:07 34  28  26  -11 -17 -10

I'd like to return a combination of the two using a unique list of values from da, in order, as columns, and match column and time to return the corresponding value from db as value in resulting dataframe like the below:

dc

                    A   B   C   D   E   F   G
2023-08-14 06:30:01 0   28  26  8   -7  -17 -14
2023-08-14 06:30:01 0   28  26  8   -7  -17 -14
2023-08-14 06:30:02 0   28  26  8   -5  -17 -14
2023-08-14 06:30:03 0   28  26  5   -5  -17 -14
2023-08-14 06:30:04 0   28  26  5   -11 -17 -14
2023-08-14 06:30:05 0   28  26  5   -11 -17 -10
2023-08-14 06:30:06 33  28  26  0   -11 -17 -10
2023-08-14 06:30:07 34  28  26  0   -11 -17 -10

There's a one-to-one correspondence between da and db (ie, same number of rows and columns), so could do this row by row but would prefer a solution that doesn't involve iteration as the results aren't path dependent in any way.

I'm able to create column headers for dc via a map/set:

from itertools import chain

a = list(map(set,da.values.T))
b = list(set(chain.from_iterable(a)))
dc = pd.DataFrame(columns = b)

but how do I populate the resulting dataframe per the logic above?


Solution

  • Here's one approach:

    Minimal Reproducible Example

    import pandas as pd
    import numpy as np
    
    idx = pd.date_range('2024-01-01', periods=2)
    
    dfa = pd.DataFrame([[*'BCD'], [*'ABD']], index=idx)
    dfb = pd.DataFrame([[1, 2, 3]]*2, index=idx)
    
    # dfa
    
                0  1  2
    2024-01-01  B  C  D
    2024-01-02  A  B  D
    
    # dfb
    
                0  1  2
    2024-01-01  1  2  3
    2024-01-02  1  2  3
    

    Code

    cols, col_indices = np.unique(dfa.values, return_inverse=True)
    
    n_rows, n_cols = dfb.shape
    
    data = np.zeros((n_rows, len(cols)), dtype=dfb.values.dtype)
    
    row_indices = np.repeat(np.arange(n_rows), n_cols)
    data[row_indices, col_indices] = dfb.values.ravel()
    
    dfc = pd.DataFrame(data, columns=cols, index=dfb.index)
    

    Output:

                A  B  C  D
    2024-01-01  0  1  2  3
    2024-01-02  1  2  0  3
    

    Explanation / intermediates

    • Use np.unique to get sorted unique values for dfa as columns + flat col_indices via return_inverse=True.
    np.unique(dfa.values, return_inverse=True)
    
    (array(['A', 'B', 'C', 'D'], dtype=object),
     array([1, 2, 3, 0, 1, 3], dtype=int64))
    
    • Get n_rows and n_cols via df.shape.
    • Use n_rows and len(cols) inside np.zeros to set up the array for our target, dfc.
    data
    
    array([[0, 0, 0, 0],
           [0, 0, 0, 0]], dtype=int64)
    
    • Now, create row_indices using np.repeat on a range of length n_rows with n_cols as repeats.
    row_indices
    
    array([0, 0, 0, 1, 1, 1])
    
    • Use df.values + np.ravel to get dfb as a flattened array and assign to data based on our indices.
    • Finally, use pd.DataFrame to create dfc, passing the original index.