Have googled and only relevant answers prescribe a merge which isn't applicable in my case.
I have two data frames:
da
2023-08-14 06:30:01 B C D E F G
2023-08-14 06:30:01 B C D E F G
2023-08-14 06:30:02 B C D E F G
2023-08-14 06:30:03 B C D E F G
2023-08-14 06:30:04 B C D E F G
2023-08-14 06:30:05 B C D E F G
2023-08-14 06:30:06 A B C E F G
2023-08-14 06:30:07 A B C E F G
db
2023-08-14 06:30:01 28 26 8 -7 -17 -14
2023-08-14 06:30:01 28 26 8 -7 -17 -14
2023-08-14 06:30:02 28 26 8 -5 -17 -14
2023-08-14 06:30:03 28 26 5 -5 -17 -14
2023-08-14 06:30:04 28 26 5 -11 -17 -14
2023-08-14 06:30:05 28 26 5 -11 -17 -10
2023-08-14 06:30:06 33 28 26 -11 -17 -10
2023-08-14 06:30:07 34 28 26 -11 -17 -10
I'd like to return a combination of the two using a unique list of values from da
, in order, as columns, and match column and time to return the corresponding value from db
as value in resulting dataframe like the below:
dc
A B C D E F G
2023-08-14 06:30:01 0 28 26 8 -7 -17 -14
2023-08-14 06:30:01 0 28 26 8 -7 -17 -14
2023-08-14 06:30:02 0 28 26 8 -5 -17 -14
2023-08-14 06:30:03 0 28 26 5 -5 -17 -14
2023-08-14 06:30:04 0 28 26 5 -11 -17 -14
2023-08-14 06:30:05 0 28 26 5 -11 -17 -10
2023-08-14 06:30:06 33 28 26 0 -11 -17 -10
2023-08-14 06:30:07 34 28 26 0 -11 -17 -10
There's a one-to-one correspondence between da
and db
(ie, same number of rows and columns), so could do this row by row but would prefer a solution that doesn't involve iteration as the results aren't path dependent in any way.
I'm able to create column headers for dc
via a map/set
:
from itertools import chain
a = list(map(set,da.values.T))
b = list(set(chain.from_iterable(a)))
dc = pd.DataFrame(columns = b)
but how do I populate the resulting dataframe per the logic above?
Here's one approach:
import pandas as pd
import numpy as np
idx = pd.date_range('2024-01-01', periods=2)
dfa = pd.DataFrame([[*'BCD'], [*'ABD']], index=idx)
dfb = pd.DataFrame([[1, 2, 3]]*2, index=idx)
# dfa
0 1 2
2024-01-01 B C D
2024-01-02 A B D
# dfb
0 1 2
2024-01-01 1 2 3
2024-01-02 1 2 3
Code
cols, col_indices = np.unique(dfa.values, return_inverse=True)
n_rows, n_cols = dfb.shape
data = np.zeros((n_rows, len(cols)), dtype=dfb.values.dtype)
row_indices = np.repeat(np.arange(n_rows), n_cols)
data[row_indices, col_indices] = dfb.values.ravel()
dfc = pd.DataFrame(data, columns=cols, index=dfb.index)
Output:
A B C D
2024-01-01 0 1 2 3
2024-01-02 1 2 0 3
Explanation / intermediates
np.unique
to get sorted unique values for dfa
as columns
+ flat col_indices
via return_inverse=True
.np.unique(dfa.values, return_inverse=True)
(array(['A', 'B', 'C', 'D'], dtype=object),
array([1, 2, 3, 0, 1, 3], dtype=int64))
n_rows
and n_cols
via df.shape
.n_rows
and len(cols)
inside np.zeros
to set up the array for our target, dfc
.data
array([[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=int64)
row_indices
using np.repeat
on a range of length n_rows
with n_cols
as repeats
.row_indices
array([0, 0, 0, 1, 1, 1])
df.values
+ np.ravel
to get dfb
as a flattened array and assign to data
based on our indices.pd.DataFrame
to create dfc
, passing the original index.