I am given a 2D numpy array and a huge pandas DataFrame. A dummy example of them would look somewhat like this:
arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137285, 2],
                [1648137286, 3],
                ...
                [1658137287, 4],
                [1658137288, 5],
                [1658137289, 6]])
df.head(-6)
unix ... value_a
0 1643137283 ... 23
1 1643137284 ... 54
2 1643137285 ... 25
... ... ... ...
10036787 1653174068 ... 75
10036788 1653174069 ... 65
10036789 1653174070 ... 23
The first column of arr holds a unix timestamp and the second an id-value. The DataFrame also has a column with unix timestamps. My goal is to map the id-values from arr, based on the unix timestamp, onto the corresponding timestamps of df in a separate new column called 'index'.
Now, these are probably important notes:

- df contains only a portion of all timestamps from arr
- df and arr have different lengths along axis=0
- the timestamps in df are ordered in sequences and repeat themselves
- arr contains all unix timestamps from df, but not the other way around
- unit='ms': some timestamps are off by +/-1 or +/-2; however, in my use case they can be seen as identical

I could do this within a loop or with np.where(). However, as arr and df are quite large, I was hoping for a fast solution.
The idea is to convert the numpy array to a mapping of key-value pairs, where the key is the unix timestamp and the value is the corresponding id; then you can use Series.map to substitute the values in the given dataframe:
df['index'] = df['unix'].map(dict(arr))
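As a minimal self-contained sketch with dummy data (values are illustrative, not the question's real rows):

```python
import numpy as np
import pandas as pd

# small dummy versions of the question's arr and df
arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137285, 2]])
df = pd.DataFrame({'unix': [1648137284, 1648137285, 1648137283],
                   'value_a': [54, 25, 23]})

# dict(arr) turns each [timestamp, id] row into a timestamp -> id pair,
# so map() does a vectorized dictionary lookup per row of df['unix']
df['index'] = df['unix'].map(dict(arr))
print(df)
```

Note that dict(arr) only works because arr is a 2-column array; each row becomes one key-value pair.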
Sample output
unix ... value_a index
0 1643137283 ... 23 0
1 1643137284 ... 54 1
2 1643137285 ... 25 2
10036787 1653174068 ... 75 3
10036788 1653174069 ... 65 5
10036789 1653174070 ... 23 6
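One caveat: a dict lookup needs exact key matches, so the timestamps that are off by +/-1 or +/-2 would come back as NaN. If that matters, one hedged option (not from the original answer) is pd.merge_asof with direction='nearest' and a tolerance, which requires both sides to be sorted on the key:

```python
import numpy as np
import pandas as pd

# dummy data: df timestamps are deliberately off by +/-1 from arr
arr = np.array([[1648137283, 0],
                [1648137286, 3],
                [1648137289, 6]])
df = pd.DataFrame({'unix': [1648137284, 1648137287, 1648137288],
                   'value_a': [54, 25, 23]})

ids = pd.DataFrame(arr, columns=['unix', 'index']).sort_values('unix')
# merge_asof matches each df row to the nearest arr timestamp within
# tolerance=2, absorbing the +/-1 / +/-2 jitter from the question
out = pd.merge_asof(df.sort_values('unix'), ids,
                    on='unix', direction='nearest', tolerance=2)
print(out)
```

Timestamps with no arr entry within the tolerance end up as NaN in 'index', so you can still spot genuine misses.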