Search code examples
pythonpandasdataframenumpyunix-timestamp

Assign values from Numpy Array to Pandas DataFrame based on almost matching unix timestamp


I am given a 2D numpy array and a huge pandas DataFrame. A dummy example of them would look somewhat like this:

arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137285, 2],
                [1648137286, 3],
                .....
                [1658137287, 4],
                [1658137288, 5],
                [1658137289, 6]])

df.head(-6)
            unix         ...   value_a 
0           1643137283   ...     23
1           1643137284   ...     54
2           1643137285   ...     25
...          ...         ...     ...   
10036787    1653174068   ...     75
10036788    1653174069   ...     65
10036789    1653174070   ...     23

In the first column of arr is a unix timestamp and in the second an id-value. The DataFrame also has a column for the unix timestamp. My goal is to map the id-value from arr based on the unix timestamp to the corresponding timestamp of df in a separate new column called 'index'.

Now, these are probably important notes:

  • df contains only a portion of all timestamps from arr
  • df and arr have different lengths along the axis=0
  • the timestamps in df are ordered in sequences and repeat themselves
  • arr contains all unix timestamps from df but not the way around
  • about 1% of the unix values do not match perfectly. My unix is in unit='ms', some timestamps are off by +/-1 or +/-2, however, in my use cases they can bee seen as identical

I could do this within a loop or with np.where(). However, as arr and df are quite large, I was hoping for a fast solution.


Solution

  • The idea is to convert the numpy array to a mapping containing key-val pairs, where key is unix timestamps and value is correponding id, then you can use series.map to substitute/map the values in the given dataframe

    df['index'] = df['unix'].map(dict(arr))
    

    Sample output

                    unix  ...  value_a  index
    0         1643137283  ...       23      0
    1         1643137284  ...       54      1
    2         1643137285  ...       25      2
    10036787  1653174068  ...       75      3
    10036788  1653174069  ...       65      5
    10036789  1653174070  ...       23      6