Search code examples
pythonpandasdataframegeometrydata-analysis

Python | Pandas Dataframe: find for each entry in df A the closest timestamp in df B


I am new to python pandas. What I have are 2 pandas dataframes. Among other data both of them contain a timestamp column.

Assume we have df A

x y z timestamp
1 2 3 1.4
4 5 6 1.73
7 8 9 4.1

and df B is:

x y z timestamp
7 4 1 1.7
8 5 2 1.73
9 6 3 3.5
4 5 6 4.8

I would like to compute for each row in A the difference to the position in B which is closest to the timestamp in A. We can assume that the df are both sorted by timestamp. However these timestamps do not share the same start or end time but do certainly have some overlap.

Furthermore the two data frames are not necessarily of same length. I have a brute force implementation in place which does exactly what I want and which I also can easily extend to potentially interpolate between timestamps -- something which I want to achieve in an improved version. However, my implementation is terribly slow.

I am sure there is a more performant way of implementing the following:

idxA = 0
idxB = 0
endA = len(A)
endB = len(B)

while idxA < endA and idxB < endB:
  currentA_ts = A['timestamp'][idxA]
  currentB_ts = B['timestamp'][idxB]
  if idxB < endB-1:
    nextB_ts = B['timestamp'][idxB+1]
    if abs(currentB_ts - currentA_ts) > abs(nextB_ts - currentA_ts):
      idxB += 1

  currentClosestB_row = B.iloc[idxB]
  currentA_row = A.iloc[idxA]

  B_location = currentClosestB_row[['x','y','z']]
  A_location = currentA_row[['x', 'y', 'z']]

  direction = get_direction_vector(B_location, A_location)

  currentA_row['dir_x'] = direction[0]
  currentA_row['dir_y'] = direction[1]
  currentA_row['dir_z'] = direction[2]

  out_df.append(currentA_row)

  idxA += 1

I hope that code snippet clarifies what I try to achieve. But as mentioned above, this is terribly slow as the df A and B both have several 100k entries.

I see two ways of improving the above code:

  1. The general structure of how I try to achieve the described goal.
  2. I can imagine that how I use python and pandas is not optimal. I am using pandas for the very first time, also python is not my main programming language - so please let me know in case you see something that can be improved.

Any feedback on how to speed up that code is highly appreciated.

Many thanks in advance.


Solution

  • Matching rows to the closest values is called an asof merge, i.e. a “left join except that we match on nearest key rather than equal keys” − both columns need to be sorted.

    >>> pd.merge_asof(df1, df2, on='timestamp', suffixes=('_a', '_b'), direction='nearest')
       x_a  y_a  z_a  timestamp  x_b  y_b  z_b
    0    1    2    3       1.40    7    4    1
    1    4    5    6       1.73    8    5    2
    2    7    8    9       4.10    9    6    3
    

    If you want to be able to subtract the 2 timestamp columns, you need the named differently. You can add suffixes before the merge:

    >>> df = pd.merge_asof(df1.add_suffix('_a'), df2.add_suffix('_b'), direction='nearest',
    ...                    left_on='timestamp_a', right_on='timestamp_b')
    >>> df['delta'] = df['timestamp_a'] - df['timestamp_b']
    >>> df
       x_a  y_a  z_a  timestamp_a  x_b  y_b  z_b  timestamp_b  delta
    0    1    2    3         1.40    7    4    1         1.70   -0.3
    1    4    5    6         1.73    8    5    2         1.73    0.0
    2    7    8    9         4.10    9    6    3         3.50    0.6