Search code examples
pythonpandastime-series

Efficiently Resampling and Interpolating Pandas DataFrames with Millisecond Accuracy


I have a Pandas DataFrame with timestamps that have millisecond accuracy and corresponding altitude values. I want to resample and interpolate this data efficiently. Here is a simple example:

import pandas as pd
import numpy as np

# Generate 5 random timestamps within the same minute with millisecond accuracy
base_timestamp = pd.Timestamp.now().floor("min")  # Get the current time, floored to the nearest minute
timestamps = [
    base_timestamp + pd.to_timedelta(np.random.randint(0, 60000), unit="ms")
    for _ in range(5)
]

# Generate random altitudes
altitudes = np.random.uniform(100, 1000, size=5)  # Random altitudes between 100 and 1000

# Create the DataFrame
df = pd.DataFrame({"timestamp": timestamps, "altitude": altitudes}).sort_values("timestamp")

A method that works but is terribly inefficient is the following:

df_interpolated = (
    df.set_index("timestamp").resample("1ms").interpolate().resample("1s").interpolate()
)

Plotting the results shows that it works:

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=df.timestamp, y=df.altitude, mode="lines", name="Original"))

fig.add_trace(
    go.Scatter(
        x=df_interpolated.index,
        y=df_interpolated.altitude,
        mode="markers",
        name="Interpolated 1ms and 1s",
    )
)
fig.show()

results

Any idea on how to do this better?


Solution

  • Code

    Extract the time in 1s increments within the bin, and add it to the index. reindex with the added index, then interpolate with argument time. And func asfreq can leave only 1s increments.

    df1 = df.set_index("timestamp")
    times = df1.index.to_period('1s').to_timestamp('1s')
    idx = pd.date_range(times.min(), times.max(), freq='1s')
    df_interpolated = (
        df1.reindex(df1.index.union(idx))
           .interpolate('time')
           .asfreq('1s')
    )
    

    I think this could be more efficient.(Because it don't resample in ms)

    Check with Example

    import pandas as pd
    import numpy as np
    
    np.random.seed(0) # need seed if random sample
    base_timestamp = pd.Timestamp.now().floor("min")
    timestamps = [
        base_timestamp + pd.to_timedelta(np.random.randint(0, 60000), unit="ms")
        for _ in range(5)
    ]
    
    altitudes = np.random.uniform(100, 1000, size=5)
    
    # Create the DataFrame
    df = (pd.DataFrame({"timestamp": timestamps, "altitude": altitudes})
            .sort_values("timestamp").reset_index(drop=True)
    )
    

    df:

                    timestamp    altitude
    0 2024-06-25 21:31:02.732  872.151056
    1 2024-06-25 21:31:42.613  661.207327
    2 2024-06-25 21:31:43.567  862.526565
    3 2024-06-25 21:31:45.891  367.781146
    4 2024-06-25 21:31:52.416  445.943537
    

    graph (run your plot code)

    enter image description here