I have a Pandas DataFrame with timestamps that have millisecond accuracy and corresponding altitude values. I want to resample and interpolate this data efficiently. Here is a simple example:
import pandas as pd
import numpy as np
# Generate 5 random timestamps within the same minute with millisecond accuracy
base_timestamp = pd.Timestamp.now().floor("min") # Get the current time, floored to the nearest minute
timestamps = [
base_timestamp + pd.to_timedelta(np.random.randint(0, 60000), unit="ms")
for _ in range(5)
]
# Generate random altitudes
altitudes = np.random.uniform(100, 1000, size=5) # Random altitudes between 100 and 1000
# Create the DataFrame
df = pd.DataFrame({"timestamp": timestamps, "altitude": altitudes}).sort_values("timestamp")
A method that works but is terribly inefficient is the following:
df_interpolated = (
df.set_index("timestamp").resample("1ms").interpolate().resample("1s").interpolate()
)
Plotting the results shows that it works:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.timestamp, y=df.altitude, mode="lines", name="Original"))
fig.add_trace(
go.Scatter(
x=df_interpolated.index,
y=df_interpolated.altitude,
mode="markers",
name="Interpolated 1ms and 1s",
)
)
fig.show()
Any idea on how to do this better?
Code
Extract the time in 1s increments within the bin, and add it to the index. reindex
with the added index, then interpolate
with argument time
. And func asfreq
can leave only 1s increments.
df1 = df.set_index("timestamp")
times = df1.index.to_period('1s').to_timestamp('1s')
idx = pd.date_range(times.min(), times.max(), freq='1s')
df_interpolated = (
df1.reindex(df1.index.union(idx))
.interpolate('time')
.asfreq('1s')
)
I think this could be more efficient.(Because it don't resample in ms
)
Check with Example
import pandas as pd
import numpy as np
np.random.seed(0) # need seed if random sample
base_timestamp = pd.Timestamp.now().floor("min")
timestamps = [
base_timestamp + pd.to_timedelta(np.random.randint(0, 60000), unit="ms")
for _ in range(5)
]
altitudes = np.random.uniform(100, 1000, size=5)
# Create the DataFrame
df = (pd.DataFrame({"timestamp": timestamps, "altitude": altitudes})
.sort_values("timestamp").reset_index(drop=True)
)
df:
timestamp altitude
0 2024-06-25 21:31:02.732 872.151056
1 2024-06-25 21:31:42.613 661.207327
2 2024-06-25 21:31:43.567 862.526565
3 2024-06-25 21:31:45.891 367.781146
4 2024-06-25 21:31:52.416 445.943537
graph (run your plot code)