I have a grid with evenly spaced points 20x20 cm apart and I want to interpolate values (scipy.interpolate.griddata(method="cubic")) stored in a Pandas DataFrame with columns "x", "y" and "value". I used a section of the DataFrame for testing, and after applying my script to the entire DataFrame, I noticed that the interpolation results are different. The difference is not big, but it is enough to cause differences in my overall results after further processing, which makes quick testing of parameters with a sample section almost impossible.
The sample section lies in the middle of the whole dataset's points, and both the grid and the parameters are the same for both runs (sample section and whole dataset). The points and values used for the sample section are identical to those in the whole dataset for that area. I expected the edges of the sample section to give different results after interpolation, but not the area in the middle. This kind of behavior also appears with other scipy.interpolate.griddata methods such as linear or nearest. Is there something to change or add to prevent this kind of behavior (e.g. a different library or approach)?
This is my code:
import numpy as np
import pandas as pd
import geopandas as gpd
from scipy import interpolate
# Import data
df_grid = pd.read_table("whole_dataset.csv", sep=",", header=0)
df = pd.read_table("sample_section.csv", sep=",", header=0) or df = pd.read_table("whole_dataset.csv", sep=",", header=0)
# Define the regular grid for interpolation
x_min = df_grid['x'].min()
x_max = df_grid['x'].max()
y_min = df_grid['y'].min()
y_max = df_grid['y'].max()
spacing = 0.2
x_grid = np.arange(x_min, x_max + spacing, spacing)
y_grid = np.arange(y_min, y_max + spacing, spacing)
# Create a mesh grid from the regular grid
X, Y = np.meshgrid(x_grid, y_grid)
# Interpolate values using 'cubic' method
points = df[['x', 'y']].values
values = df['value'].values
interpolated_values = interpolate.griddata(points, values, (X, Y), method='cubic')
# Convert the interpolated values to a DataFrame with 'x' and 'y' coordinates
data = pd.DataFrame({
    'x': X.ravel(),
    'y': Y.ravel(),
    'value': interpolated_values.ravel()
})
# Drop all Null values
data = data.dropna().reset_index(drop=True)
# Set the CRS on the GeoDataFrame so it is written out with the file
gdf = gpd.GeoDataFrame(data, geometry=gpd.points_from_xy(data['x'], data['y']), crs="EPSG:32633")
gdf.to_file("data.geojson", driver='GeoJSON')
And these are the differences in the results I get:
Here are the two CSV files, whole_dataset.csv and sample_section.csv (Google Drive). The sample_section was extracted using QGIS in order to take an area from the spatial center of the dataset, and then saved as CSV. The CRS of the coordinates is WGS 84 / UTM zone 33N - EPSG:32633.
The GIF consists of screenshots from QGIS, where the two layers of the interpolated grid are overlaid with the same symbolization (the QGIS style file (qml) is also in Google Drive).
I think there are two problems here: one is a theoretical problem with using griddata here, and the other is a numerical stability problem.
When SciPy interpolates this data, it starts by figuring out which points are relevant. It does this by building a triangulation of the input points, using an algorithm called Delaunay triangulation. To interpolate at a location, it then finds the triangle containing that location and interpolates between the values at its three vertices.
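To illustrate the mechanism, here is a minimal sketch (the points and query location are made up for illustration, not taken from the question's data) doing the same lookup by hand with scipy.spatial.Delaunay:
import numpy as np
from scipy.spatial import Delaunay
# Triangulate a handful of example points
pts = np.array([[0.0, 0.0], [2.0, 0.1], [1.1, 1.9], [3.0, 2.1], [0.2, 2.5]])
tri = Delaunay(pts)
# For a query location, find the triangle (simplex) that contains it...
query = np.array([[1.0, 1.0]])
simplex = tri.find_simplex(query)
# ...and the indices of the three data points whose values get blended
print(tri.simplices[simplex])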
griddata() doesn't let you directly inspect this triangulation, but you can call the class it's wrapping and get the information directly.
import matplotlib.pyplot as plt
import scipy.spatial
from scipy.interpolate import CloughTocher2DInterpolator
# CloughTocher2DInterpolator is what griddata(method='cubic') wraps; rescale=False is griddata's default
interpolator_sample = CloughTocher2DInterpolator(points, values, fill_value=np.nan, rescale=False)
interpolated_values = interpolator_sample((X, Y))
# Plot the underlying Delaunay triangulation, zoomed to a small window
scipy.spatial.delaunay_plot_2d(interpolator_sample.tri)
plt.xlim([516207.788138, 516232.388138])
plt.ylim([5.533969e+06, 5.533987e+06])
Plot: Delaunay triangulation of the sample section (zoomed in).
You can apply the same thing to the full data set:
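A sketch of that step, assuming df_grid holds the whole dataset as in the question's code:
points_full = df_grid[['x', 'y']].values
values_full = df_grid['value'].values
# Same inspection, this time triangulating every point in the whole dataset
interpolator_full = CloughTocher2DInterpolator(points_full, values_full, fill_value=np.nan, rescale=False)
scipy.spatial.delaunay_plot_2d(interpolator_full.tri)
plt.xlim([516207.788138, 516232.388138])  # zoom to the same window as above
plt.ylim([5.533969e+06, 5.533987e+06])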
There are three things I want to point out about this.
1. They're not the same triangles. (If you're unconvinced of this, I suggest using "Open image in new tab" on both plots and flipping between them.) When the sample-dataset interpolator and the full-dataset interpolator interpolate between data points, they are interpolating between different data points.
2. The triangulation contains many triangles which are too small. (They are not visible on this plot, but there are triangles with angles 0, 0, 180 connecting many of these points.) Delaunay triangulation is supposed to avoid creating triangles where any angle is near zero, yet the triangulations above contain many triangles which are very narrow.
3. The input points are on a rotated grid. The four corners of each grid cell lie on a common circle, so either diagonal of a cell gives an equally valid Delaunay triangulation - there are multiple possible triangulations of this data which are equally valid (see the sketch after this list).
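To make point 3 concrete, here is a small sketch (with made-up unit-square coordinates, not the question's data) showing that four co-circular points admit two equally valid triangulations, and that a perturbation far below the grid spacing can flip which one Qhull returns:
import numpy as np
from scipy.spatial import Delaunay
# Four co-circular points: both diagonals give a valid Delaunay triangulation
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(Delaunay(square).simplices)
# A nanometre-scale perturbation can flip the chosen diagonal
rng = np.random.default_rng(0)
print(Delaunay(square + 1e-9 * rng.random((4, 2))).simplices)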
This question has a great explanation of why: interpolate.griddata results inconsistent
I believe this argument still holds even when the grid is rotated/skewed like this. This would imply that there is no way to make griddata() deterministic when adding more data, and that any approach using it is doomed.
I experimented with a few ideas to improve this, while measuring success using the following plot.
# data_interp_sample / data_interp_full are the 'data' DataFrames produced by running the script on each CSV
data_merged = data_interp_sample.merge(data_interp_full, on=['x', 'y'], suffixes=['_sample', '_full'])
ax = data_merged.plot.scatter(x='value_sample', y='value_full', alpha=0.1)
ax.axline((0, 0), slope=1, color='r')  # y = x: identical results would fall on this line
This plot shows the interpolated values at all grid points common to the interpolation of the sample and the interpolation of the full dataset. The red line is y = x, and it represents the ideal: every point exactly the same in both interpolations.
Plot: interpolated values, sample vs. full dataset.
I tried several things to improve this, but one that seemed surprisingly helpful was subtracting the mean x and y coordinates before interpolating, and adding them back afterwards.
df_grid = pd.read_table("whole_dataset.csv", sep=",", header=0)
df = pd.read_table("sample_section.csv", sep=",", header=0)
# Center both datasets on the full dataset's mean coordinates
df_grid_mean_x = df_grid['x'].mean()
df_grid_mean_y = df_grid['y'].mean()
df['x'] = df['x'] - df_grid_mean_x
df_grid['x'] = df_grid['x'] - df_grid_mean_x
df['y'] = df['y'] - df_grid_mean_y
df_grid['y'] = df_grid['y'] - df_grid_mean_y
# ... snip unchanged code ...
# Convert the interpolated values to a DataFrame with 'x' and 'y' coordinates
data = pd.DataFrame({
    'x': X.ravel() + df_grid_mean_x,  # shift the coordinates back to the original CRS
    'y': Y.ravel() + df_grid_mean_y,
    'value': interpolated_values.ravel()
})
This shouldn't make a difference - a Delaunay triangulation should be exactly the same when the input coordinates are translated by any constant. However, it seems to produce a much more reasonable triangulation. I believe this may be caused by floating-point imprecision inside of Qhull, and subtracting the mean allows it to calculate triangle angles more precisely. This avoids creating the tiny triangles.
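A quick way to see why centering helps (my own back-of-the-envelope check, not output from Qhull): float64 resolution is an absolute spacing that grows with magnitude, so UTM-sized coordinates are represented far more coarsely than centered ones.
import numpy as np
# Absolute float64 resolution (ulp) at UTM-northing magnitude vs. near zero
print(np.spacing(5.533969e+06))  # ~9.3e-10 m
print(np.spacing(10.0))          # ~1.8e-15 m after centering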
As a result, the triangulation is much more similar between the sample and the full dataset, but it is still not fully consistent.
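If reproducibility matters more than matching griddata's output exactly, one triangulation-free alternative worth trying (a suggestion, not something I have validated on this dataset) is scipy.interpolate.RBFInterpolator with a fixed neighbor count: an interior query point then sees the same nearest neighbors whether or not distant points are present.
from scipy.interpolate import RBFInterpolator
# Fit radial basis functions on the 16 nearest data points per query location;
# no Delaunay triangulation is involved, so the co-circularity degeneracy disappears
rbf = RBFInterpolator(points, values, neighbors=16, kernel='thin_plate_spline')
interpolated_values = rbf(np.column_stack([X.ravel(), Y.ravel()])).reshape(X.shape)
Note that unlike griddata, RBFInterpolator extrapolates outside the convex hull of the data instead of returning NaN, so points outside the surveyed area would need to be masked separately.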