This is a scatter plot which illustrates the relationship between two variables:
It is obvious that this is a non-linear relationship. Both variables are time-series - the points are different observations.
How can I fit a curve (in python) that would approximate it?
EDIT:
Note that this is not a 2D relationship (as JamesPhilips pointed out below).
As I mentioned, these are two time series. I guess the correct thing to do would be to go for a 3D fit (including the time as a third dimension). So the function would take two inputs (x and time). How to do that?
EDIT2:
I'm attaching a sample of that dataset here
EDIT3:
I am fortunate to have received two high quality answers by norok2 and JamesPhilips (many thanks to both of them!) and I will be exploring these. However, my impression is that none of the ideas proposed so far is making significant use of the fact that these are time series. My intuition is that there is some signal there (I know, not having the time stamps is making things complicated). So I will keep the question open for a while in case someone wants to chip in some other ideas.
Also, it seems that the dataset I put a link to was sorted not by using the original index (my bad!) - I am putting a link to the correctly sorted dataset here.
According to the question, the two columns x
and y
are two timeseries, i.e. x(t)
and y(t)
. The t
parameter is represented by the index
First, let's load the data:
import io
import requests
import numpy as np
import scipy as sp
import matplotlib as mpl
import scipy.interpolate
import scipy.ndimage
import matplotlib.pyplot as plt
file_id = '1q4zY7B-BwG8bmbQJT3QvRt6B2MD4k0a0'
url = requests.get('https://drive.google.com/uc?export=download&id=' + file_id)
csv_file = io.StringIO(url.text)
data = np.loadtxt(csv_file, delimiter=',')
x = data[:, 0]
y = data[:, 1]
t = np.arange(len(x))
Now, y(x)
may not, in general, be well defined. A more useful representation of the data is obtained by plotting x(t)
and y(t)
(perhaps along y(x)
):
fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
ax[0, 0].scatter(t, x, color='k', s=8.0)
ax[0, 1].scatter(t, y, color='k', s=8.0)
ax[0, 2].scatter(x, y, color='k', s=8.0)
ax[0, 0].plot(t, x, color='b')
ax[0, 1].plot(t, y, color='b')
ax[0, 2].plot(x, y, color='b')
Note that while the y(x)
visualization gets two clustering, i.e. a stretched spiral and a straight line, without further information, this observation should not be over-interpreted.
Now, without a model to fit, what we could do is to have an interpolant numerical function for x(t)
and y(t)
.
If x(t)
and y(t)
are assumed to be noiseless, then a simple 1D interpolator, as provided by scipy.interpolate.interp1d()
:
func_x_t = sp.interpolate.interp1d(t, x, kind='cubic', assume_sorted=True)
func_y_t = sp.interpolate.interp1d(t, y, kind='cubic', assume_sorted=True)
x_interp = func_x_t(t)
y_interp = func_y_t(t)
fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
ax[0, 0].scatter(t, x, color='k', s=8.0)
ax[0, 1].scatter(t, y, color='k', s=8.0)
ax[0, 2].scatter(x, y, color='k', s=8.0)
ax[0, 0].plot(t, x_interp, color='r')
ax[0, 1].plot(t, y_interp, color='r')
ax[0, 2].plot(x_interp, y_interp, color='r')
Note that the red line is now generated by the interpolator. SciPy offers a variety of different interpolator which may be worth exploring.
If x(t)
and y(t)
are noisy measurements, a more useful interpolator may be obtained as above, but using a de-noised x(t)
and y(t)
. Here, I assume that the high-frequency oscillations observed are driven by noise (both in x(t)
and in y(t)
), and a simple but effective de-noising approach would be Gaussian filtering (as provided by scipy.ndimage.gaussian_filter1d()
:
smooth_x = sp.ndimage.gaussian_filter1d(x, 12.0, mode='nearest')
smooth_y = sp.ndimage.gaussian_filter1d(y, 12.0, mode='nearest')
func_x_t = sp.interpolate.interp1d(t, smooth_x, kind='cubic', assume_sorted=True)
func_y_t = sp.interpolate.interp1d(t, smooth_y, kind='cubic', assume_sorted=True)
x_smooth_interp = func_x_t(t)
y_smooth_interp = func_y_t(t)
fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
ax[0, 0].scatter(t, x, color='k', s=8.0)
ax[0, 1].scatter(t, y, color='k', s=8.0)
ax[0, 2].scatter(x, y, color='k', s=8.0)
ax[0, 0].plot(t, smooth_x, color='g')
ax[0, 1].plot(t, smooth_y, color='g')
ax[0, 2].plot(smooth_x, smooth_y, color='g')
ax[0, 0].plot(t, x_smooth_interp, color='r')
ax[0, 1].plot(t, y_smooth_interp, color='r')
ax[0, 2].plot(x_smooth_interp, y_smooth_interp, color='r')
Note that the *_smooth
and *_smooth_interp
gets plot on top of each other.
Another approach would be to use artificial neural network, e.g. from scikit-learn:
import sklearn as skl
import sklearn.neural_network as skl_nn
import sklearn.preprocessing
x_train = t.reshape(-1, 1)
y_train = data
reg = skl_nn.MLPRegressor(
solver='adam', hidden_layer_sizes=(24, 8), activation='tanh',
learning_rate='adaptive', max_iter=1024)
reg.fit(x_train, y_train)
y_predict = reg.predict(x_train)
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
ax[0, 0].scatter(t, x, color='k', s=8.0)
ax[0, 1].scatter(t, y, color='k', s=8.0)
ax[0, 2].scatter(x, y, color='k', s=8.0)
ax[0, 0].plot(t, x, color='b')
ax[0, 1].plot(t, y, color='b')
ax[0, 2].plot(x, y, color='b')
ax[0, 0].plot(t, x_ann, color='r')
ax[0, 1].plot(t, y_ann, color='r')
ax[0, 2].plot(x_ann, y_ann, color='r')
This gets you to an interpolator without the need to explicitly de-noise your target signal, which may be more or less desireable, depending on the application.
x(t')
and y(t')
with t'(t)
(reordering)If we relax the requirement that x(t)
and y(t)
are from a timeseries, we could investigate x(t')
and y(t')
for a given t'(t)
transformation.
A possible transformation that results a somewhat interesting is obtained by sorting the CSV data by y
(the timeseries are sorted by x
):
data = data[data[:, 1].argsort()]
x = data[:, 0]
y = data[:, 1]
with this transformation, we obtain the following interpolator for the ANN approach:
and this for the smoothed x(t')
and y(t')
:
Possibly, there are more effective reordering, but it may not be simple to formulate them. A relatively simple formulation may involve clustering, but I believe this answer is already quite long.
(full code available here)