python pandas scipy curve-fitting scipy-optimize

Using SciPy optimize to fit one curve to many data series

I have many days of hourly mean temperatures. These have all been normalised so that the minimum and maximum daily temperature equal 0 and 1 respectively, with each of the hourly mean temperatures being scaled to this range.

df
[out]:
    Date_time   00:00   01:00   02:00   03:00   04:00   05:00   06:00   07:00   08:00   09:00   10:00   11:00   12:00   13:00   14:00   15:00   16:00   17:00   18:00   19:00   20:00   21:00   22:00   23:00   Max     Min
0   2019-02-03  0.0875  0.0868  0.0440  0.0120  0.0108  0.0461  0.0961  0.2787  0.4908  0.6854  0.7379  0.8615  0.9284  0.8488  0.7711  0.2200  0.1617  0.2376  0.2211  0.1782  0.1700  0.1736  0.1174  0.1389  25.7    17.9
1   2019-03-07  0.0432  0.0432  0.0126  0.0011  0.0054  0.0065  0.0121  0.0592  0.2799  0.4322  0.7461  0.7475  0.8130  0.8599  0.6245  0.4815  0.4641  0.3502  0.2126  0.1878  0.1988  0.2114  0.2168  0.2292  21.6    17.9
2   2019-04-21  0.0651  0.0507  0.0324  0.0198  0.0703  0.0454  0.0457  0.2019  0.3700  0.5393  0.6593  0.7556  0.8682  0.9374  0.9593  0.9110  0.8721  0.6058  0.4426  0.3788  0.3447  0.3136  0.2564  0.1414  29.3    15.1

I wish to use SciPy optimize to fit a curve to the hourly pattern of temperature. I can't fit a curve to the full time series dataset because:

I want the fit to be generic, and applicable to any given day
The data is discontinuous, meaning that there would be gaps in the dataset that would cause poor fit

When plotting the full dataset up, this is the result:

It is to this that I'd like to fit a curve to, but all the examples on SciPy use datasets where there is only one value given for any time interval.

One option could be to calculate the mean for each hour and apply SciPy optimize to that. Is there another function I could use to apply a fit to the full dataset in Python?

Solution

As others mentioned there are many ways to fit the data. and it depends on your assumptions and what you are trying to do. Here is an example of doing a simple polynomial fit to your sample data.

First load and massage the data

from io import StringIO
from datetime import datetime as dt
import pandas as pd
import numpy as np
data = StringIO('''
    Date_time   00:00   01:00   02:00   03:00   04:00   05:00   06:00   07:00   08:00   09:00   10:00   11:00   12:00   13:00   14:00   15:00   16:00   17:00   18:00   19:00   20:00   21:00   22:00   23:00   Max     Min
0   2019-02-03  0.0875  0.0868  0.0440  0.0120  0.0108  0.0461  0.0961  0.2787  0.4908  0.6854  0.7379  0.8615  0.9284  0.8488  0.7711  0.2200  0.1617  0.2376  0.2211  0.1782  0.1700  0.1736  0.1174  0.1389  25.7    17.9
1   2019-03-07  0.0432  0.0432  0.0126  0.0011  0.0054  0.0065  0.0121  0.0592  0.2799  0.4322  0.7461  0.7475  0.8130  0.8599  0.6245  0.4815  0.4641  0.3502  0.2126  0.1878  0.1988  0.2114  0.2168  0.2292  21.6    17.9
2   2019-04-21  0.0651  0.0507  0.0324  0.0198  0.0703  0.0454  0.0457  0.2019  0.3700  0.5393  0.6593  0.7556  0.8682  0.9374  0.9593  0.9110  0.8721  0.6058  0.4426  0.3788  0.3447  0.3136  0.2564  0.1414  29.3    15.1
''')
df = pd.read_csv(data, sep = '\s+')
df2 = df.copy()
del df2['Date_time']
del df2['Max']
del df2['Min']

Extract the underlying hours and observatiobs, put into flattened arrays

hours = [dt.strptime(ts, '%H:%M').hour for ts in df2.columns]
raw_data = df2.values.flatten()
hours_rep = np.tile(hours, df2.values.shape[0])

Fit a polynomial of degree deg (set below to 6). This will do a best-fit as input data has multiple observations for the same hour

deg = 6
p = np.polyfit(hours_rep, raw_data, deg = deg)
fit_data = np.polyval(p, hours)

Plot the results

import matplotlib.pyplot as plt
plt.plot(hours, df2.values.T, '.', label = 'obs')
plt.plot(hours, fit_data, 'o-', label = 'fit')
plt.legend(loc = 'best')
plt.show()

This is how it looks like