python, supervised-learning

How to create supervised learning dataset from time series data in python


I have time series data, but there are many values for each day, as below:

[[day1, x1],
 [day1, x2],
 [day1, x3],
 [day2, x4],
 [day2, x5],
 [day3, x6],
 [day4, x7],
 [day4, x8],
 [day4, x9],
  ......]

and so on. I want to turn this time series into a supervised learning dataset using Python. My expected dataset looks like this:

[[[all values in day1], [all values in day2]],
 [[all values in day2], [all values in day3]],
 [[all values in day3], [all values in day4]],
 .....]

Does anyone have experience with this problem in Python? Could you give me an idea?


Solution

  • I'm going to make some sample data to work with so we can see how the algorithm behaves.

    time_series_data = [[1, 0.5],
                        [1, 0.6],
                        [2, 0.3],
                        [3, 0.7],
                        [3, 0.4],
                        [4, 0.1]]
    

    With that out of the way, we can go on to split this list up by the day transitions.

    import itertools as it

    # Start with the first row's value, then walk the remaining rows,
    # opening a new group whenever the day changes.
    res = [[time_series_data[0][1]]]

    for i, (day, val) in enumerate(it.islice(time_series_data, 1, None)):
        # time_series_data[i] is the row *before* the current one,
        # since the slice starts at index 1 while enumerate starts at 0.
        if time_series_data[i][0] != day:
            res.append([val])
        else:
            res[-1].append(val)
    

    Examining the output, we see that all it did was group by day.

    >>> res
    [[0.5, 0.6], [0.3], [0.7, 0.4], [0.1]]
    

    Then to actually turn that into a supervised learning problem we need input/output pairs.

    data = [res[i:i+2] for i in range(0, len(res)-1)]
    

    This has the desired output:

    >>> data
    [[[0.5, 0.6], [0.3]],
     [[0.3],      [0.7, 0.4]],
     [[0.7, 0.4], [0.1]]]
    
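    As a side note, the same pairing of each group with its successor can be written with `zip`, which may read more clearly than index arithmetic:

```python
res = [[0.5, 0.6], [0.3], [0.7, 0.4], [0.1]]

# zip(res, res[1:]) pairs each group with the one that follows it.
data = [list(pair) for pair in zip(res, res[1:])]
```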

    One interesting thing about grouping by day is that we no longer necessarily get lists of the same length. Many supervised learning algorithms rely on an idea of vectors of features, where length is preserved in the entire data set. To apply them to more exotic objects, you have to first figure out how to extract fixed-length feature vectors from those objects (where object here refers to, e.g., [0.5, 0.6]).
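    One common way to do that extraction, shown here purely as an illustration (the choice of statistics is an assumption, not from the question), is to map each variable-length group to a fixed set of summary statistics:

```python
def day_features(values):
    # Map a variable-length list of readings to a fixed-length feature
    # vector. The statistics chosen here (count, mean, min, max) are
    # illustrative; any fixed-length summary would serve the same purpose.
    return [len(values), sum(values) / len(values), min(values), max(values)]

groups = [[0.5, 0.6], [0.3], [0.7, 0.4], [0.1]]
features = [day_features(g) for g in groups]
```

    Every row of `features` now has length 4 regardless of how many readings the day had, so standard vector-based learners can consume it.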

    If you have the same number of data points each day this won't be a problem, but if the number of data points differs AND if the days run together (i.e., the end of your day1 data corresponds to the beginning of your day2 data, or at least something close in time so there isn't a big continuity gap), then you might be more interested in something closer to a sliding window across ALL values rather than those grouped by day. Consider the following:

    vals = [val for day, val in time_series_data]
    

    As usual, we examine the output to figure out what's going on here.

    >>> vals
    [0.5, 0.6, 0.3, 0.7, 0.4, 0.1]
    

    You'll notice that we got rid of the day information completely. With that done though, we can easily construct a form of input/output pairs.

    input_length = 2
    output_length = 1
    
    X = [vals[i:i+input_length] for i in range(0, len(vals)-input_length-output_length+1)]
    y = [vals[i:i+output_length] for i in range(input_length, len(vals)-output_length+1)]
    

    Now examine the input (which I'm calling X) and the output (which I'm calling y).

    >>> X
    [[0.5, 0.6],
     [0.6, 0.3],
     [0.3, 0.7],
     [0.7, 0.4]]
    
    >>> y
    [[0.3],
     [0.7],
     [0.4],
     [0.1]]
    

    You'll see that there are exactly as many lists in X as there are in y (since these are input/output pairs), and just as importantly every list in X is the same length. Similarly, every list in y is the same length. This kind of problem is much better suited for the bulk of existing machine learning algorithms.
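    If you find yourself varying the window sizes, the two comprehensions generalize naturally into a small helper function. This is a hypothetical refactoring of the code above, with the function name chosen for illustration:

```python
def sliding_window_pairs(vals, input_length, output_length):
    # Slide a window across the series: each input is `input_length`
    # consecutive values, and its output is the `output_length` values
    # that immediately follow.
    X, y = [], []
    for i in range(len(vals) - input_length - output_length + 1):
        X.append(vals[i:i + input_length])
        y.append(vals[i + input_length:i + input_length + output_length])
    return X, y

vals = [0.5, 0.6, 0.3, 0.7, 0.4, 0.1]
X, y = sliding_window_pairs(vals, input_length=2, output_length=1)
```

    With `input_length=2` and `output_length=1` this reproduces the X and y shown above.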

    That said, if you have a discontinuity in your data, say from day1 ending at 5:00PM and day2 beginning at 7:00AM the next day, this approach hides the location of that discontinuity in the feature vectors. It may not be an issue though. Depending on what you're doing and what kind of data you have, hopefully this is enough to get started. Have fun, and welcome to machine learning.