python pandas numpy scikit-learn ransac

How to detect contiguous spans in which data changes linearly within a DataFrame?


I'm trying to detect contiguous spans in which a variable changes linearly within data stored in a DataFrame. The data may contain many such spans. I started with RANSAC, following the scikit-learn example "Robust linear model estimation using RANSAC", but I'm having trouble adapting that example to my data.

Objective

Detect contiguous spans in which the relevant variable changes linearly. Each span to be detected consists of more than 20 consecutive data points. The desired output is the date range (start and end) of each span.

Toy example

In the toy example code below I generate random data and then overwrite two portions of it to create two contiguous spans that vary linearly. Then I try to fit a linear regression model to the data. The rest of the code I used (not shown here) is simply the remainder of the code on the "Robust linear model estimation using RANSAC" page. I know I would need to change that remaining code to reach my goal.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
import numpy as np

## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])

## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.data[date_range1_start:date_range1_end]
value_start1 = 10
values1 = range(value_start1,value_start1+len(line1))
df.loc[date_range1_start:date_range1_end, "data"] = list(values1)

## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.data[date_range2_start:date_range2_end]
values2 = range(value_start2,value_start2-len(line2),-1)
df.loc[date_range2_start:date_range2_end, "data"] = list(values2)

## 4. Plot data
df.plot()
plt.show()

## 5. Create arrays
X = np.asarray(df.index)
y = np.asarray(df.data.tolist())

## 6. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)

For this toy example, the desired output (which I haven't been able to produce yet) would be a DataFrame like this:

>>> out
              start               end
0  2016-08-10 08:15  2016-08-10 15:00
1  2016-08-10 17:00  2016-08-10 22:30

The generated graph looks like this: [figure: Data generated]

Error code

However, when step 6 is executed I get the following error:

ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In this example I would like to detect both contiguous spans in which the variable changes linearly (line1 and line2), but I haven't been able to adapt the RANSAC code example to do so.

Question

What should I modify in my code to get past this error? And is there a better approach for detecting the contiguous spans in which the variable changes linearly?


Solution

  • To get past the error and fit your linear regression, reshape X before fitting:

    lr.fit(X.reshape(-1,1), y)
    

    This is because sklearn expects a 2D array of shape (n_samples, n_features), in which each row holds the features of one sample.
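
As a minimal sketch of the reshaped fit: the version below first converts the timestamps to elapsed minutes. That conversion is my own assumption rather than something from the question. It keeps the feature plainly numeric instead of depending on how the estimator handles datetime64 values. The seed and the name elapsed_min are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data shaped like the question's: 100 random integers at 15-minute steps.
times = pd.date_range("2016-08-10", periods=100, freq="15min")
rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
df = pd.DataFrame({"data": rng.integers(0, 100, size=100)}, index=times)

# Assumption: express time as minutes elapsed since the first sample.
elapsed_min = (df.index - df.index[0]).total_seconds().to_numpy() / 60.0

# One feature per sample, so X must have shape (n_samples, 1).
X = elapsed_min.reshape(-1, 1)
y = df["data"].to_numpy()

lr = LinearRegression().fit(X, y)
print(X.shape, lr.coef_, lr.intercept_)
```

With a single feature, `reshape(-1, 1)` turns the 1D array of length 100 into a column of shape (100, 1), which is the layout the error message asks for.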

    After that, do you intend to fit models over many different ranges and check each one for linear change?

    If you are looking for exactly linear ranges (which can be detected by exact comparison for integers, for example, but not reliably for floats), then I would do something like:

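    # Difference between consecutive samples; it is constant inside a linear span
    dff = df.diff()
    # Start a new block whenever the difference changes from the previous row
    dff['block'] = (dff.data.shift(1) != dff.data).astype(int).cumsum()
    # Keep [first, last] timestamp of every block with more than 20 rows
    out = pd.DataFrame(list(dff.reset_index().groupby('block')['index'].apply(lambda x: \
        [x.min(), x.max()] if len(x) > 20 else None).dropna()))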
    

    Output would be:

    >>> out
                        0                   1
    0 2016-08-10 08:30:00 2016-08-10 15:00:00
    1 2016-08-10 17:15:00 2016-08-10 22:30:00
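
To see why the reported starts (08:30 and 17:15) fall one sample after the starts in the question's desired output (08:15 and 17:00), it may help to trace the same diff/cumsum steps on a tiny series. This is only an illustrative sketch. The series, the threshold of 3 rows, and the names spans and full_spans are made up for the walkthrough.

```python
import pandas as pd

# Tiny integer series: a rising run (1, 2, 3, 4) and a falling run (9, 7, 5, 3).
s = pd.Series([5, 1, 2, 3, 4, 9, 7, 5, 3])

d = s.diff()                                    # step between consecutive samples
block = (d.shift(1) != d).astype(int).cumsum()  # new label whenever the step changes

print(pd.DataFrame({"value": s, "diff": d, "block": block}))

# Positions (first, last) of each block with at least 3 rows
# (hypothetical threshold chosen for this toy series).
spans = [(int(g.index[0]), int(g.index[-1]))
         for _, g in s.groupby(block) if len(g) >= 3]

# The first point of a run carries the diff of the jump INTO the run, so it is
# labelled as a separate block. Stepping the start back by one row recovers it.
full_spans = [(start - 1, end) for start, end in spans]

print(spans)       # blocks as found by the diff/cumsum labelling
print(full_spans)  # the same spans including their first point
```

Here the blocks cover positions 2 to 4 and 6 to 8, while the actual linear runs cover positions 1 to 4 and 5 to 8. Applying the same one-row shift to the toy DataFrame would presumably move the starts back to 08:15 and 17:00.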
    

    If you need something similar for float data, I would use diff in the same way, but compare the differences against an acceptable error (a tolerance) instead of testing for exact equality. Please let me know if this is what you would like to achieve. You could also run RANSAC over different ranges, but RANSAC simply discards the points that are not well aligned, so a range containing an element that breaks the span would still be detected as a span. It all depends on what exactly you are interested in.
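
One way the tolerance idea for float data could look is sketched below. The function name linear_spans, its parameters tol and min_points, and the sample values are all hypothetical. Note that it only compares neighbouring steps, so a slowly bending curve whose step changes each stay under tol would still be reported as a span.

```python
import numpy as np
import pandas as pd

def linear_spans(s, tol, min_points):
    """Return (start, end) index labels of runs whose consecutive steps differ by at most tol."""
    d = s.diff()
    # Change in step between neighbouring samples; NaN at the first two positions.
    step_change = d.diff().abs()
    # A new block starts wherever the step changes by more than tol (or is undefined).
    block = (~(step_change <= tol)).cumsum()
    spans = []
    for _, g in s.groupby(block):
        pos = s.index.get_loc(g.index[0])
        # The sample just before the block also lies on the line, so count it too.
        if pos > 0 and len(g) + 1 >= min_points:
            spans.append((s.index[pos - 1], g.index[-1]))
    return spans

# Float ramp 1.0, 1.1, ..., 1.5 embedded in unrelated values. Its steps are
# only approximately equal because of floating-point rounding.
times = pd.date_range("2016-08-10", periods=12, freq="15min")
ramp = 1.0 + 0.1 * np.arange(6)
values = np.concatenate([[3.0, 8.0], ramp, [9.0, 4.0, 7.0, 2.0]])
s = pd.Series(values, index=times)

spans = linear_spans(s, tol=1e-9, min_points=5)
out = pd.DataFrame(spans, columns=["start", "end"])
print(out)
```

For the question's data, min_points would be 21 (more than 20 consecutive points), and tol would have to be chosen from the noise level you are willing to accept.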