machine-learning, signal-processing, time-series, classification, regression

How to use SGD for time series analysis


Is it possible to use stochastic gradient descent for time-series analysis?

My initial idea, given a series of (t, v) pairs where I want an SGD regressor to predict the v associated with t+1, would be to convert the date/time into an integer value, and train the regressor on this list using the hinge loss function. Is this feasible?

Edit: Here is example code using the SGD implementation in scikit-learn. However, it fails to properly predict a simple linear time series. All it seems to do is compute the average of the training y-values and use that as its prediction for the test y-values. Is SGD simply unsuitable for time-series analysis, or am I formulating this incorrectly?

from datetime import date
from sklearn.linear_model import SGDRegressor

# Build data: day ordinals as X, 1..12 (training) and 13..24 (testing) as y.
i = 0
training = []
for _ in range(12):
    i += 1
    training.append([[date(2012, 1, i).toordinal()], i])
testing = []
for _ in range(12):
    i += 1
    testing.append([[date(2012, 1, i).toordinal()], i])

clf = SGDRegressor(loss='huber')

print('Training...')
for epoch in range(20):
    print(epoch)
    clf.partial_fit(X=[X for X, _ in training], y=[y for _, y in training])

print('Testing...')
for X, y in testing:
    p = clf.predict([X])  # predict expects a 2D array of samples
    print(y, p[0], abs(p[0] - y))

Solution

  • SGDRegressor in scikit-learn is numerically unstable on unscaled input features. For good results it is highly recommended that you scale the input variables first.

    from datetime import date
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    # Build data: day ordinals as X, 1..12 (training) and 13..24 (testing) as y.
    s = date(2010, 1, 1).toordinal()
    training = [[[s + i], i] for i in range(1, 13)]
    testing = [[[s + i], i] for i in range(13, 25)]

    # Scale the inputs: raw ordinals (~734000) make SGD numerically unstable.
    scaler = StandardScaler()
    X_train = scaler.fit_transform([X for X, _ in training])
    

    After fitting the scaler on the training data, you must apply the same transformation to the test inputs before predicting.

    clf = SGDRegressor()
    clf.fit(X=X_train, y=[y for _,y in training])
            
    print(clf.intercept_, clf.coef_)
    
    print('Testing...')
    for X,y in testing:
        p = clf.predict(scaler.transform([X]))
        print(X[0],y,p[0],abs(p[0]-y))
    

    Here is the result:

    [6.31706122] [3.35332573]
    Testing...
    733786 13 12.631164799851827 0.3688352001481725
    733787 14 13.602565350686039 0.39743464931396133
    733788 15 14.573965901520248 0.42603409847975193
    733789 16 15.545366452354457 0.45463354764554254
    733790 17 16.51676700318867 0.48323299681133136
    733791 18 17.488167554022876 0.5118324459771237
    733792 19 18.459568104857084 0.5404318951429161
    733793 20 19.430968655691295 0.569031344308705
    733794 21 20.402369206525506 0.5976307934744938
    733795 22 21.373769757359714 0.6262302426402861
    733796 23 22.34517030819392 0.6548296918060785
    733797 24 23.316570859028133 0.6834291409718674
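
    A cleaner variant of the same fix (a sketch, not part of the original answer): wrap the scaler and the regressor in a scikit-learn Pipeline, so the test inputs are automatically scaled with the statistics learned from the training data and you cannot forget the transform step. The `max_iter`, `tol`, and `random_state` values below are illustrative choices.

    ```python
    from datetime import date
    from sklearn.linear_model import SGDRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Same toy series: day ordinals as X, 1..12 (training) and 13..24 (testing) as y.
    s = date(2010, 1, 1).toordinal()
    X_train = [[s + i] for i in range(1, 13)]
    y_train = list(range(1, 13))
    X_test = [[s + i] for i in range(13, 25)]
    y_test = list(range(13, 25))

    # The pipeline fits the scaler on the training data and reuses
    # those statistics when transforming inputs at predict time.
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
    )
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    for xi, yi, pi in zip(X_test, y_test, preds):
        print(xi[0], yi, round(pi, 2))
    ```

    With the scaling handled inside the pipeline, `model.predict` can be given raw ordinals directly, which avoids the manual `scaler.transform` call in the answer above.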