Tags: python, pandas, tensorflow, keras, google-trends

Using simple models on Google Trends data to predict something doesn't work as expected


I'm using Google Trends data to develop a simple model that predicts the future trend for a set of search terms. I took inspiration from this blog post and tried to do basically the same thing with other search terms, looking for the best models for this kind of task.


The problem is that the predictions for other search terms are completely wrong. I only used terms with a regular pattern, sometimes less regular than the pattern in the blog's example. Here is my adapted code:

import numpy as np
import pandas as pd
from datetime import date
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import InputLayer, Reshape, Conv1D, MaxPool1D, Flatten, Dense, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()



def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    X, y = [], []
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    for _ in range(len(target)):
        if end_y < len(target):
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    X = np.array(X)
    y = np.array(y)
    return X, y
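
# Note: with window_X=12 and window_y=6, prepare_data maps a series of length N
# to X of shape (N - 18, 12) and y of shape (N - 18, 6): each sample pairs a
# window of 12 past months with the 6 months that follow it.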


def fit_model(model_type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))

    if model_type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))

    if model_type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())

    if model_type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))

    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))

    # Compile
    model.compile(optimizer='adam', loss='mse')

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]

    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs, callbacks=callbacks, verbose=2)

    # Load best model
    model.load_weights('model.h5')

    # Return
    return model


# Define windows
window_X = 12
window_y = 6

# Load data
data = pd.read_csv('data/holocaust-world.csv', sep=',')
data = data.set_index(keys=pd.to_datetime(data['month']), drop=True).drop('month', axis=1)

# Scale data to [0, 1] (Google Trends values range from 0 to 100)
data['y'] = data['y'] / 100.

# Prepare tensors
X, y = prepare_data(target=data['y'].values, window_X=window_X, window_y=window_y)

# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

# Train models
models = ['mlp', 'cnn', 'lstm']

# Test input: the last window_X months of the series
X_test = data['y'].values[-window_X:].reshape(1, -1)

# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*6, 'cnn': [np.nan]*6, 'lstm': [np.nan]*6})
preds = preds.set_index(pd.date_range(start=date(2018, 11, 1), end=date(2019, 4, 1), freq='MS'))

# Fit models and plot
for mod in models:

    # Train models
    model = fit_model(model_type=mod, X_train=X_train, y_train=y_train, X_test=X_valid, y_test=y_valid, batch_size=16, epochs=1000)

    # Predict
    p = model.predict(x=X_test)

    # Fill
    preds[mod] = p[0]

# Plot the entire timeline, including the predicted segment
idx = pd.date_range(start=date(2004, 1, 1), end=date(2019, 4, 1), freq='MS')
data = data.reindex(idx)
plt.plot(data['y'], label='Google')

# Plot
plt.plot(preds['mlp'], label='MLP')
plt.plot(preds['cnn'], label='CNN')
plt.plot(preds['lstm'], label='LSTM')
plt.legend()
plt.show()

Here I tried evaluating interest in the topic of the Holocaust, which is also periodic (peak in January; you can grab the CSV from the Google Trends site). Here are the results: [results plot]


So the questions are:

  • How can I adapt this model to use every month available (at the time of writing, up to August 2019)? When I try to do that, I get weird behavior, so for now I manually deleted everything after October 2018 from the CSV.

  • How can I improve these simple models so they actually give useful and meaningful results? I wonder why the example in the blog post works perfectly while my attempts fail miserably.

Thanks in advance!


Solution

  • Increase the number of predicted months you test against and you should get better results:

    window_y = 49
    ...
    # Predictions
    preds = pd.DataFrame({'mlp': [np.nan]*49, 'cnn': [np.nan]*49, 'lstm': [np.nan]*49})
    preds = preds.set_index(pd.date_range(start=date(2015, 1, 1), end=date(2019, 1, 1), freq='MS'))
    

    Playing with the training/validation split will also help:

    # Training and test
    train = 50
    X_train = X[:train]
    y_train = y[:train]
    X_valid = X[train:]
    y_valid = y[train:]
    


    However, this particular trend is periodic but is also affected by other factors. Prophet can help you deal with this kind of trend better than a simple machine learning model.
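
    For reference, here is a minimal sketch of what a Prophet forecast on the same monthly CSV could look like. The file path and column names are taken from the question; the Prophet settings are illustrative assumptions, not part of the original answer:

    import pandas as pd
    from matplotlib import pyplot as plt
    from prophet import Prophet  # on older installs: from fbprophet import Prophet

    # Prophet expects a DataFrame with two columns: 'ds' (dates) and 'y' (values)
    df = pd.read_csv('data/holocaust-world.csv', sep=',')
    df = df.rename(columns={'month': 'ds'})
    df['ds'] = pd.to_datetime(df['ds'])

    # Yearly seasonality should capture the January peak; weekly/daily
    # components make no sense for monthly data, so they are disabled
    m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
    m.fit(df)

    # Forecast the next 6 months at monthly frequency
    future = m.make_future_dataframe(periods=6, freq='MS')
    forecast = m.predict(future)
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(6))

    m.plot(forecast)
    plt.show()

    The explicit yearly seasonality component is what models the January peak here, and the trend component absorbs the slower drift that a small fixed-window network struggles with.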