I found out that LinearSVC is among the classifiers TPOT can choose from. I have been using LinearSVC for my model and get a pretty decent score (0.95 from sklearn's score method).
import numpy as np
from sklearn import model_selection
from tpot import TPOTClassifier

# format_data and create_labels are helper functions defined elsewhere (not shown)
def process(stock):
    df = format_data(stock)
    df[['HSI Volume', 'HSI', stock]] = df[['HSI Volume', 'HSI', stock]].pct_change()
    # shift future value to current date
    df[stock+'_future'] = df[stock].shift(-1)
    df.replace([-np.inf, np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    df['class'] = list(map(create_labels, df[stock], df[stock+'_future']))
    X = np.array(df.drop(['class', stock+'_future'], axis=1))  # drop the label columns
    # X = preprocessing.scale(X)
    y = np.array(df['class'])
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    tpot = TPOTClassifier(generations=10, verbosity=2)
    fitting = tpot.fit(X_train, y_train)
    prediction = tpot.score(X_test, y_test)  # score on the held-out test set
    tpot.export('pipeline.py')
    return fitting, prediction
After ten generations, TPOT recommended GaussianNB, which scores about 0.77 with sklearn's score method.
Generation 1 - Current best internal CV score: 0.5322255571
Generation 2 - Current best internal CV score: 0.55453535828
Generation 3 - Current best internal CV score: 0.55453535828
Generation 4 - Current best internal CV score: 0.55453535828
Generation 5 - Current best internal CV score: 0.587469903893
Generation 6 - Current best internal CV score: 0.587469903893
Generation 7 - Current best internal CV score: 0.597194474469
Generation 8 - Current best internal CV score: 0.597194474469
Generation 9 - Current best internal CV score: 0.597194474469
Generation 10 - Current best internal CV score: 0.597194474469
Best pipeline: GaussianNB(RBFSampler(input_matrix, 0.22))
(None, 0.54637855142056824)
I am curious why LinearSVC scores higher, yet TPOT did not recommend it. Is it because the scoring mechanism is different and therefore leads to a different optimal classifier?
Thank you so much!
My guess is that TPOT is stuck in a local maximum. Changing the test size, running more generations, or scaling the data might help. Also, could you rerun TPOT and see whether you get the same results? (I expect not, since genetic optimization is non-deterministic due to mutation.)
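To check whether the gap comes from the scoring setup rather than from the models, it may help to evaluate both classifiers on the same split with the same metric. The following is a minimal sketch, not the asker's actual pipeline: the data is a synthetic stand-in generated with make_classification (an assumption), the 5-fold accuracy setting is assumed to mirror TPOT's default internal CV, and scaling is added because the answer suggests it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): replace with the X, y built in process().
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Both candidates get the same scaling step so the comparison is like for like.
candidates = {
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC(random_state=0)),
    "GaussianNB": make_pipeline(StandardScaler(), GaussianNB()),
}

results = {}
for name, model in candidates.items():
    # Mean cross-validated accuracy on the training set only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    # Accuracy on the held-out test set, after fitting on the full training set.
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "test": model.score(X_test, y_test),
    }
    print(f"{name}: CV accuracy={results[name]['cv_mean']:.3f}, "
          f"test accuracy={results[name]['test']:.3f}")
```

If LinearSVC still comes out clearly ahead under identical conditions, the local-maximum explanation becomes more plausible, and rerunning TPOT with more generations (and a fixed random_state, to make runs repeatable) would be the next thing to try.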