Tags: python-3.x, machine-learning, neural-network, keras-layer

When training samples increase, accuracy decreases


I am testing Keras's IMDB dataset. The question is: when I split into train and test using the top 2000 words, I get close to 87% accuracy:

(X_train, train_labels), (X_test, test_labels) = imdb.load_data(num_words=2000)

but when I bump the word count up to 5000 or 10000, the model performs poorly:

(X_train, train_labels), (X_test, test_labels) = imdb.load_data(num_words=10000)

Here is my model:

import numpy as np
from tensorflow.keras import models, layers

# Assumed preprocessing (omitted in the post): multi-hot encode each
# review so that the Dense input_shape=(10000,) is satisfied.
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0
    return results

X_train = vectorize_sequences(X_train)
X_test = vectorize_sequences(X_test)

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, train_labels, epochs=10, batch_size=64,
                    validation_data=(X_test, test_labels))

Can anyone explain why this is the case? I thought that with more samples (and less overfitting) I would get a better model.

Thanks for any advice.


Solution

  • Increasing num_words doesn't increase the number of samples; it enlarges the vocabulary, so each sample contains more distinct words (statistically). That pushes the input toward the curse of dimensionality, which hurts the model (see the sketch below the quote).

    From the docs:

    num_words: integer or None. Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data.
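
    To make this concrete, here is a small illustrative sketch (not from the original answer): the number of training reviews is fixed at 25,000 whatever num_words is, while a multi-hot input of width num_words scales the weight count of the question's first Dense(256) layer linearly:

    from tensorflow.keras.datasets import imdb

    for n in (2000, 10000):
        (x_train, y_train), _ = imdb.load_data(num_words=n)
        # The sample count is identical for every vocabulary size...
        print(n, "->", len(x_train), "training reviews")  # always 25000
        # ...but a multi-hot input of width n gives the first Dense(256)
        # layer a kernel of 256 * n weights: 512,000 at n=2000 versus
        # 2,560,000 at n=10000.
        print("first-layer weights:", 256 * n)

    With five times as many first-layer weights and the same 25,000 samples, the larger vocabulary makes over-fitting easier, not harder; dropout, L2 regularization, or a smaller first layer are the usual counter-measures.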