I want to convert text to sequences using Keras with Indonesian text, but the Keras tokenizer only detects known words.
How can I add known words in Keras? Or is there another way for me to convert text to sequences?
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=n_most_common_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(concated['TITLE'].values)
txt = ["bisnis di indonesia sangat maju"]
seq = list(tokenizer.texts_to_sequences_generator(txt))
The "seq" variable ends up as an empty array when I use Indonesian text, but it works perfectly when I use English words. How can I use Keras for different languages? Or is there any way to add known words to Keras?
Thanks
Keras doesn't know any languages or words. You create the vocabulary using the fit_on_texts or fit_on_sequences method.
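I guess you are fitting the tokenizer on some English text (i.e., concated['TITLE'].values). As a result, the internal vocabulary contains only English words and no Indonesian words. This explains why seq is empty when txt contains only non-English words: words missing from the vocabulary are skipped during conversion. Fitting the tokenizer on Indonesian text puts those words into the vocabulary.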
Also, you can take a look at the source code of the Tokenizer class.
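To make the behaviour above concrete, here is a minimal pure-Python sketch of a word-index tokenizer. It is a simplified stand-in, not the Keras implementation: the TinyTokenizer class and the sample sentences are made up for illustration, and it ignores num_words and filters. It only shows that the vocabulary comes entirely from the fitted texts, so unseen words are dropped unless an out-of-vocabulary token is reserved (the real Keras Tokenizer accepts an oov_token argument for that purpose).

```python
from collections import Counter


class TinyTokenizer:
    """Hypothetical, simplified stand-in for a word-index tokenizer (not Keras code)."""

    def __init__(self, oov_token=None):
        self.oov_token = oov_token
        self.word_index = {}

    def fit_on_texts(self, texts):
        # Count lowercased, whitespace-separated words across all texts.
        counts = Counter(word for text in texts for word in text.lower().split())
        words = [word for word, _ in counts.most_common()]
        # Reserve index 1 for the out-of-vocabulary token, if one was requested.
        if self.oov_token is not None:
            words.insert(0, self.oov_token)
        # Indices start at 1; more frequent words get smaller indices.
        self.word_index = {word: i for i, word in enumerate(words, start=1)}

    def texts_to_sequences(self, texts):
        oov_id = self.word_index.get(self.oov_token)
        sequences = []
        for text in texts:
            ids = []
            for word in text.lower().split():
                idx = self.word_index.get(word, oov_id)
                if idx is not None:  # unknown word and no OOV token: skip it
                    ids.append(idx)
            sequences.append(ids)
        return sequences


# Made-up sample data, for illustration only.
english_titles = ["business news today", "business is growing"]
indonesian_titles = ["bisnis di indonesia sangat maju", "ekonomi indonesia tumbuh"]
txt = ["bisnis di indonesia sangat maju"]

# Fitted on English only: every Indonesian word is unknown and gets dropped.
tok_en = TinyTokenizer()
tok_en.fit_on_texts(english_titles)
seq_en = tok_en.texts_to_sequences(txt)

# Fitted on Indonesian text: every word is in the vocabulary.
tok_id = TinyTokenizer()
tok_id.fit_on_texts(indonesian_titles)
seq_id = tok_id.texts_to_sequences(txt)

# Fitted on English but with an OOV token: unknown words map to its index.
tok_oov = TinyTokenizer(oov_token="<OOV>")
tok_oov.fit_on_texts(english_titles)
seq_oov = tok_oov.texts_to_sequences(txt)

print(seq_en, seq_id, seq_oov)
```

In practice the same idea applies to the question's code: call fit_on_texts on the Indonesian titles you want to encode, and consider an oov_token if you need a placeholder for words that never appeared in the fitted texts.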