python, keras, nlp, deep-learning, tokenize

tokenizer.texts_to_sequences Keras Tokenizer gives almost all zeros


I am working on a text classification model, but I am having problems encoding documents with the tokenizer.

1) I started by fitting a tokenizer on my documents like this:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size, filters='')
tokenizer.fit_on_texts(df['data'])

2) Then I wanted to check whether my data was fitted correctly, so I converted it into sequences like this:

sequences = tokenizer.texts_to_sequences(df['data'])
data = pad_sequences(sequences, maxlen=num_words)
print(data) 

which gave me the expected output, i.e. words encoded as numbers:

[[ 9628  1743    29 ...   161    52   250]
 [14948     1    70 ...    31   108    78]
 [ 2207  1071   155 ... 37607 37608   215]
 ...
 [  145    74   947 ...     1    76    21]
 [   95 11045  1244 ...   693   693   144]
 [   11   133    61 ...    87    57    24]]

Now I wanted to convert a single text into a sequence using the same method, like this:

sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=num_words)
print(text)

but it gave me this weird output:

[[   0    0    0    0    0    0    0    0    0  394]
 [   0    0    0    0    0    0    0    0    0 3136]
 [   0    0    0    0    0    0    0    0    0 1383]
 [   0    0    0    0    0    0    0    0    0  507]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0 1114]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0  753]]

According to the Keras documentation:

texts_to_sequences(texts)

Arguments: texts: list of texts to turn to sequences.

Return: list of sequences (one per text input).

Isn't it supposed to encode each word to its corresponding number, and then pad the text to length 50 if it is shorter than 50? Where is the mistake?


Solution

  • I guess you should call it like this:

    sequences = tokenizer.texts_to_sequences(["physics is nice "])
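
    texts_to_sequences expects a list of texts. A plain Python string is also iterable, so when you pass "physics is nice " directly, Keras walks over it character by character and treats every single character as its own "text". That is why you get one row per character and mostly zeros. A minimal sketch of the difference, using a toy corpus (assumed here, not your df['data']):

    from keras.preprocessing.text import Tokenizer

    # Toy corpus just for illustration
    tokenizer = Tokenizer(num_words=20000, filters='')
    tokenizer.fit_on_texts(["physics is nice", "math is nice too"])

    # A bare string is iterated character by character: each character becomes
    # its own "text", so you get one (usually empty) sequence per character
    print(tokenizer.texts_to_sequences("physics is nice"))

    # A one-element list is treated as a single text: one sequence of word indices
    print(tokenizer.texts_to_sequences(["physics is nice"]))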