I am counting the number of unique words in a loaded text file. However, I have realized that when I use split() and the tensorflow_datasets tokenize() I get different results, even though I thought they achieve the same thing. Here is my code. Can somebody help me understand the difference between the two?
import tensorflow as tf
import tensorflow_datasets as tfds
tf.enable_eager_execution()
BUFFER_SIZE = 50000
TAKE_SIZE = 5000
BATCH_SIZE = 64
tokenizer = tfds.features.text.Tokenizer()
with open("news.2011.en.shuffled", "r") as f:
    data = f.read()
vocab = list(set(data.split()))  # gives the larger count
print(len(vocab))
tokenized_data = tokenizer.tokenize(data)
print(len(set(tokenized_data)))  # gives the smaller count
The split() function, when called with no arguments, splits the string only on whitespace characters.

The tokenize() method of tfds.features.text.Tokenizer() splits text on more than just whitespace; you can see the details in the library's GitHub source. By default no reserved_tokens are set, but alphanum_only defaults to True, so the tokenizer splits on every non-alphanumeric character and discards it. With split(), entries such as "word", "word," and "word." remain three distinct items in the set, whereas the tokenizer collapses them all to "word". That is why you get a smaller number of unique tokens.
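Here is a minimal sketch of the difference on a made-up sentence (the sentence is just an illustration, not a line from your file):

import tensorflow_datasets as tfds

sample = "Hello, world! It's a test -- hello world."

# str.split() keeps punctuation attached to the words,
# so "Hello," and "world!" each count as separate entries.
print(len(set(sample.split())))              # 8 unique whitespace-separated "words"

# The tokenizer splits on every non-alphanumeric character and drops it,
# e.g. "Hello," becomes "Hello" and "It's" becomes "It" and "s".
tokenizer = tfds.features.text.Tokenizer()   # alphanum_only=True by default
print(len(set(tokenizer.tokenize(sample))))  # 7 unique tokens

On a full corpus the same effect is amplified: every word that appears with trailing punctuation in the file adds an extra entry to the split() vocabulary but not to the tokenizer's.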