python, tensorflow, dataset, tokenize

Difference between split() and tokenize()


I am counting the number of unique words in a loaded text file. However, I have realized that when I use split() and the tensorflow_datasets tokenize() I get different results, even though I thought they achieved the same thing. Here is my code. Can somebody help me understand the difference between the two?

import tensorflow as tf
import tensorflow_datasets as tfds

tf.enable_eager_execution()

BUFFER_SIZE = 50000
TAKE_SIZE = 5000
BATCH_SIZE = 64

tokenizer = tfds.features.text.Tokenizer()
data = open("news.2011.en.shuffled","r").read()
vocab = list(set(data.split()))  # gives more count
print(len(vocab))

tokenized_data = tokenizer.tokenize(data)
print(len(set(tokenized_data)))  # gives less count

Solution

  • Python's split(), when called with no argument, splits the string only on whitespace characters.

    The tokenize() method of tfds.features.text.Tokenizer() splits text on more than just whitespace; you can see the details in the GitHub code repository. At present there are no default reserved_tokens set, but the alphanum_only property is set to True by default.

    Hence many of the non-alphanumeric characters are filtered out, and words that split() keeps as distinct strings (because of attached punctuation) collapse into the same token, so you end up with fewer unique tokens. The sketch below illustrates the difference.
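
    To see why the counts diverge, here is a minimal sketch on a made-up sentence. It assumes the same tfds.features.text.Tokenizer as in the question, and the expected outputs in the comments simply follow from the behaviour described above:

    import tensorflow_datasets as tfds

    sample = "The news, the news. The news!"

    # str.split() with no argument breaks only on whitespace, so punctuation
    # stays attached and "news,", "news." and "news!" count as three strings.
    split_words = sample.split()
    print(split_words)            # ['The', 'news,', 'the', 'news.', 'The', 'news!']
    print(len(set(split_words)))  # 5 unique strings

    # The tfds tokenizer splits on non-alphanumeric characters and, with the
    # default alphanum_only=True, keeps only the alphanumeric pieces, so all
    # three punctuated variants collapse into the single token "news".
    tokenizer = tfds.features.text.Tokenizer()
    tokens = tokenizer.tokenize(sample)
    print(tokens)            # ['The', 'news', 'the', 'news', 'The', 'news']
    print(len(set(tokens)))  # 3 unique tokens

    On a large corpus like the one in the question this effect compounds, which is why split() reports a noticeably larger vocabulary. If the punctuation pieces should be kept as tokens as well, the alphanum_only property mentioned above can be set to False when constructing the Tokenizer.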