Tags: python, tensorflow, tensorflow-datasets, language-model

N-grams from a TensorFlow TextLineDataset


I have a text file containing one sentence per line.

When I create a TextLineDataset and iterate over it with an iterator, it returns the file line by line.

I want to iterate through my file two tokens at a time. Here's my current code:

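import tensorflow as tf  # TensorFlow 1.x API (graph mode with sessions)

# Each element of this dataset is one line of the file.
sentences = tf.data.TextLineDataset("data/train.src")
iterator = sentences.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()

sess.run(tf.tables_initializer())
sess.run(iterator.initializer)

# Fetches and prints the first line of the file.
elem = sess.run(next_element)
print(elem)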

Is it possible to do so using a TextLineDataset?

EDIT: By "tokens" I mean "words".


Solution

  • Absolutely, this is possible, but it takes a little wrangling. You need to:

    1. split each line into words
    2. flatten this to a single stream of words
    3. batch the words into pairs
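
To make the target concrete, here is a minimal plain-Python sketch of what the three steps do, with no TensorFlow involved. The two sample lines are hypothetical stand-ins for the contents of the file. Note that pairs are formed on the flattened stream, so a pair can straddle two lines, and an odd word count leaves a final group with one word.

```python
# Hypothetical sample lines standing in for the text file.
lines = ["the cat sat", "on the mat today"]

# Step 1: split each line into words (whitespace-separated).
split_lines = [line.split() for line in lines]

# Step 2: flatten the per-line word lists into a single stream of words.
stream = [word for words in split_lines for word in words]

# Step 3: group the stream into consecutive, non-overlapping chunks of 2.
pairs = [stream[i:i + 2] for i in range(0, len(stream), 2)]

print(pairs)
# [['the', 'cat'], ['sat', 'on'], ['the', 'mat'], ['today']]
```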

    We can use tf.strings.split for step 1:

    words = sentences.map(tf.strings.split)
    

    flat_map for step 2:

    flat_words = words.flat_map(tf.data.Dataset.from_tensor_slices)
    

    and batch for step 3 (the final batch holds a single word if the total word count is odd):

    word_pairs = flat_words.batch(2)
    

    Of course, we can chain all these operations together:

    word_pairs = (
        sentences
        .map(tf.strings.split)
        .flat_map(tf.data.Dataset.from_tensor_slices)
        .batch(2)
    )
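
The chained pipeline can be tried end to end with the sketch below. It assumes TensorFlow 2.x with eager execution, where tf.strings.split applied to a scalar string returns a 1-D tensor of tokens, and it writes a hypothetical two-line corpus to a temporary file in place of data/train.src. The second half rests on an assumption about intent: if overlapping bigrams are wanted rather than disjoint pairs, Dataset.window with shift=1 is one way to get them. In both variants, pairs can span line boundaries.

```python
import os
import tempfile

import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution

# Hypothetical two-line corpus standing in for data/train.src.
path = os.path.join(tempfile.mkdtemp(), "train.src")
with open(path, "w") as f:
    f.write("the cat sat\non the mat\n")

sentences = tf.data.TextLineDataset(path)

# Steps 1 and 2: split each line on whitespace, then flatten to one word stream.
flat_words = sentences.map(tf.strings.split).flat_map(
    tf.data.Dataset.from_tensor_slices)

# Step 3: non-overlapping pairs. drop_remainder=True discards a trailing
# single word when the total word count is odd.
word_pairs = flat_words.batch(2, drop_remainder=True)
pairs = [[w.decode() for w in pair] for pair in word_pairs.as_numpy_iterator()]

# Variant (assumed intent): overlapping bigrams via a sliding window of size 2.
# Each window is itself a small dataset, so it is batched back into one tensor.
bigram_ds = flat_words.window(2, shift=1, drop_remainder=True).flat_map(
    lambda window: window.batch(2))
bigrams = [[w.decode() for w in pair] for pair in bigram_ds.as_numpy_iterator()]

print(pairs)
print(bigrams)
```

With the TensorFlow 1.x session style used in the question, the same datasets would be consumed through an iterator and sess.run instead of as_numpy_iterator.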