python tensorflow tensorflow-datasets language-model

Ngrams from Tensorflow TextLineDataset

I have a text file containing one sentence per line.

When I create a TextLineDataset and iterate over it, it returns the file line by line.

I want to iterate through my file two tokens at a time. Here's my current code:

import tensorflow as tf  # TensorFlow 1.x API (sessions, initializable iterators)

sentences = tf.data.TextLineDataset("data/train.src")
iterator = sentences.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()

sess.run(tf.tables_initializer())
sess.run(iterator.initializer)

elem = sess.run(next_element)
print(elem)

Is it possible to do this using a TextLineDataset?

EDIT: By "tokens" I mean "words".

Solution

This is absolutely possible, but you have a little bit of wrangling to do. You need to:

1. split each line into words
2. flatten the result into a single stream of words
3. batch the stream into groups of two
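
To make the intended result concrete before turning to the tf.data calls, here is a minimal plain-Python sketch of the same three steps. The two sample sentences are made up for illustration and merely stand in for the lines of data/train.src.

```python
from itertools import chain

# Hypothetical sample lines standing in for the contents of data/train.src.
lines = ["the cat sat", "on the mat"]

# Step 1: split each line into words (whitespace split).
words_per_line = [line.split() for line in lines]

# Step 2: flatten the per-line lists into a single stream of words.
flat_words = list(chain.from_iterable(words_per_line))

# Step 3: group the stream into consecutive, non-overlapping chunks of two.
word_pairs = [flat_words[i:i + 2] for i in range(0, len(flat_words), 2)]

print(word_pairs)
# [['the', 'cat'], ['sat', 'on'], ['the', 'mat']]
```

Note that because the words are flattened before they are grouped, a pair such as ['sat', 'on'] can span two lines of the file.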

We can use tf.strings.split for step 1:

words = sentences.map(tf.strings.split)

flat_map for step 2:

flat_words = words.flat_map(tf.data.Dataset.from_tensor_slices)

and batch for step 3, which groups consecutive words into non-overlapping pairs:

word_pairs = flat_words.batch(2)

And, of course, we can chain all of these operations together:

word_pairs = (
    sentences
    .map(tf.strings.split)
    .flat_map(tf.data.Dataset.from_tensor_slices)
    .batch(2)
)
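
For anyone who wants to try the pipeline end to end, the following is a small self-contained sketch. It makes a few assumptions that are not in the question: it targets TensorFlow 2.x with eager execution (so no Session or iterator initializer is needed), it feeds two made-up sentences through tf.data.Dataset.from_tensor_slices instead of reading data/train.src, and it passes drop_remainder=True so that a leftover single word is dropped rather than emitted as a batch of one.

```python
import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled

# Hypothetical in-memory sentences; with a real file you would use
# tf.data.TextLineDataset("data/train.src") here instead.
sentences = tf.data.Dataset.from_tensor_slices(
    ["the cat sat", "on the mat now"]
)

word_pairs = (
    sentences
    .map(tf.strings.split)                          # each line -> 1-D tensor of words
    .flat_map(tf.data.Dataset.from_tensor_slices)   # one word per element
    .batch(2, drop_remainder=True)                  # non-overlapping pairs
)

# Convert the byte strings to Python str for readable output.
pairs = [
    [word.decode("utf-8") for word in pair]
    for pair in word_pairs.as_numpy_iterator()
]
print(pairs)
# [['the', 'cat'], ['sat', 'on'], ['the', 'mat']]
```

The seventh word, "now", has no partner and is dropped because of drop_remainder=True. If you need overlapping bigrams instead of non-overlapping pairs, one option is to replace the final batch call with .window(2, shift=1, drop_remainder=True).flat_map(lambda w: w.batch(2)).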