I have a text file containing one sentence per line.
When I create a TextLineDataset and iterate over it with an iterator, it returns the file line by line.
I want to iterate through my file two tokens at a time. Here's my current code:
import tensorflow as tf

sentences = tf.data.TextLineDataset("data/train.src")
iterator = sentences.make_initializable_iterator()
next_element = iterator.get_next()  # one line of the file per sess.run

sess = tf.Session()
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)

elem = sess.run(next_element)  # first line of the file
print(elem)
Is it possible to do so using a TextLineDataset?
EDIT: By "tokens" I mean "words".
Absolutely this is possible, but you have a little bit of wrangling to do. You need to:

1. split each line into its individual words,
2. flatten the result so the dataset yields one word at a time, and
3. batch the words up in pairs.
We can use tf.strings.split for 1.:
words = sentences.map(tf.strings.split)
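For context (assuming TF 2.x): each element of a TextLineDataset is a scalar string, and tf.strings.split applied to a scalar returns a 1-D tensor of its words, so words is a dataset of variable-length word tensors. A quick check with a made-up sentence:

tf.strings.split(tf.constant("the quick brown fox"))
# <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'the', b'quick', b'brown', b'fox'], dtype=object)>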
and flat_map for 2.:
flat_words = words.flat_map(tf.data.Dataset.from_tensor_slices)
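flat_map turns each of those word tensors into its own little dataset via from_tensor_slices and concatenates them, so flat_words yields one scalar word per element. As a side note, if you're on TF 2.x the same flattening can also be written with unbatch (which splits each element along its first dimension):

flat_words = words.unbatch()  # equivalent to the flat_map above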
and batch for 3.:
word_pairs = flat_words.batch(2)
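One thing to watch for: if the file contains an odd number of words in total, the final element will be a lone single-word batch. If you only want complete pairs, batch takes a drop_remainder flag:

word_pairs = flat_words.batch(2, drop_remainder=True)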
and, of course, we could chain all these operations together to give us something like this:
word_pairs = sentences \
.map(tf.strings.split) \
.flat_map(tf.data.Dataset.from_tensor_slices) \
.batch(2)
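To put it end to end, here's a minimal sketch of consuming word_pairs, assuming TF 2.x with eager execution (the printed words are just an illustration); with TF 1.x you would instead plug word_pairs into your existing initializable-iterator/Session code:

import tensorflow as tf

sentences = tf.data.TextLineDataset("data/train.src")
word_pairs = (sentences
              .map(tf.strings.split)                         # one 1-D tensor of words per line
              .flat_map(tf.data.Dataset.from_tensor_slices)  # one word per element
              .batch(2))                                     # two words per element

for pair in word_pairs.take(3):  # first three pairs
    print(pair.numpy())          # e.g. [b'the' b'quick']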