python-3.x tensorflow tensorflow2.0 machine-translation

TensorFlow: Creating a custom text dataset to use in machine translation


I want to train a Transformer-based machine translation model on my own data. A set of datasets is already available in TFDS (TensorFlow Datasets), and there is also an option to add a new dataset to TFDS. But what if I don't want to wait for an add request to go through and would rather train directly on my data?

In the example colab notebook, they use the following to create train and validation data:

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

I believe TFDS does a lot of preprocessing to fit the data into the pipeline, and the result is of Dataset type:

type(train_examples)

tensorflow.python.data.ops.dataset_ops._OptionsDataset

But for custom CSV data like the example below, how do I create a 'Dataset' that is compatible with this model?

import pandas as pd 

# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14],['tom', 10], ['nick', 15]]
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 

# print dataframe. 
df 

Solution

  • The dataset in the Colab notebook is just a collection of pairs of strings (the translated sentence pairs). That doesn't seem to be what you have here (you have name and age?).

    However, it is certainly possible to create a Dataset from a CSV of language pairs (or name and age, for that matter!). There is a comprehensive guide to the Dataset API here: https://www.tensorflow.org/guide/datasets but essentially, given a CSV file named "translations.csv" that looks like this:

    hola,hello
    adios,goodbye
    perro,dog
    huevos,eggs
    ...
    

    then we can just do:

    import tensorflow as tf
    my_dataset = tf.data.experimental.CsvDataset("translations.csv", [tf.string, tf.string])
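
    To see what such a dataset yields, here is a minimal end-to-end sketch. It assumes TensorFlow 2.x with eager execution; the sample rows, the temporary file location, and the one-row validation split are made up for illustration and are not part of the Colab notebook. Each element comes out as a (source, target) tuple of string tensors, which is the same shape of data that `as_supervised=True` gives you from TFDS.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical sample rows standing in for a real translations.csv.
rows = [("hola", "hello"), ("adios", "goodbye"), ("perro", "dog"), ("huevos", "eggs")]
path = os.path.join(tempfile.mkdtemp(), "translations.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

# One entry in record_defaults per CSV column; a dtype marks the column as required.
my_dataset = tf.data.experimental.CsvDataset(path, [tf.string, tf.string])

# Illustrative split: hold out the first row for validation, train on the rest.
val_examples = my_dataset.take(1)
train_examples = my_dataset.skip(1)

# Each element is a (source, target) tuple of scalar string tensors.
pairs = [
    (src.numpy().decode("utf-8"), tgt.numpy().decode("utf-8"))
    for src, tgt in train_examples
]
print(pairs)
```

    For a real corpus you would pick a proper split size (and probably shuffle first) rather than holding out a single row.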
    

    Similarly, for your name/age dataset you could do something like:

    my_dataset = tf.data.experimental.CsvDataset("ages.csv", [tf.string, tf.int32])
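
    If the data is already in a pandas DataFrame, as in the question, a possible alternative is to skip the CSV file and build the Dataset in memory with `tf.data.Dataset.from_tensor_slices`. This is only a sketch of a different route than the `CsvDataset` one above, assuming TensorFlow 2.x and that the whole DataFrame fits in memory:

```python
import pandas as pd
import tensorflow as tf

# The DataFrame from the question.
data = [["tom", 10], ["nick", 15], ["juli", 14], ["tom", 10], ["nick", 15]]
df = pd.DataFrame(data, columns=["Name", "Age"])

# A tuple of per-column lists becomes a dataset of (name, age) pairs, one per row.
dataset = tf.data.Dataset.from_tensor_slices(
    (df["Name"].tolist(), df["Age"].tolist())
)

for name, age in dataset.take(2):
    print(name.numpy().decode("utf-8"), int(age.numpy()))
```

    The same pattern applies to two string columns of translation pairs; only the column names change.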