
What is the sequence for preprocessing text df with tensorflow?


I have a pandas DataFrame with two columns, sentences and annotations:

    Col 0   Sentence                     Annotation
    1       [This, is, sentence]         [l1, l2, l3]
    2       [This, is, sentence, too]    [l1, l2, l3, l4]

There are several things I need to do:

  • split into features and labels

  • split into train-val-test data

  • vectorize train data using:

      vectorize_layer = tf.keras.layers.TextVectorization(
         max_tokens=maxlen,
         standardize='lower',
         split='whitespace',
         ngrams=(1,3),
         output_mode='tf_idf',  # the accepted value is 'tf_idf' (underscore)
         pad_to_max_tokens=True,)
    

I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information in the tensors. Specifically, at what point should I split into features and labels, and how do I access one or the other? And should I split into features and labels before splitting into train-val-test, or the other way around? (I want to do this properly and not use sklearn's train_test_split when working with tensorflow.)


Solution

  • Split your dataset before you build the model. After splitting, tokenize your sentences using

    tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
    

    After tokenizing, pad the resulting sequences to a common length using

    training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
    

    Then you can train your model on the padded data. For more details, please refer to this working code example. Thank you.
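
To make the ordering concrete, here is a minimal sketch that uses the TextVectorization layer from the question rather than the Tokenizer above. It is not a definitive implementation: the toy data, the 80/10/10 split, the helper name to_text and the value max_tokens = 20 are all assumptions made for illustration. The idea is to split whole rows first (so each sentence stays paired with its annotation), then take features and labels as the two columns of each split, and call adapt on the training features only so the vocabulary is never learned from validation or test rows.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Toy data shaped like the frame in the question (made up for illustration).
df = pd.DataFrame({
    "Sentence": [["This", "is", "sentence"], ["This", "is", "sentence", "too"]] * 5,
    "Annotation": [["l1", "l2", "l3"], ["l1", "l2", "l3", "l4"]] * 5,
})

# 1. Split rows into train/val/test (80/10/10 is an arbitrary choice).
#    Splitting whole rows keeps each sentence paired with its annotation.
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(df))
n_train, n_val = int(0.8 * len(df)), int(0.1 * len(df))
train_df = df.iloc[idx[:n_train]]
val_df = df.iloc[idx[n_train:n_train + n_val]]
test_df = df.iloc[idx[n_train + n_val:]]

# 2. Features and labels are simply the two columns of each split.
def to_text(frame):
    # TextVectorization expects strings, so re-join the token lists.
    return tf.constant([" ".join(tokens) for tokens in frame["Sentence"]])

y_train = train_df["Annotation"].tolist()
y_val = val_df["Annotation"].tolist()
y_test = test_df["Annotation"].tolist()

# 3. Build the layer and learn the vocabulary from the training split only.
max_tokens = 20  # hypothetical vocabulary size for this toy example
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    standardize="lower",
    split="whitespace",
    ngrams=(1, 3),          # a tuple selects exactly these n-gram sizes
    output_mode="tf_idf",
    pad_to_max_tokens=True,
)
vectorize_layer.adapt(to_text(train_df))

# 4. Apply the fitted layer to every split.
x_train = vectorize_layer(to_text(train_df))
x_val = vectorize_layer(to_text(val_df))
x_test = vectorize_layer(to_text(test_df))

print(x_train.shape, len(y_train))
```

Either order of the two splits works as long as the rows stay aligned; splitting the frame first just makes that automatic. Note that tf_idf output gives one fixed-length vector per sentence, so the per-token annotations are kept here as plain lists rather than aligned to it.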