What is the correct order of steps for preprocessing a text DataFrame with TensorFlow?

I have a pandas DataFrame with two columns, sentences and annotations:

Col 0	Sentence	Annotation
1	[This, is, sentence]	[l1, l2, l3]
2	[This, is, sentence, too]	[l1, l2, l3, l4]

There are several things I need to do:

split into features and labels
split into train-val-test data

vectorize the training data using:

  import tensorflow as tf

  vectorize_layer = tf.keras.layers.TextVectorization(
      max_tokens=maxlen,
      standardize='lower',
      split='whitespace',
      ngrams=(1, 3),
      output_mode='tf_idf',  # the valid spelling uses an underscore, not 'tf-idf'
      pad_to_max_tokens=True,
  )

I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information in the tensors. Specifically, at what point do I have to split into features and labels, and how do I access one or the other? Also, should I split into features and labels before splitting into train-val-test, or is it the other way around? (I want to do this properly and not use sklearn's train_test_split when I work with TensorFlow.)

Solution

You can split your dataset before creating the model. After splitting, tokenize your sentences using

tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)

After tokenizing, convert the sentences to integer sequences with the tokenizer's texts_to_sequences method and pad them to a common length using

training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
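Put together, the order could look like the sketch below. It is only an illustration under a few assumptions: the toy DataFrame, the 80/10/10 split ratio, and the values of vocab_size, oov_tok, max_length and trunc_type are all made up, and fitting the tokenizer on the training split only is a common practice rather than something stated above. Note also that the TensorFlow documentation marks this Tokenizer class as deprecated in favor of TextVectorization.

```python
import pandas as pd
import tensorflow as tf

# Toy data shaped like the DataFrame in the question (values are made up).
df = pd.DataFrame({
    "Sentence": [["This", "is", "sentence"], ["This", "is", "sentence", "too"]] * 5,
    "Annotation": [["l1", "l2", "l3"], ["l1", "l2", "l3", "l4"]] * 5,
})

# 1) Shuffle the rows once, then cut into train/val/test (assumed 80/10/10).
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_train = len(shuffled) * 8 // 10
n_val = len(shuffled) // 10
train_df = shuffled.iloc[:n_train]
val_df = shuffled.iloc[n_train:n_train + n_val]
test_df = shuffled.iloc[n_train + n_val:]

# 2) Separate features and labels inside each split.
#    The token lists are joined back into strings for the tokenizer.
train_sentences = [" ".join(tokens) for tokens in train_df["Sentence"]]
val_sentences = [" ".join(tokens) for tokens in val_df["Sentence"]]
train_labels = list(train_df["Annotation"])
val_labels = list(val_df["Annotation"])

# 3) Fit the tokenizer on the training sentences only (hypothetical settings).
vocab_size, oov_tok, max_length, trunc_type = 100, "<OOV>", 6, "post"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)

# 4) Convert each split to integer sequences and pad to a fixed length.
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
training_sequences = tokenizer.texts_to_sequences(train_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
validation_sequences = tokenizer.texts_to_sequences(val_sentences)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, truncating=trunc_type)
```

Features and labels stay aligned here because they are taken from the same rows of each split, so it does not matter much whether you separate them before or after the train-val-test split, as long as the rows are shuffled together.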

Then you can train your model on the padded data. For more details, please refer to this working code example. Thank you.
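If you would rather keep the TextVectorization layer from the question, the same ordering should apply: split first, call adapt on the training text only, then run every split through the adapted layer. The snippet below is a minimal sketch with made-up sentences and a hypothetical maxlen of 50. Note that the output mode is spelled 'tf_idf' with an underscore.

```python
import tensorflow as tf

# Made-up text standing in for the already-split data.
train_text = ["This is sentence", "This is sentence too", "Another short sentence"]
val_text = ["This is unseen text"]

maxlen = 50  # hypothetical cap on the vocabulary size

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=maxlen,
    standardize="lower",
    split="whitespace",
    ngrams=(1, 3),
    output_mode="tf_idf",
    pad_to_max_tokens=True,
)

# Learn the vocabulary and IDF weights from the training split only.
vectorize_layer.adapt(tf.constant(train_text))

# Apply the same adapted layer to every split.
train_features = vectorize_layer(tf.constant(train_text))
val_features = vectorize_layer(tf.constant(val_text))

# Each row is one sentence; each column is one vocabulary entry.
print(train_features.shape, val_features.shape)
```

The results are ordinary tensors, so train_features.numpy() gives you a NumPy array if you want to inspect the values.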