python, tensorflow, tensorflow-datasets

How to easily process texts from a CSV file in Tensorflow?


I have a small dataset that I'm trying to process so that I can later train a model with it. The data comes in a CSV file with two columns, Category and Message: a simple collection of messages that may or may not be spam. I'd like to transform this dataset so that both the categories and the messages are represented as numbers, but I don't quite understand how to do that.

Example data from file:

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"

And then I load it like this:

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

For example, I tried something like this:

def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize=standarize_dataset,
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)
dataset_unbatched = dataset.unbatch().map(lambda x, y: x)

vectorization.adapt(dataset_unbatched)

But then I get an error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.

Looping over this dataset shows that Message is e.g.

OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you'], dtype=object)>)])

and Category:

[b'ham']

I can probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read this data and then process it. Hence the question: how do I easily process texts from a CSV file in TensorFlow?


Solution

  • By modifying the .unbatch().map() operation, I got the code running.

    Please note that your standarize_dataset() function did not work after my modification: it raised TypeError: not all arguments converted during string formatting, because the format string '[$s]' is missing the %s placeholder. In any case, the function can be replaced entirely by passing standardize="lower_and_strip_punctuation" to layers.TextVectorization().
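
    For reference, this is roughly what the custom standardizer would look like with the placeholder fixed (just a sketch; the built-in standardize option above makes it unnecessary):

    def standarize_dataset(dataset):
        lowercase = tf.strings.lower(dataset)
        # '[%s]' (not '[$s]') builds a regex character class from the escaped punctuation
        return tf.strings.regex_replace(
            lowercase, '[%s]' % re.escape(string.punctuation), '')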

    Full code below:

    import re
    import string
    
    import tensorflow as tf
    import tensorflow.keras.layers as layers
    
    file_as_str = """
    ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
    ham,Ok lar... Joking wif u oni...
    spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    ham,U dun say so early hor... U c already then say...
    ham,"Nah I don't think he goes to usf, he lives around here though"
    """
    
    # header=True (below) makes make_csv_dataset skip the first line of the
    # file; the leading newline in file_as_str plays the role of that header row
    with open("example.txt", "w") as f:
        f.write(file_as_str)
    
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern="example.txt",
        batch_size=32,
        column_names=['Category','Message'],
        column_defaults=[tf.string, tf.string],
        label_name='Category',
        field_delim=',',
        header=True,
        num_epochs=1,
    )
    
    
    # the original standardizer, kept for reference with the '[$s]' -> '[%s]'
    # fix applied; it is unused below in favour of the built-in option
    def standarize_dataset(dataset):
        lowercase = tf.strings.lower(dataset)
        return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
    
    
    vectorization = layers.TextVectorization(
        standardize="lower_and_strip_punctuation",
        max_tokens=1000,
        output_mode='int',
        output_sequence_length=200,
    )
    
    
    # the fix: map to the 'Message' tensor instead of the whole feature dict
    dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])
    
    
    vectorization.adapt(dataset_unbatched)
    
    
    vectorized_text = vectorization(next(iter(dataset_unbatched)))
    print(vectorized_text)
    
    # prints:
    # tf.Tensor(
    # [46  8  5 68 64 10 54  2 11  7 52 47 17 66 33 67 22  7  2 65  2 24  8 26
    #  16 25 61 69  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0], shape=(200,), dtype=int64)
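
    If you also want the Category labels as numbers, one option (a sketch using tf.keras.layers.StringLookup, not shown in the code above) is:

    # encode the string labels as integer ids
    labels_unbatched = dataset.unbatch().map(lambda x, y: y)

    label_lookup = layers.StringLookup(num_oov_indices=0)
    label_lookup.adapt(labels_unbatched)

    print(label_lookup.get_vocabulary())
    # e.g. ['ham', 'spam']

    # combine both steps into a fully numeric dataset ready for training
    train_ds = dataset.map(
        lambda features, label: (vectorization(features['Message']),
                                 label_lookup(label)))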