python, tensorflow, tensorflow-datasets

How to easily process texts from a CSV file in Tensorflow?


I have a small dataset that I'm trying to process so that I can later train a model with it. The data comes in a CSV file with two columns, Category and Message: a simple collection of messages that may or may not be spam. I'd like to transform this dataset so that both the categories and the messages are represented as numbers, but I don't quite understand how to do that.

Example data from file:

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"

And then I load it like this:

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

For example, I tried something like this:

def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize=standarize_dataset,
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)
dataset_unbatched = dataset.unbatch().map(lambda x, y: x)

vectorization.adapt(dataset_unbatched)

But then I get an error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.

Looping over this dataset shows that Message is e.g.

OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you'], dtype=object)>)])

and Category:

[b'ham']

I can probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read this data and then process it. Hence the question: how do I easily process texts from a CSV file in TensorFlow?


Solution

  • By modifying the .unbatch().map() operation, I got the code running.

    Please note that your standarize_dataset() function did not work after my modification: it raised TypeError: not all arguments converted during string formatting, because the format string '[$s]' is missing the %s placeholder. In any case, the function can be replaced entirely by passing standardize="lower_and_strip_punctuation" to layers.TextVectorization().
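
    For reference, this is roughly what the custom standardizer would look like with the placeholder fixed (just a sketch; the built-in standardize option above makes it unnecessary):

    def standarize_dataset(dataset):
        lowercase = tf.strings.lower(dataset)
        # '[%s]' (not '[$s]') builds a regex character class from the escaped punctuation
        return tf.strings.regex_replace(
            lowercase, '[%s]' % re.escape(string.punctuation), '')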

    Full code below:

    import re
    import string
    
    import tensorflow as tf
    import tensorflow.keras.layers as layers
    
    file_as_str = """
    ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
    ham,Ok lar... Joking wif u oni...
    spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    ham,U dun say so early hor... U c already then say...
    ham,"Nah I don't think he goes to usf, he lives around here though"
    """
    
    # header=True (below) makes make_csv_dataset skip the first line of the
    # file; the leading newline in file_as_str plays the role of that header row
    with open("example.txt", "w") as f:
        f.write(file_as_str)
    
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern="example.txt",
        batch_size=32,
        column_names=['Category','Message'],
        column_defaults=[tf.string, tf.string],
        label_name='Category',
        field_delim=',',
        header=True,
        num_epochs=1,
    )
    
    
    # the original standardizer, kept for reference with the '[$s]' -> '[%s]'
    # fix applied; it is unused below in favour of the built-in option
    def standarize_dataset(dataset):
        lowercase = tf.strings.lower(dataset)
        return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
    
    
    vectorization = layers.TextVectorization(
        standardize="lower_and_strip_punctuation",
        max_tokens=1000,
        output_mode='int',
        output_sequence_length=200,
    )
    
    
    # the fix: map to the 'Message' tensor instead of the whole feature dict
    dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])
    
    
    vectorization.adapt(dataset_unbatched)
    
    
    vectorized_text = vectorization(next(iter(dataset_unbatched)))
    print(vectorized_text)
    
    # prints:
    # tf.Tensor(
    # [46  8  5 68 64 10 54  2 11  7 52 47 17 66 33 67 22  7  2 65  2 24  8 26
    #  16 25 61 69  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    #   0  0  0  0  0  0  0  0], shape=(200,), dtype=int64)
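
    If you also want the Category labels as numbers, one option (a sketch using tf.keras.layers.StringLookup, not shown in the code above) is:

    # encode the string labels as integer ids
    labels_unbatched = dataset.unbatch().map(lambda x, y: y)

    label_lookup = layers.StringLookup(num_oov_indices=0)
    label_lookup.adapt(labels_unbatched)

    print(label_lookup.get_vocabulary())
    # e.g. ['ham', 'spam']

    # combine both steps into a fully numeric dataset ready for training
    train_ds = dataset.map(
        lambda features, label: (vectorization(features['Message']),
                                 label_lookup(label)))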