I have a small dataset that I'm trying to process so that I can later train a model on it. It's a CSV file with two columns, Category and Message: a simple dataset of messages that may or may not be spam. I'd like to transform it so that both the categories and the messages are represented as numbers, but I don't quite understand how to do that.
Example data from file:
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
And then I load it like this:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)
For example, I tried something like this:
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize=standarize_dataset,
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)
dataset_unbatched = dataset.unbatch().map(lambda x, y: x)
vectorization.adapt(dataset_unbatched)
But then I get an error:
TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.
Looping over this dataset shows that Message is, for example:
OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you' ], dtype=object)>)])
and Category:
[b'ham']
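For reference, the output above comes from a loop roughly like this (take(1) just grabs the first batch):

for features, label in dataset.take(1):
    print(features)       # OrderedDict holding the 'Message' tensor
    print(label.numpy())  # e.g. [b'ham']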
I can probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read and then process this data. In short: how can I easily process text from a CSV file in TensorFlow?
By modifying the .unbatch().map() operation, I got the code running. Note that your standarize_dataset() function still did not work after my modification; it raised TypeError: not all arguments converted during string formatting, because the format string '[$s]' should be '[%s]'. In any case, the custom function can be replaced entirely by specifying standardize="lower_and_strip_punctuation" in layers.TextVectorization().
Full code below:
import re
import string
import tensorflow as tf
import tensorflow.keras.layers as layers
file_as_str = """
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
"""
# the leading newline in file_as_str gives the file an empty first line,
# which header=True below consumes as the (dummy) header row
with open("example.txt", "w") as f:
    f.write(file_as_str)
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="example.txt",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)
# corrected version of the custom standardizer: the original '[$s]' pattern
# caused "TypeError: not all arguments converted during string formatting";
# it is kept here for reference but no longer used below
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
vectorization = layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)
# the key change: pull only the 'Message' string out of each feature dict
dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])
vectorization.adapt(dataset_unbatched)
vectorized_text = vectorization(next(iter(dataset_unbatched)))
print(vectorized_text)
# prints:
# tf.Tensor(
# [46 8 5 68 64 10 54 2 11 7 52 47 17 66 33 67 22 7 2 65 2 24 8 26
# 16 25 61 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0], shape=(200,), dtype=int64)
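The question also asks for the Category labels as numbers. That part isn't covered above, but a StringLookup layer can be adapted the same way as TextVectorization; a minimal sketch (the encode helper is my own naming, and with this data the vocabulary is just ham/spam, indices assigned by frequency):

# build an integer vocabulary from the labels ('ham'/'spam' -> 0/1 here)
label_lookup = layers.StringLookup(num_oov_indices=0)
label_lookup.adapt(dataset.unbatch().map(lambda x, y: y))

# vectorize the message and encode the label in a single map() pass
def encode(features, label):
    return vectorization(features['Message']), label_lookup(label)

encoded = dataset.unbatch().map(encode)
print(next(iter(encoded)))  # (int sequence of shape (200,), scalar int label)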