I have the following code, with a custom standardization definition.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    regex = tf.strings.regex_replace(regex, ' +', ' ')
    return tf.strings.split(regex)
vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)
But when I call adapt like this, I get the following error:
vectorize_layer.adapt(['the cat'])
# Error:
InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'the given axis (axis = 2) is not squeezable!'
According to the documentation:
When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out - instead of [["string to split"], ["another string to split"]], the Callable will see ["string to split", "another string to split"]. The callable should return a Tensor with the first dimension containing the split tokens - in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().
But I can't see where the error is.
EDIT: I've done some research in my code.
When I pass an array like ['The other day was raining', 'Please call me later'], the function custom_standardization() returns something like this:
[['the', 'other', 'day', 'was', 'raining'], ['please', 'call', 'me', 'later']]
So it seems it is not preserving the input shape. Why does it change, though?
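To see why the shape changes, here is a minimal sketch (using the same sentences as above): tf.strings.split turns each string into a variable-length list of tokens, adding a ragged dimension, so the output shape no longer matches the input shape.

```python
import tensorflow as tf

# tf.strings.split maps a rank-1 batch of strings to a RaggedTensor of
# per-string tokens, adding a (ragged) dimension to the shape.
texts = tf.constant(['the other day was raining', 'please call me later'])
tokens = tf.strings.split(texts)

print(texts.shape)   # (2,)
print(tokens.shape)  # (2, None) - the shape changed
```

This is exactly what a custom standardize callable must not do: its output shape has to match its input shape.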
I referred to the documentation you shared earlier. The following is mentioned for a custom standardize callable:
When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
So I replaced return tf.strings.split(regex) with return regex (as splitting changes the shape here). Please try it like this:
import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    regex = tf.strings.regex_replace(regex, ' +', ' ')
    return regex

vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)
# check that the input and output shapes match
input = tf.constant([["foo ! @ qux #bar"], ["qux baz"]])
print(input)
print(custom_standardization(input))

vectorize_layer.adapt(["foo qux bar"])
Providing a gist for reference.
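For completeness, a short end-to-end sketch of the corrected setup (the vocabulary strings are the same made-up ones as above; note that in recent TensorFlow versions the layer is also available directly as tf.keras.layers.TextVectorization, without the experimental path):

```python
import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    return tf.strings.regex_replace(regex, ' +', ' ')  # shape is preserved

# Non-experimental path in recent TF versions.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)
vectorize_layer.adapt(["foo qux bar"])

# Each input string becomes a zero-padded int sequence of length 100.
out = vectorize_layer(tf.constant([["foo ! @ qux"], ["bar"]]))
print(out.shape)  # (2, 100)
```

Since the standardize callable now returns a tensor of the same shape as its input, adapt no longer raises the "not squeezable" error.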