Error calling adapt in TextVectorization Keras

I have the following code, with a custom standardization definition.

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    regex = tf.strings.regex_replace(regex, ' +', ' ')

    return tf.strings.split(regex)

vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)

But when I call adapt like this, I get the following error:

vectorize_layer.adapt(['the cat'])
# Error:
InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'the given axis (axis = 2) is not squeezable!'

According to the documentation,

When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out - instead of [["string to split"], ["another string to split"]], the Callable will see ["string to split", "another string to split"]. The callable should return a Tensor with the first dimension containing the split tokens - in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().

Source: the TextVectorization documentation

But I can't see where the error is.

EDIT: I did some more debugging. When I pass an array like ['The other day was raining', 'Please call me later'], the function custom_standardization() returns something like this:

[['the', 'other', 'day', 'was', 'raining'], ['please', 'call', 'me', 'later']]

So it seems the output does not keep the same shape as the input. Why does it change, though?

Solution

I referred to the documentation you shared earlier. It says the following about a custom standardize callable:

When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
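To make the shape requirement concrete, here is a minimal sketch that compares the shape before and after splitting. The helper name standardize_only is only for illustration, and the sample strings are taken from the question's edit; it uses nothing beyond the tf.strings ops already in the question.

```python
import tensorflow as tf

def standardize_only(input_data):
    # Same cleaning steps as in the question, but without the split.
    lowercase = tf.strings.lower(input_data)
    cleaned = tf.strings.regex_replace(lowercase, r"[^\w]", " ")
    return tf.strings.regex_replace(cleaned, " +", " ")

batch = tf.constant(["The other day was raining", "Please call me later"])

standardized = standardize_only(batch)         # still one string per example
split_result = tf.strings.split(standardized)  # one list of tokens per example

print(batch.shape)         # rank 1
print(standardized.shape)  # same shape as the input
print(split_result.shape)  # one extra (ragged) dimension
```

The standardized tensor keeps the input shape, while tf.strings.split returns a RaggedTensor with one more dimension, which is what the standardize contract forbids.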

So I replaced return tf.strings.split(regex) with return regex, since splitting changes the shape. Please try it like this.

import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    regex = tf.strings.regex_replace(regex, ' +', ' ')

    return regex

vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)

# Check that the input and output shapes are the same
sample_input = tf.constant([["foo !  @ qux  #bar"], ["qux baz"]])
print(sample_input)
print(custom_standardization(sample_input))

vectorize_layer.adapt(["foo qux bar"])
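If you do want to control the splitting yourself, the place for it is the split argument rather than standardize, as the split documentation quoted in the question describes. Below is a rough sketch of that, not a definitive setup. It assumes a TensorFlow version where the layer is available as tf.keras.layers.TextVectorization (on older versions, use the experimental path from the question), and custom_split is a hypothetical helper name.

```python
import tensorflow as tf

def custom_standardization(input_data):
    # Must return a tensor with the same shape as the input.
    lowercase = tf.strings.lower(input_data)
    cleaned = tf.strings.regex_replace(lowercase, r"[^\w]", " ")
    return tf.strings.regex_replace(cleaned, " +", " ")

def custom_split(input_data):
    # Hypothetical helper: receives one string per example and
    # returns the tokens for each example.
    return tf.strings.split(input_data)

# Assumption: the non-experimental layer path exists in your TF version.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    split=custom_split,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)

vectorize_layer.adapt(["The other day was raining", "Please call me later"])

vectors = vectorize_layer(tf.constant(["the cat", "call me later"]))
vocabulary = vectorize_layer.get_vocabulary()

print(vectors.shape)   # one row per input string, padded to 100 tokens
print(vocabulary[:2])  # padding token and OOV token come first
```

With this split of responsibilities, standardize only cleans the text and the layer (or your split callable) does the tokenizing, so adapt no longer hits the squeeze error.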
