python tensorflow keras nlp

Error calling adapt in TextVectorization Keras


I have the following code, with a custom standardization definition.

import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
    regex = tf.strings.regex_replace(regex, ' +', ' ')

    return tf.strings.split(regex)

vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)

But when I call adapt like this, I get the following error:

vectorize_layer.adapt(['the cat'])
# Error:
InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'the given axis (axis = 2) is not squeezable!'

According to the documentation,

When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out - instead of [["string to split"], ["another string to split"]], the Callable will see ["string to split", "another string to split"]. The callable should return a Tensor with the first dimension containing the split tokens - in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().


But I can't see where the error is.

EDIT: I did some more digging in my code. When I pass an array like ['The other day was raining', 'Please call me later'], the function custom_standardization() returns something like this:

[['the', 'other', 'day', 'was', 'raining'], ['please', 'call', 'me', 'later']]

So it seems the output does not keep the same shape as the input. Why does it change, though?


Solution

  • I referred to the documentation you shared earlier. It says the following about a custom standardize callable:

    When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.

    So I replaced return tf.strings.split(regex) with return regex, because splitting changes the shape here. Please try it like this:

    import tensorflow as tf
    
    def custom_standardization(input_data):
        lowercase = tf.strings.lower(input_data)
        regex = tf.strings.regex_replace(lowercase, r'[^\w]', ' ')
        regex = tf.strings.regex_replace(regex, ' +', ' ')
    
        return regex
    
    vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        standardize=custom_standardization,
        max_tokens=50000,
        output_mode="int",
        output_sequence_length=100,
    )
    
    # Check that the input and output shapes are the same
    sample = tf.constant([["foo !  @ qux  #bar"], ["qux baz"]])
    print(sample)
    print(custom_standardization(sample))
    
    vectorize_layer.adapt(["foo qux bar"])
    

    Providing gist for reference.
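
To make the shape difference concrete, here is a minimal sketch that compares the output of the cleaning step with and without tf.strings.split, and then runs the fixed layer end to end. The names clean, batch, same_shape, split_up, layer and encoded are illustrative and not from the original post. It also assumes a TensorFlow release where the layer is exposed as tf.keras.layers.TextVectorization; on older releases the experimental path used above should behave the same way.

```python
import tensorflow as tf


def clean(text):
    """Lowercase, turn non-word characters into spaces, collapse repeated spaces."""
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^\w]", " ")
    return tf.strings.regex_replace(text, " +", " ")


batch = tf.constant(["The other day was raining", "Please call me later"])

# Standardizing alone keeps one string per input row, so the shape is unchanged.
same_shape = clean(batch)

# Splitting adds a token dimension of variable length: the result is a RaggedTensor.
split_up = tf.strings.split(same_shape)

print(batch.shape, same_shape.shape, split_up.shape)

# With a shape-preserving standardize callable, the layer's default
# whitespace split does the tokenizing and adapt() runs without the error.
layer = tf.keras.layers.TextVectorization(
    standardize=clean,
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=100,
)
layer.adapt(batch)
encoded = layer(batch)
print(encoded.shape)
```

The split is better left to the layer's split argument (whitespace by default), so the standardize callable only has to return one cleaned string per input string.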