How to save TextVectorization to disk in tensorflow?


I have trained a TextVectorization layer (see below) and want to save it to disk so that I can reload it later. I have tried pickle and joblib.dump(), but neither works.

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_dataset = tf.data.Dataset.from_tensor_slices(text_clean)

vectorizer = TextVectorization(max_tokens=100000, output_mode='tf-idf', ngrams=None)

vectorizer.adapt(text_dataset.batch(1024))

The generated error is the following:

InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array

How can I save it?


Solution

  • Instead of pickling the layer object itself, pickle its configuration and weights. Later, unpickle them, use the configuration to recreate the layer, and load the saved weights into it. See the official docs.

    Code

    import pickle

    import tensorflow as tf
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    text_dataset = tf.data.Dataset.from_tensor_slices([
        "this is some clean text",
        "some more text",
        "even some more text"])

    # Fit a TextVectorization layer
    vectorizer = TextVectorization(max_tokens=10, output_mode='tf-idf', ngrams=None)
    vectorizer.adapt(text_dataset.batch(1024))

    # Vector for the word "this"
    print(vectorizer("this"))

    # Pickle the config and weights
    with open("tv_layer.pkl", "wb") as f:
        pickle.dump({'config': vectorizer.get_config(),
                     'weights': vectorizer.get_weights()}, f)

    print("*" * 10)

    # Later you can unpickle the file and use
    # `config` to recreate the layer and
    # `weights` to load the trained weights.
    with open("tv_layer.pkl", "rb") as f:
        from_disk = pickle.load(f)
    new_v = TextVectorization.from_config(from_disk['config'])
    # You have to call `adapt` with some dummy data first (bug in Keras)
    new_v.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_v.set_weights(from_disk['weights'])

    # Let's see the vector for the word "this"
    print(new_v("this"))
    

    Output:

    tf.Tensor(
    [[0.         0.         0.         0.         0.91629076 0.
      0.         0.         0.         0.        ]], shape=(1, 10), dtype=float32)
    **********
    tf.Tensor(
    [[0.         0.         0.         0.         0.91629076 0.
      0.         0.         0.         0.        ]], shape=(1, 10), dtype=float32)
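
If you need this in more than one place, the same config-plus-weights pattern can be wrapped in two small helpers. This is only a sketch: `save_layer`, `load_layer` and the `dummy_data` parameter are names made up here, not part of Keras. The helpers rely only on the layer exposing `get_config`, `get_weights`, `from_config`, `set_weights` and `adapt`, as used in the code above. Whether the dummy `adapt` call is needed (and whether `get_weights` carries the vocabulary at all) may depend on your TensorFlow version, so check the round trip on your own setup.

```python
import pickle


def save_layer(layer, path):
    """Pickle a layer's config and weights (not the layer object itself)."""
    state = {"config": layer.get_config(), "weights": layer.get_weights()}
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_layer(layer_cls, path, dummy_data=None):
    """Rebuild a layer from a file written by save_layer (hypothetical helper)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    # Recreate an untrained layer from the saved configuration
    layer = layer_cls.from_config(state["config"])
    if dummy_data is not None:
        # Workaround from the answer above: adapt on throwaway data
        # before loading the real weights
        layer.adapt(dummy_data)
    layer.set_weights(state["weights"])
    return layer
```

With the layer from the answer, you would call `save_layer(vectorizer, "tv_layer.pkl")` and later `load_layer(TextVectorization, "tv_layer.pkl", dummy_data=tf.data.Dataset.from_tensor_slices(["xyz"]))`. As with any pickle file, only load files you created yourself.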