Tags: python, tensorflow, google-colaboratory, pickle

Save/Export a custom tokenizer from a Google Colab notebook


I have a custom tokenizer and want to use it for prediction in a production API. How do I save/download the tokenizer?

This is my code trying to save it:

import pickle
from tensorflow.python.lib.io import file_io

# Serialize the fitted tokenizer to tokenizer.pickle
with file_io.FileIO('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

There is no error, but I can't find the tokenizer after saving it, so I assume the code didn't work?


Solution

  • Here is the situation, using a simple file to disentangle the issue from irrelevant specifics such as pickle, TensorFlow, and tokenizers:

    # Run in a new Colab notebook:
    %pwd
    /content
    %ls
    sample_data/
    

    Let's save a simple file foo.npy:

    import numpy as np
    np.save('foo', np.array([1,2,3]))
    
    %ls
    foo.npy  sample_data/
    

    At this stage, %ls should show tokenizer.pickle in your case instead of foo.npy. In other words, the save did work; the file just lives in the local Colab filesystem, which is discarded when the session ends.
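
    If you prefer to verify from Python rather than with a shell magic, a quick standard-library check (nothing Colab-specific is assumed here) confirms where the file landed:

    import os

    # True if the pickle was written to the current working directory
    print(os.path.exists('tokenizer.pickle'))

    # /content in a default Colab session
    print(os.getcwd())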

    Now, Google Drive and Colab do not communicate by default; you have to mount the drive first (it will ask for authorization):

    from google.colab import drive
    drive.mount('/content/drive')
    
    Mounted at /content/drive
    

    After which, an %ls command will give:

    %ls
    drive/  foo.npy  sample_data/
    

    and you can now navigate (and save) inside drive/ (i.e. actually in your Google Drive) by changing the path accordingly. Anything saved there persists after the Colab session ends and can be retrieved later.
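
    Putting it back in terms of the original question, here is a minimal sketch that saves the tokenizer straight into the mounted Drive and reloads it. It assumes the default mount point /content/drive and the standard MyDrive folder (shown as "My Drive", with a space, on some older mounts); adjust the path to your own layout:

    import pickle

    # Hypothetical target path inside the mounted Drive; adjust as needed
    drive_path = '/content/drive/MyDrive/tokenizer.pickle'

    # A plain open() works once the path points into the mounted Drive
    with open(drive_path, 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Later (e.g. when preparing your production API), load it back
    with open(drive_path, 'rb') as handle:
        tokenizer = pickle.load(handle)

    Alternatively, if you just want the file on your local machine without going through Drive, google.colab.files.download('tokenizer.pickle') should trigger a browser download of the locally saved file.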