Tags: palantir-foundry, foundry-code-workbooks

Import pre-trained deep learning models into Foundry Code Workbooks


How do you import an h5 model from Foundry into Code Workbook? I want to use the Hugging Face transformers library as shown below, and in its documentation the from_pretrained method expects a path to where the pretrained model lives.

I would ideally like to download the model onto my local machine, upload it to Foundry, and have Foundry read the model in.

For reference, I'm trying to do this in Code Workbook or Code Authoring. It looks like you can work directly with files from there, but I've read the documentation and the given example was for a CSV file, whereas this model consists of a variety of files in h5 and json format. I'm wondering how I can access these files and pass them into the from_pretrained method from the transformers package.

Relevant links:

  • Quick tour: https://huggingface.co/transformers/quicktour.html
  • Pre-trained model: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/tree/main

Thank you!


Solution

  • I've gone ahead and added the transformers (Hugging Face) package to the platform.

    As for uploading the model files, you can follow these steps:

    1. Use the dataset containing the model-related files as an input to your Code Workbook transform

    2. Use Python's raw file access to read the contents of the dataset: https://docs.python.org/3/library/filesys.html

    3. Use Python's built-in tempfile module to create a folder and write the files from step 2 into it: https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp

    4. Pass the temp folder path (tempfile.mkdtemp() returns the absolute path) to the from_pretrained method

    import os
    import tempfile
    
    from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
    
    def sample(dataset_with_model_folder_uploaded):
      full_folder_path = tempfile.mkdtemp()
    
      # Replace ... with every remaining file in the uploaded model folder
      all_file_names = ['config.json', 'tf_model.h5', ...]
    
      for file_name in all_file_names:
        # Open in binary mode: the h5 weights are not text
        with dataset_with_model_folder_uploaded.filesystem().open(file_name, 'rb') as f:
          path_of_file = os.path.join(full_folder_path, file_name)
          with open(path_of_file, 'wb') as new_file:
            new_file.write(f.read())
    
      model = TFAutoModelForSequenceClassification.from_pretrained(full_folder_path)
      tokenizer = AutoTokenizer.from_pretrained(full_folder_path)
      return model, tokenizer
    
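    The copy-to-tempdir pattern in steps 2–4 can be exercised with plain stdlib code, outside Foundry. In this sketch a local directory of dummy files stands in for the dataset's filesystem, and the file names are placeholders; the returned absolute path is what you would hand to from_pretrained:

    ```python
    import os
    import tempfile

    def copy_files_to_temp_dir(source_dir, file_names):
        """Copy the named files from source_dir into a fresh temp folder
        and return that folder's absolute path."""
        full_folder_path = tempfile.mkdtemp()
        for file_name in file_names:
            # Binary mode throughout: model weights (e.g. tf_model.h5) are not text
            with open(os.path.join(source_dir, file_name), "rb") as src:
                with open(os.path.join(full_folder_path, file_name), "wb") as dst:
                    dst.write(src.read())
        return full_folder_path

    # Stand-in for the Foundry dataset: a directory with dummy model files
    source = tempfile.mkdtemp()
    for name in ["config.json", "tf_model.h5"]:
        with open(os.path.join(source, name), "wb") as f:
            f.write(b"dummy contents")

    target = copy_files_to_temp_dir(source, ["config.json", "tf_model.h5"])
    print(sorted(os.listdir(target)))  # ['config.json', 'tf_model.h5']
    ```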

    Thanks,