Tags: azure, nvidia, azure-machine-learning-service, roberta-language-model, sentence-transformers

Segmentation fault error when importing sentence_transformers on Azure Machine Learning Service NVIDIA compute


I would like to use sentence_transformers in AML to run an XLM-RoBERTa model for sentence embeddings. I have a script in which I import sentence_transformers:

from sentence_transformers import SentenceTransformer
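
For reference, the typical usage after this import is along these lines (a minimal sketch with a placeholder multilingual model name, not the actual pipeline script):

from sentence_transformers import SentenceTransformer

# Placeholder multilingual model; the actual target is an XLM-RoBERTa-based model.
model = SentenceTransformer("distiluse-base-multilingual-cased")

sentences = ["This is an example sentence.", "Dies ist ein Beispielsatz."]
embeddings = model.encode(sentences)  # one fixed-size embedding vector per sentence
print(embeddings.shape)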

When I run my AML pipeline, the run fails on this script with the following error:

AzureMLCompute job failed.
UserProcessKilledBySystemSignal: Job failed since the user script received system termination signal usually due to out-of-memory or segfault.
    Cause: segmentation fault
    TaskIndex: 
    NodeIp: #####
    NodeId: #####

I'm pretty sure this import is causing the error, because if I comment it out, the rest of the script runs. This is strange because the installation of sentence_transformers succeeds.

These are the details of my compute:

Virtual machine size: STANDARD_NV24 (24 Cores, 224 GB RAM, 1440 GB Disk)
Processing unit: GPU - 4 x NVIDIA Tesla M60

Agent Pool: Azure Pipelines

Agent Specification: ubuntu-16.04

requirements.txt file:

torch==1.4.0
sentence-transformers

Does anyone have a solution for this error?


Solution

  • I fixed the issue by changing the PyTorch version from 1.4.0 to 1.6.0, so the requirements.txt now looks like this:

    torch==1.6.0
    sentence-transformers
    

    At first I tried one of the older versions of sentence-transformers that was compatible with PyTorch 1.4.0, but that version doesn't support the "xlm-roberta-base" model, so I upgraded PyTorch instead.
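
    With torch 1.6.0 and a recent sentence-transformers, the checkpoint can be wrapped explicitly via the library's Transformer and Pooling modules, roughly like this (a sketch; mean pooling is the default pooling mode):

    from sentence_transformers import SentenceTransformer, models

    # Wrap the Hugging Face "xlm-roberta-base" checkpoint as a sentence embedder.
    word_embedding_model = models.Transformer("xlm-roberta-base")
    # Pooling defaults to mean pooling over the token embeddings.
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    embeddings = model.encode(["A quick sanity-check sentence."])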