Tags: python, amazon-web-services, amazon-sagemaker

"No space left on device" while deploying a fine-tuned model with SageMaker


I fine-tuned the huggingface-llm-mistral-7b model using a SageMaker JumpStartEstimator. The artifacts are stored in S3, compressed as a .tar.gz file.

Now I'm trying to deploy that model using the Python SDK with the following code:

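# Imports needed by the snippet (SageMaker Python SDK)
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

INFERENCE_INSTANCE_TYPE = "ml.g5.2xlarge"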
MODEL_ID = "huggingface-llm-mistral-7b"
MODEL_VERSION = "*"
SAGEMAKER_ROLE = "arn:aws:iam::257342474:role/AmazonSageMakerFullAccess"

endpoint_name = name_from_base(f"jumpstart-{MODEL_ID}")

deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=MODEL_ID,
    model_version=MODEL_VERSION,
    instance_type=INFERENCE_INSTANCE_TYPE,
)
deploy_source_uri = script_uris.retrieve(
    model_id=MODEL_ID, model_version=MODEL_VERSION, script_scope="inference"
)
base_model_uri = model_uris.retrieve(
    model_id=MODEL_ID, model_version=MODEL_VERSION, model_scope="inference"
)
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data="s3://path/to/model/model.tar.gz",
    entry_point="inference.py",
    role=SAGEMAKER_ROLE,
    predictor_cls=Predictor,
    name=endpoint_name,
)

base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=INFERENCE_INSTANCE_TYPE,
    endpoint_name=endpoint_name,
    volume_size=50,
)

And the error I get is:

OSError: [Errno 28] No space left on device

My instance should have enough RAM and disk space, so I can't figure out why this error is raised.

The error seems to come from the model creation step, so I tried replacing the deploy call with:

model.create(instance_type=INFERENCE_INSTANCE_TYPE)

And the same error is raised.

I also tried increasing the volume size, using an ml.g5.12xlarge instance, and using a ServerlessInferenceConfig, all to no effect.

Could anyone advise on how to fix this, or how to troubleshoot the source of the error?


Solution

  • I recommend the following steps to investigate the issue.

    1. Identify which environment raises "No space left on device": your development environment (where you run the script) or the model container (which hosts the endpoint).
    2. Check the full call stack to find the direct cause of the error.
    3. If the error occurs in the container, use SSM to access it.
    4. Run the "df" command in that environment to see which disk is reaching 100% utilization.

    The solution depends on the result of these steps.
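
As a rough sketch of steps 1, 2 and 4 for the case where the error comes from the development environment, the snippet below wraps a call, prints the full traceback if it fails with errno 28, and reports disk usage for a few paths (a Python equivalent of reading the "df" output). The helper names report_disk_usage, is_no_space_error and run_and_diagnose are hypothetical and not part of the SageMaker SDK; only the standard library is used, and the paths to inspect are an assumption you should adapt to your machine.

```python
import errno
import shutil
import tempfile
import traceback


def report_disk_usage(paths):
    """Print and return the percentage of disk used for each path."""
    usage = {}
    for path in paths:
        total, used, free = shutil.disk_usage(path)
        # Guard against pseudo-filesystems that report a total of 0 bytes.
        percent = 100.0 * used / total if total else 0.0
        usage[path] = percent
        print(f"{path}: {percent:.1f}% used, {free / 1e9:.1f} GB free")
    return usage


def is_no_space_error(exc):
    """Return True if exc is an OSError with errno 28 (ENOSPC)."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def run_and_diagnose(step, paths):
    """Run step(); on a 'No space left on device' error, print diagnostics."""
    try:
        return step()
    except OSError as exc:
        if is_no_space_error(exc):
            # The traceback shows which call actually ran out of space.
            traceback.print_exc()
            report_disk_usage(paths)
        raise
```

It could be used, for example, as run_and_diagnose(lambda: model.create(instance_type=INFERENCE_INSTANCE_TYPE), ["/", tempfile.gettempdir()]). If the traceback points into code running on your own machine and one of the reported paths is close to 100%, the problem is local disk space rather than the endpoint instance; in that case, changing the instance type or volume size would not be expected to help.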