We were tasked with running inference for Llama-2 models (specifically the 7B and 13B chat models), so we chose an inf1 instance (inf1.6xlarge). During installation we opted for the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04). We followed the steps in the AWS Neuron setup guide to set up the instance for inference, and then followed the Llama-2-13B sampling tutorial. During this process we ran into Neuron runtime errors.
What we have tried so far:
While searching online for a solution, we found that inf1 instances have a separate installation path, so we followed the inf1 AWS Neuron installation guide and completed that setup. However, running the sampling code mentioned above still produced a Neuron error. Please help us run inference on the inf1 instance and clarify the question below.
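One quick way to confirm which Neuron stack is actually installed on the instance is to check for the two PyTorch integration packages. This is a minimal diagnostic sketch (not from the tutorial): inf1 uses the first-generation stack (torch-neuron / neuron-cc), while trn1 and inf2 use the second-generation stack (torch-neuronx / neuronx-cc), and the Llama-2 sampling tutorial's transformers-neuronx library depends on the latter.

```python
import importlib.util

def detect_neuron_stack():
    """Report which AWS Neuron PyTorch integration is importable.

    torch_neuron  -> first-generation SDK, targets inf1 (Inferentia1)
    torch_neuronx -> second-generation SDK, targets trn1/inf2
    """
    stacks = {
        "inf1 stack (torch-neuron)": "torch_neuron",
        "trn1/inf2 stack (torch-neuronx)": "torch_neuronx",
    }
    found = [label for label, module in stacks.items()
             if importlib.util.find_spec(module) is not None]
    return found if found else ["no Neuron PyTorch package installed"]

if __name__ == "__main__":
    print(detect_neuron_stack())
```

If this reports only the torch-neuron (inf1) stack, that would be consistent with the runtime error, since the sampling tutorial expects torch-neuronx.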
The official release note for Llama-2 Neuron support says the model can be trained and served only on trn1 and inf2 instances; inf1 is not mentioned:
https://aws.amazon.com/about-aws/whats-new/2023/08/aws-neuron-llama2-gpt-neox-sdxl-ai-models/