We were tasked with running inference for Llama-2 models (specifically the 7B and 13B chat models), so we chose an inf1 instance (inf1.6xlarge). During installation we opted for the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04). We followed the steps in the AWS Neuron setup guide to set up the instance for inference, and then followed the Llama-2-13B sampling tutorial. During this process we ran into Neuron runtime errors.
What we have tried so far:
While searching online for a solution, we found that inf1 instances have a separate installation path, so we followed the inf1 AWS Neuron installation guide and completed that setup. However, running the sampling code mentioned above still produced a Neuron error. Please help us run inference on the inf1 instance and clarify the question below.
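One quick way to confirm which Neuron stack is actually installed on the instance is to check for the two PyTorch integration packages. This is a minimal diagnostic sketch (not from the tutorial): inf1 uses the first-generation stack (torch-neuron / neuron-cc), while trn1 and inf2 use the second-generation stack (torch-neuronx / neuronx-cc), and the Llama-2 sampling tutorial's transformers-neuronx library depends on the latter.

```python
import importlib.util

def detect_neuron_stack():
    """Report which AWS Neuron PyTorch integration is importable.

    torch_neuron  -> first-generation SDK, targets inf1 (Inferentia1)
    torch_neuronx -> second-generation SDK, targets trn1/inf2
    """
    stacks = {
        "inf1 stack (torch-neuron)": "torch_neuron",
        "trn1/inf2 stack (torch-neuronx)": "torch_neuronx",
    }
    found = [label for label, module in stacks.items()
             if importlib.util.find_spec(module) is not None]
    return found if found else ["no Neuron PyTorch package installed"]

if __name__ == "__main__":
    print(detect_neuron_stack())
```

If this reports only the torch-neuron (inf1) stack, that would be consistent with the runtime error, since the sampling tutorial expects torch-neuronx.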
The official release note for Llama-2 Neuron support says the model can be trained and served only on trn1 and inf2 instances; inf1 is not mentioned:
https://aws.amazon.com/about-aws/whats-new/2023/08/aws-neuron-llama2-gpt-neox-sdxl-ai-models/