python, pytorch, huggingface-transformers, attention-model

Load Phi-3 model, extract attention layer, and visualize it


I would like to visualize the attention layers of a Phi-3-medium-4k-instruct (or mini) model downloaded from Hugging Face. In particular, I am using the following model and tokenizer:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import pdb

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-meduium-4k-instruct",
    device_map = "auto",
    torch_dtype = "auto",
    trust_remote_code = True
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    return_full_text= False,
    max_new_tokens = 50,
    do_sample = False
)

prompt = "..."
# tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors = "pt").input_ids
input_ids = input_ids.to("cuda:0")
# get the output of the model
model_output = model.model(input_ids)

# extract the attention layer
attention = model_output[-1]

Firstly, I am wondering whether that is the correct way to extract attention from my model. What should I expect from this output, and how can I visualize it properly? Shouldn't I expect an n_tokens x n_tokens matrix?

The attention variable I have extracted has a size of 1x40x40x15x15 (or 1x12x12x15x15 in the case of the mini model), where the first dimension corresponds to the different layers, the second to the different heads, and the final two to the attention matrix. That is my assumption, and I am not sure whether it is correct. When I visualize the attention, I get some very weird matrices like:

[Figure: grid of per-head attention heatmaps for one layer]

What we see in this figure, I assume, is all the heads for one layer. However, most of the heads distribute the attention equally across all the tokens. Does that make sense?

Edit: For the visualization I am doing something like:

# Save attention visualization code
import matplotlib.pyplot as plt

def save_attention_image(attention, tokens, filename='attention.png'):
    """
    Save the attention weights of all heads for one layer as an image.

    :param attention: The attention weights from the model.
    :param tokens: The tokens corresponding to the input.
    :param filename: The filename to save the image.
    """

    attn = attention[0].detach().cpu().float().numpy()
    num_heads = attn.shape[0]
    fig, axes = plt.subplots(3, 4, figsize=(20, 15))  # Adjust the grid size as needed
    
    for i, ax in enumerate(axes.flat):
        if i < num_heads:
            cax = ax.matshow(attn[i], cmap='viridis')
            ax.set_title(f'Head {i + 1}')
            ax.set_xticks(range(len(tokens)))
            ax.set_yticks(range(len(tokens)))
            ax.set_xticklabels(tokens, rotation=90)
            ax.set_yticklabels(tokens)
        else:
            ax.axis('off')
    
    fig.colorbar(cax, ax=axes.ravel().tolist())
    plt.suptitle('Layer 1')
    plt.savefig(filename)
    plt.close()

Solution

  • here is what you need to know: RUNNING COLAB CODE - https://colab.research.google.com/drive/13gP71u_u_Ewx8u7aTwgzSlH0N_k9XBXx?usp=sharing

    you want to see attention weights from your Phi-3 model. first thing: you must tell the model to output attentions. usually you do

    outputs = model(input_ids, output_attentions=True)
    

    then outputs.attentions will be a tuple with one element per layer. each element is a tensor of shape (batch, num_heads, seq_len, seq_len) – that is what you expect, an n_tokens x n_tokens matrix per head.
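
    for example, a quick structural check (a minimal sketch, assuming the model, tokenizer and input_ids from your snippet are already set up):

    with torch.no_grad():
        outputs = model(input_ids, output_attentions=True)

    print(len(outputs.attentions))      # one entry per layer, e.g. 40 for the medium model
    print(outputs.attentions[0].shape)  # (batch_size, num_heads, seq_len, seq_len)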

    what you did using

    model_output = model.model(input_ids)
    attention = model_output[-1]
    

    may or may not be correct – it depends on how that model's forward is coded. better to use the output_attentions flag so you get proper attention weights.

    about the shape you see, e.g. 1x40x40x15x15 (or 1x12x12x15x15) – this likely means (a short sketch after this list shows how to rebuild that combined shape from outputs.attentions):

    • 1 is the batch size,
    • the next dimension is the number of layers (40 for medium, 12 in the shape you report for mini),
    • the next is the number of heads per layer,
    • and the last two are the attention matrix (each head gets a 15x15 attention matrix if you have 15 tokens).
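
    a minimal sketch of that, assuming outputs comes from model(input_ids, output_attentions=True):

    # stack the per-layer tuple into one tensor with the shape reported in
    # the question: (batch, layers, heads, seq_len, seq_len)
    stacked = torch.stack(outputs.attentions, dim=1)
    print(stacked.shape)  # e.g. torch.Size([1, 40, 40, 15, 15]) for 15 input tokens on the medium model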

    if many heads show nearly uniform attention, that can be normal – some heads simply do not focus on any particular token.
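
    if you want to check that numerically rather than by eye, one rough sketch is to look at the entropy of the last row of each head (the last token can attend to every position, so its maximum possible entropy is log(seq_len)); the threshold you use is up to you:

    # sketch: how "uniform" is each head of layer 0?
    attn_layer = outputs.attentions[0][0].float()   # (num_heads, seq_len, seq_len)
    last_row = attn_layer[:, -1, :]                 # (num_heads, seq_len)
    entropy = -(last_row * torch.log(last_row + 1e-12)).sum(-1)
    uniformity = entropy / torch.log(torch.tensor(float(last_row.shape[-1])))
    print(uniformity)  # values close to 1.0 mean the head spreads attention almost evenly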

    for proper visualization, select one layer and one head like:

    attn = outputs.attentions[layer][0, head]  # shape (seq_len, seq_len)
    

    and then use your plotting code (imshow or matshow) to visualize.
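
    or, if you want the all-heads grid from your edit, the same idea works per layer – a rough sketch (grid size computed from the number of heads; it assumes outputs and tokens as in the full example below):

    import math
    import matplotlib.pyplot as plt

    def plot_layer_heads(attentions, tokens, layer=0, filename="attention.png"):
        # attentions: outputs.attentions, tokens: tokenizer.convert_ids_to_tokens(...)
        attn = attentions[layer][0].detach().cpu().float().numpy()  # (num_heads, seq_len, seq_len)
        num_heads = attn.shape[0]
        cols = 4
        rows = math.ceil(num_heads / cols)
        fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 4 * rows))
        for i, ax in enumerate(axes.flat):
            if i < num_heads:
                ax.matshow(attn[i], cmap="viridis")
                ax.set_title(f"Head {i}")
                ax.set_xticks(range(len(tokens)))
                ax.set_yticks(range(len(tokens)))
                ax.set_xticklabels(tokens, rotation=90, fontsize=6)
                ax.set_yticklabels(tokens, fontsize=6)
            else:
                ax.axis("off")
        fig.suptitle(f"Layer {layer}")
        fig.savefig(filename)
        plt.close(fig)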

    so summary: use model(..., output_attentions=True) to get the correct attention weights; then each attention tensor will be (batch, heads, seq_len, seq_len) – that is the matrix you expect. if you see extra dimensions, check whether you are calling the right forward method. and yes, many heads may show a uniform distribution – that can be normal in transformer models.

    hope this helps, and you can put my code in your colab as is.

    note that

    When using Hugging Face Transformers, the recommended approach is to run:

    outputs = model(
        input_ids=inputs,
        output_attentions=True,
        # possibly also output_hidden_states=True if you want hidden states
    )
    

    Then outputs.attentions will be a tuple with one entry per layer, each entry shaped (batch_size, num_heads, seq_len, seq_len).

    If you call model.model(input_ids) directly (as in your code snippet), you might be accessing a lower-level forward function that returns a different structure. Instead, call the top-level model with output_attentions=True. That yields attention shapes more in line with standard Hugging Face conventions.

    Ok so basically: you pass output_attentions=True when calling the model, then read outputs.attentions. That has the standard shape (batch, heads, seq_len, seq_len). Then pick a layer and head to plot. Some heads look uniform, that is normal. Calling model.model(input_ids) directly might not give the standard structure. Safer is:

    # !pip install transformers torch
    
    import torch
    import matplotlib.pyplot as plt
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Load tokenizer and model (make sure you have a valid license for the model)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-medium-4k-instruct",  # note: check spelling if you get error
        device_map="auto",
        torch_dtype=torch.float16,            # or torch.float32 if preferred
        trust_remote_code=True
    )
    
    # Prepare a prompt
    prompt = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = inputs.to("cuda:0")  # send inputs to cuda
    
    # Run the model with attention outputs enabled
    # Make sure to pass output_attentions=True
    outputs = model(input_ids=inputs.input_ids, output_attentions=True)
    
    # outputs.attentions is a tuple with one element per layer
    # Each element is a tensor of shape (batch_size, num_heads, seq_len, seq_len)
    attentions = outputs.attentions
    
    # For example, choose layer 0 and head 0 to visualize
    layer = 0
    head = 0
    attn = attentions[layer][0, head].detach().cpu().numpy()  # shape (seq_len, seq_len)
    
    # Get tokens for labeling the axes
    tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
    
    # Visualize the attention matrix using matplotlib
    plt.figure(figsize=(8,8))
    plt.imshow(attn, cmap="viridis")
    plt.colorbar()
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.title(f"Attention Matrix (Layer {layer}, Head {head})")
    plt.show()
    

    COLAB PROOF OF OUTPUT

    Now you see a nice n_tokens by n_tokens matrix. If the model has 12 layers, you see 12 entries in outputs.attentions; if the “medium” model has 40 layers, you see 40. Each head's matrix is 15×15 if your input is 15 tokens. Some heads do uniform attention, that is normal. That is basically all.
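
    if you do not want to guess from the tensor shapes, you can also compare against the model config (a sketch using the usual Hugging Face config fields):

    print(model.config.num_hidden_layers)    # should match len(outputs.attentions)
    print(model.config.num_attention_heads)  # should match outputs.attentions[0].shape[1]
    print(len(outputs.attentions), outputs.attentions[0].shape)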

    NOTE -

    When you do something like:

    model_output = model.model(input_ids)
    attention = model_output[-1]
    

    You’re relying on how the internal forward method organizes its return. Some models do return (hidden_states, present, attentions, ...) but some do not. It’s safer to rely on the official Hugging Face usage:

    outputs = model(..., output_attentions=True)
    attention = outputs.attentions
    

    That’s guaranteed to be the standard shape.
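
    As a quick illustration of why the flag matters (a sketch, reusing the inputs from the example above): if you do not pass it, the attentions field is simply None, so there is nothing meaningful to index into.

    # without the flag, no attention weights are returned
    outputs_no_attn = model(input_ids=inputs.input_ids)
    print(outputs_no_attn.attentions)        # None

    # with the flag, you get one tensor per layer
    outputs_with_attn = model(input_ids=inputs.input_ids, output_attentions=True)
    print(len(outputs_with_attn.attentions))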