Tags: pandas, numpy, huggingface-transformers, torch

Tensor to DataFrame for each sentence


For a 6-class sentence classification task, I have a list of sentences for which I retrieve the raw values (logits) before the softmax is applied. Example list of sentences:

s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']

I get the values the following way:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

for i in s:
  sentence = tokenizer(i, return_tensors="pt")
  output = model(sentence["input_ids"])
  print(output.logits.detach().numpy())

# prints:
# [[-0.8390876   2.9480567  -0.5134539   0.70386493 -0.5019671  -2.619496  ]]
# [[-0.8847909  -0.9642067  -2.2108874  -0.43932158  4.3386173  -0.37383893]]
# [[-0.48750368  3.2949197   2.1660519  -0.6453249  -1.7101991  -2.817954  ]]

How do I create a data frame with the columns sentence, class_1, class_2, class_3, class_4, class_5, class_6? I could add rows iteratively, appending each new sentence together with its logits, but there may be a more efficient way. What would be the best approach?

Expected output:

     sentence                   class_1       class_2      class_3      ....
0    I like the weather today   -0.8390876     2.9480567   -0.5134539   ....
1    The movie was very scary   -0.8847909    -0.9642067   -2.2108874   ....
2    Love is in the air         -0.48750368    3.2949197    2.1660519   ....
...

If I only had one sentence, I could convert its logits to a data frame like this, but I would still need to add the sentence itself somehow:

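import pandas as pd

sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])

# One row with six unnamed columns (0..5); the sentence text is not included yet
px = pd.DataFrame(output.logits.detach().numpy())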

Maybe creating two separate data frames, one for the sentences and one for the logits, and then concatenating them would be one plausible way of doing this?


Solution

  • I managed to come up with a solution and I am posting it as someone might find it useful.

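    The idea is to collect the logits of every sentence while iterating and to build the data frame once at the end. This avoids DataFrame.append, which was removed in pandas 2.0:

    import pandas as pd

    rows = []

    for i in s:
      sentence = tokenizer(i, return_tensors="pt")
      output = model(sentence["input_ids"])
      # logits has shape (1, 6); keep the single row for this sentence
      rows.append(output.logits.detach().numpy()[0])

    absolute_vals = pd.DataFrame(rows, columns=[f"class_{k}" for k in range(1, 7)])
    # Put the sentence text in the first column
    absolute_vals.insert(0, "sentence", s)

    absolute_vals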
    

    Returns:

         sentence                   class_1       class_2      class_3      ....
    0    I like the weather today   -0.8390876     2.9480567   -0.5134539   ....
    1    The movie was very scary   -0.8847909    -0.9642067   -2.2108874   ....
    2    Love is in the air         -0.48750368    3.2949197    2.1660519   ....
    ...
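
    As a variation on the loop above, the logits for all sentences can be stacked into one 2-D array and converted in a single step. This is only a minimal sketch, not the original poster's code: the helper name logits_to_frame is hypothetical, and the logits are hard-coded from the output printed in the question so that the snippet runs without downloading the model.

```python
import numpy as np
import pandas as pd


def logits_to_frame(sentences, logits):
    """Hypothetical helper: one row per sentence, columns sentence, class_1..class_n."""
    logits = np.asarray(logits)
    if logits.ndim != 2 or logits.shape[0] != len(sentences):
        raise ValueError("expected logits of shape (len(sentences), n_classes)")
    # Name the score columns class_1, class_2, ... from the array width
    columns = [f"class_{k + 1}" for k in range(logits.shape[1])]
    df = pd.DataFrame(logits, columns=columns)
    # Put the sentence text in the first column
    df.insert(0, "sentence", list(sentences))
    return df


s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']

# Values copied from the printed output in the question (stand-in for model output)
logits = np.array([
    [-0.8390876, 2.9480567, -0.5134539, 0.70386493, -0.5019671, -2.619496],
    [-0.8847909, -0.9642067, -2.2108874, -0.43932158, 4.3386173, -0.37383893],
    [-0.48750368, 3.2949197, 2.1660519, -0.6453249, -1.7101991, -2.817954],
])

df = logits_to_frame(s, logits)
print(df)
```

    With the actual model, the logits array could presumably come from one batched call such as model(**tokenizer(s, padding=True, return_tensors="pt")).logits.detach().numpy(). That is an assumption about the setup rather than something verified here, and values from a padded batch may differ slightly from the per-sentence ones.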