Tags: python, pandas, nlp, sentence-transformers

How to add sentence embeddings derived from an existing column into a new column?


I have a dataframe nw_data with four columns: ['Qn_id', 'Qn_context', 'Qns', 'Answers']. This is what it looks like:

Qn_id  |     Qn_context       |   Qns        |     Answers
 01    | In 1962, Uk gave...  | what year....| the year 1962 was.....
 02    | Major kanuti raised..| Who raised...| Kanuti akorimo rasied.

I want to add a fifth column to that dataframe containing the sentence embeddings of the 'Answers' column.

I am using sentence_transformers to generate the sentence embeddings.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

I tried using an approach where:

#Created a var for the column
sent = nw_data['Answers']

and

#Passed the variable sent into the model and created the embeddings
embeddings = model.encode(sent)

then

#Tried passing the embeddings into a new column named Embeddings
nw_data['Embeddings'] = embeddings

I get an error:

KeyError: 'Embeddings'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
KeyError: 'Embeddings'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/blocks.py in check_ndim(values, placement, ndim)
   1978         if len(placement) != len(values):
   1979             raise ValueError(
-> 1980                 f"Wrong number of items passed {len(values)}, "
   1981                 f"placement implies {len(placement)}"
   1982             )

ValueError: Wrong number of items passed 384, placement implies 1

How can I create these embeddings and add them to a new column in the same dataframe nw_data?

Is this possible at all? I was advised to try the .apply() method or lambda functions, but I am not sure how or when to use them.


Solution

  • If I understand correctly, you'd like to insert a list (embedding) into a cell.

    Try using DataFrame.at to set a single cell:

    >>> import pandas as pd
    >>> from sentence_transformers import SentenceTransformer
    >>> model = SentenceTransformer('all-MiniLM-L6-v2')
    >>> sentences = 'Absence of sanity'
    >>> embedding = model.encode(sentences)
    >>> df = pd.DataFrame({'foo': [1, 2], 'Embedding': None})
    >>> df.at[0, 'Embedding'] = embedding.tolist()
    >>> df.dtypes
    foo           int64
    Embedding    object
    dtype: object
    >>> df.head()
       foo                                          Embedding
    0    1  [0.2954030930995941, 0.29181134700775146, 2.16...
    1    2                                               None
    

    If you have multiple sentences, just pass the list:

    >>> import pandas as pd
    >>> sentences = ['Absence of sanity', 'its a new day', 'make the best of it']
    >>> embeddings = model.encode(sentences)
    >>> df = pd.DataFrame({'foo': [1, 2, 3], 'Embedding': None})
    >>> df['Embedding'] = embeddings.tolist()
    >>> print(df.head())
       foo                                          Embedding
    0    1  [0.29540303349494934, 0.29181137681007385, 2.1...
    1    2  [0.0362740121781826, -0.8035800457000732, 2.44...
    2    3  [-0.4539063572883606, -0.4333038330078125, 2.2...
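
Applied to the dataframe from the question, the same idea can be sketched as follows. This is a minimal sketch, not a tested recipe for the real model: fake_encode is a hypothetical stand-in for model.encode so the snippet runs without downloading a model, and it assumes model.encode returns a 2-D NumPy array with one 384-dimensional row per sentence (which is what the "384" in the question's error message suggests). The second assignment shows the .apply() route mentioned in the question.

```python
import numpy as np
import pandas as pd


def fake_encode(sentences):
    """Hypothetical stand-in for model.encode: one 384-dim vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.random((len(sentences), 384), dtype=np.float32)


# Small frame shaped like the one in the question (sample text is illustrative).
nw_data = pd.DataFrame({
    "Qn_id": ["01", "02"],
    "Answers": ["the year 1962 was", "Kanuti akorimo raised"],
})

# Encode the whole column in one call; the result is 2-D: (rows, 384).
embeddings = fake_encode(nw_data["Answers"].tolist())

# Assigning the 2-D array directly is what fails in the question.
# Converting it to a list of per-row lists puts one vector in each cell.
nw_data["Embeddings"] = embeddings.tolist()

# Alternative with .apply(): encode one row at a time.
nw_data["Embeddings_apply"] = nw_data["Answers"].apply(
    lambda s: fake_encode([s])[0].tolist()
)
```

With the real model, replace fake_encode(...) with model.encode(...). Encoding the whole column in one call is usually faster than .apply(), since the model can process the sentences in batches.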