Subset list of text and matching embeddings using set() and np.unique() gives different length results

I have a list of strings and an array of dense embeddings generated from them with SentenceTransformer("all-mpnet-base-V2"). A number of the strings are repeated, and I want to run my analysis on unique values only. Because generating embeddings is expensive, I want to keep the unique strings and pick out their existing embeddings rather than re-encoding them. I tried this with set() and np.unique(), but the results have different lengths. Why is this?

I haven't posted a reproducible example because my array of vectors is large. Can anyone explain what might be happening and why the lengths don't match? They are close but not the same.

# Basic structure of my data:
titles = ["some text", "other text", ...]
embeddings = [Array1, Array2, ...]

# Get unique items of the list
unique_titles = list(set(titles))
unique_embeddings = np.unique(embeddings, axis=0)
    
len(unique_titles) == len(unique_embeddings)
False

I can get around this all with the following for loop:

titles_unique = []
embeddings_unique = np.array([0 for i in range(embeddings.shape[1])])
for t, e in zip(titles, embeddings):
    if t not in titles_unique:
        titles_unique.append(t)
        embeddings_unique = np.vstack([embeddings_unique, e])

#Get rid of the first row of the array, used to create the correct number of dimensions
embeddings_unique = np.delete(embeddings_unique, (0), axis = 0)

But this is slow and I will have to do this for a much larger data set shortly.

The fact that set() and np.unique() don't give the same results makes me think I am losing information somewhere.

Solution

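set() and np.unique(embeddings, axis=0) deduplicate independently: one compares strings, the other compares rows of floats exactly. The counts only agree if every repeated string has a bit-identical embedding and no two distinct strings share one, which is not guaranteed. Neither result keeps the original order either (set() is unordered, np.unique sorts), so even with equal lengths the titles and embeddings would not line up.

Instead, deduplicate the titles only and reuse the resulting indices to select the embeddings. I think you are looking for the np.unique(..., return_index=True) option. See the following example: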

import numpy as np

titles = ["some text", "other text", "some text"]
embeddings = [np.asarray([0, 1, 0]), np.asarray([1, 0, 0]), np.asarray([0, 1, 0])]

unique_titles, idx = np.unique(titles, return_index=True)
unique_embeddings = np.array(embeddings)[idx]
    
len(unique_titles) == len(unique_embeddings)

Which returns True.
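Note that np.unique returns the unique titles in sorted order, with idx pointing at the first occurrence of each. If the original order matters, sorting idx first should restore it. A minimal sketch, assuming the embeddings can be stacked into a 2-D array (the fourth title and its embedding are made up for illustration):

```python
import numpy as np

# Toy data; "a third text" and its row are illustrative only
titles = ["some text", "other text", "some text", "a third text"]
embeddings = np.array([[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])

# idx holds the position of the first occurrence of each unique title,
# ordered by the sorted titles; sorting idx restores the input order
_, idx = np.unique(titles, return_index=True)
idx = np.sort(idx)

unique_titles = [titles[i] for i in idx]
unique_embeddings = embeddings[idx]

print(unique_titles)
print(unique_embeddings)
```

This stays vectorized, so it should scale far better than appending to a list and calling np.vstack inside a loop.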

I hope this helps.
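On the original question of why the lengths differ, one way to investigate is to check whether repeated titles really carry bit-identical embeddings. The sketch below uses made-up toy data in which the repeated title has a slightly different vector; the names reference and mismatch are my own, and this only shows how to look for the discrepancy, not what causes it in your data:

```python
import numpy as np

# Hypothetical toy data: the third row repeats "some text" but its
# embedding differs slightly from the first row
titles = ["some text", "other text", "some text"]
embeddings = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 1e-7]])

unique_titles, first_idx, inverse = np.unique(
    titles, return_index=True, return_inverse=True
)

# For every row, the embedding stored at the first occurrence of its title
reference = embeddings[first_idx[inverse]]

# True where a row is not exactly equal to that first occurrence
mismatch = np.any(embeddings != reference, axis=1)

n_unique_rows = len(np.unique(embeddings, axis=0))

print(len(unique_titles), n_unique_rows)
print(np.flatnonzero(mismatch))
```

If mismatch has any True entries, equal strings do not always map to exactly equal rows, which would explain np.unique(embeddings, axis=0) returning more rows than set(titles). If there are none and the row count is lower than the title count, some distinct titles share an embedding.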