python numpy sentence-transformers

Subset list of text and matching embeddings using set() and np.unique() gives different length results


I have a list of strings and an array containing the dense embeddings generated from those strings (using SentenceTransformer("all-mpnet-base-V2")). During my analysis I noticed that a number of the strings are repeated. I want to run my analysis on unique values only. Because generating embeddings is expensive, I want to subset my unique strings and pull out the matching embeddings rather than re-encode them. I tried this using set() and np.unique(), but the results have different lengths. Why is this?

I haven't posted a reproducible example because my array of vectors is large. Can anyone explain what might be happening and why these lengths don't match? They are close, but not the same.

# Basic structure of my data:
titles = ["some text", "other text", ...]
embeddings = [Array1, Array2, ...]

# Get the unique items of each
unique_titles = list(set(titles))
unique_embeddings = np.unique(embeddings, axis=0)

len(unique_titles) == len(unique_embeddings)
False

I can work around all of this with the following for loop:

titles_unique = []
# Placeholder first row so that np.vstack has the correct number of columns
embeddings_unique = np.zeros(embeddings.shape[1])
for t, e in zip(titles, embeddings):
    # Keep only the first occurrence of each title and its embedding
    if t not in titles_unique:
        titles_unique.append(t)
        embeddings_unique = np.vstack([embeddings_unique, e])

# Remove the placeholder first row
embeddings_unique = np.delete(embeddings_unique, 0, axis=0)

But this is slow and I will have to do this for a much larger data set shortly.

The fact that set() and np.unique() don't give the same results makes me think I am losing information somewhere.


Solution

  • I think you are looking for the np.unique(..., return_index=True) option. See the following example:

    import numpy as np
    
    titles = ["some text", "other text", "some text"]
    embeddings = [np.asarray([0, 1, 0]), np.asarray([1, 0, 0]), np.asarray([0, 1, 0])]
    
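    # idx holds the index of the first occurrence of each unique title
    unique_titles, idx = np.unique(titles, return_index=True)
    # Reuse the same indices to select the matching embeddings
    unique_embeddings = np.array(embeddings)[idx]

    len(unique_titles) == len(unique_embeddings)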
    

    Which returns True.
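
Note that np.unique returns the unique titles in sorted order, and the returned indices point at the first occurrence of each one. If the original order matters, one option is to sort those indices before using them. A minimal sketch, with made-up titles and vectors purely for illustration:

```python
import numpy as np

# Made-up data for illustration only
titles = ["some text", "other text", "some text", "another text"]
embeddings = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

# Index of the first occurrence of each unique title
_, first_idx = np.unique(titles, return_index=True)

# np.unique sorts its output; sorting the indices restores the original order
keep = np.sort(first_idx)

unique_titles = [titles[i] for i in keep]
unique_embeddings = embeddings[keep]
```

Because both results are built from the same index array, the titles and embeddings stay aligned row by row.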

    I hope this helps.
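
As for why the two counts can differ: set(titles) compares strings, while np.unique(embeddings, axis=0) compares rows of floats for exact equality, so the two need not agree. For instance, the same string may come back with very slightly different vectors if it was encoded in different batches, and two different strings could in principle end up with identical vectors (for example, if they differ only beyond the model's truncation length). I can't tell which of these applies to your data. The toy example below, using hypothetical vectors, only illustrates the first case:

```python
import numpy as np

titles = ["some text", "some text", "other text"]

# Hypothetical vectors: rows 0 and 1 belong to the same title
# but differ by a tiny amount in the last component
embeddings = np.array([[0.1, 0.2, 0.3],
                       [0.1, 0.2, 0.3 + 1e-7],
                       [0.9, 0.8, 0.7]])

n_titles = len(set(titles))                   # counts distinct strings
n_rows = len(np.unique(embeddings, axis=0))   # counts exactly distinct rows

print(n_titles, n_rows)
```

Here the two counts disagree even though nothing was lost, which is why deduplicating on the titles and reusing their indices is the safer route.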