I have a list of strings and an array containing the dense embeddings generated from those strings (using SentenceTransformer("all-mpnet-base-V2")). In my analysis I noticed that a number of the strings are repeated text, and I want to run my analysis on unique values only. Because generating embeddings is expensive, I want to subset my unique strings and pull out the matching embeddings rather than re-encode them. I tried this using set() and np.unique(), but the results have different lengths. Why is this?
I haven't posted a reproducible example because my array of vectors is large. But can anyone explain what might be happening and why these lengths wouldn't match? They are close but not the same.
#Basic structure of my data:
titles = ["some text", "other text", ...]
embeddings = [Array1, Array2, ...]
#Get unique items of the list
unique_titles = list(set(titles))
unique_embeddings = np.unique(embeddings, axis = 0)
len(unique_titles) == len(unique_embeddings)
False
I can get around this all with the following for loop:
titles_unique = []
embeddings_unique = np.array([0 for i in range(embeddings.shape[1])])
for t, e in zip(titles, embeddings):
    if t not in titles_unique:
        titles_unique.append(t)
        embeddings_unique = np.vstack([embeddings_unique, e])
#Get rid of the first row of the array, used to create the correct number of dimensions
embeddings_unique = np.delete(embeddings_unique, (0), axis = 0)
But this is slow and I will have to do this for a much larger data set shortly.
The fact that set() and np.unique() don't give the same results makes me think I am losing information somewhere.
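On the "why" part, one possible explanation (a guess, since the real data isn't shown): set() deduplicates on the title text, while np.unique(..., axis=0) deduplicates on exact row equality of the vectors. Those two criteria don't have to agree. The same title can end up with vectors that differ in the last few decimal places, and two different titles can end up with exactly the same vector. The sketch below uses made-up toy data (the titles and numbers are purely illustrative) to show how you could check for both cases on your own arrays:

```python
import numpy as np
from collections import defaultdict

# Toy stand-ins for the real data (hypothetical values, for illustration only)
titles = ["a", "b", "a", "c", "d"]
embeddings = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.1, 0.2 + 1e-7],  # same title as row 0, but not bit-identical
    [0.3, 0.4],         # different title, exactly the same vector as row 1
    [0.3, 0.4],         # another different title with that same vector
])

# Case 1: titles whose rows are not all exactly equal
rows_by_title = defaultdict(list)
for i, t in enumerate(titles):
    rows_by_title[t].append(i)
noisy_titles = [
    t for t, rows in rows_by_title.items()
    if len(np.unique(embeddings[rows], axis=0)) > 1
]

# Case 2: exact vectors that are shared by more than one distinct title
titles_by_vector = defaultdict(set)
for t, e in zip(titles, embeddings):
    titles_by_vector[e.tobytes()].add(t)
shared_vectors = [sorted(ts) for ts in titles_by_vector.values() if len(ts) > 1]

print(len(set(titles)), len(np.unique(embeddings, axis=0)))
print(noisy_titles)
print(shared_vectors)
```

If either list comes back non-empty on the real data, that would account for the length mismatch without any information actually being lost.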
I think you are looking for the np.unique(..., return_index=True) option. See the following example:
import numpy as np
titles = ["some text", "other text", "some text"]
embeddings = [np.asarray([0, 1, 0]), np.asarray([1, 0, 0]), np.asarray([0, 1, 0])]
unique_titles, idx = np.unique(titles, return_index=True)
unique_embeddings = np.array(embeddings)[idx]
len(unique_titles) == len(unique_embeddings)
Which returns True.
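One thing to be aware of: np.unique returns the unique titles in sorted order, and with return_index=True the indices point at the first occurrence of each one. If you would rather keep the titles in the order they first appear in your list, a small variation is to sort the indices before using them. This is only a sketch on toy data, and the variable names (first_idx, keep) are my own:

```python
import numpy as np

titles = ["some text", "other text", "some text", "another text"]
embeddings = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Index of the first occurrence of each unique title (in sorted-title order)
_, first_idx = np.unique(titles, return_index=True)

# Sorting the indices restores the original order of appearance
keep = np.sort(first_idx)

unique_titles = [titles[i] for i in keep]
unique_embeddings = embeddings[keep]

print(unique_titles)
print(unique_embeddings.shape)
```

Because both results are selected with the same index array, the titles and embeddings stay aligned row for row.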
I hope this helps.