I have a model from Hugging Face and would like to use it for word comparisons. At first I thought of computing pairwise similarities between the words of interest, but I quickly found that the number of comparisons grows quadratically as the number of words increases.
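For scale, n words give n(n-1)/2 distinct pairs, so the growth is quadratic rather than exponential, and all pairwise cosine similarities can be computed in a single matrix product. A minimal sketch, assuming the embeddings are already stacked in a NumPy array (the random vectors below are stand-ins, not real model output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real embeddings: 4 hypothetical words, 768 dimensions each.
embeddings = rng.normal(size=(4, 768))

# Normalize each row to unit length so a dot product equals cosine similarity.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# One matrix product yields every pairwise similarity at once.
similarity = unit @ unit.T

n = len(embeddings)
n_pairs = n * (n - 1) // 2  # number of distinct unordered word pairs
print(similarity.shape, n_pairs)
```

The full matrix holds each pair twice plus the diagonal, which is usually acceptable for a few thousand words.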
A solution I thought about is to build something like a skip-gram plot, where every word lands on a 2-dimensional plane, and then simply cluster the coordinates to find similar words. The problem is that this seems to require a BERT model and a low-dimensional embedding layer that can be mapped to coordinates.
As I have a pretrained model, I don't know whether I can create a skip-gram from it. I was hoping to compute the embedding and then, through some transformation, convert it into coordinates that I can plot myself. However, I do not know whether this is possible or reasonable.
I tried it anyway with the code below:
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer
# target word
word = ["Slartibartfast"]
# model setup
model = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model)
auto_model = AutoModel.from_pretrained(model, trust_remote_code=True)
# embed and calculate
batch_dict = tokenizer(word, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = auto_model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]  # first-token embedding per input, shape (n_words, hidden_size)
# transform to coordinates
clayer = TSNE(n_components=3, learning_rate='auto', init='random', perplexity=50)
embedding_numpy = embeddings.detach().numpy()
clayer.fit_transform(embedding_numpy) # crashes here saying perplexity must be less than n_samples
After more thorough reading, it was brought to my attention that TSNE cannot be used in the way I was hoping, because the coordinates it produces are only meaningful for the data it was fitted on. scikit-learn's TSNE has no transform method for new data, and fitting again on different data produces coordinates that are not on a comparable scale to the earlier ones.
I found a replacement for TSNE called UMAP. UMAP is also a dimensionality-reduction method, but once fitted it can transform new data into the same coordinate space.
I will explore UMAP and see whether it works for what I need.