Tags: scikit-learn, feature-extraction

Optimizing feature extraction with sklearn tfidf


I'm working on a Python project that mimics https://www.mtgassist.com/. For those not familiar: Magic is a trading card game whose collectible cards can be very expensive. The project should take the name of a card and list other cards with similar mechanics (and hopefully lower prices) based on several features, including the "oracle_text", which describes what the card does.

  1. I'm using sklearn's TfidfVectorizer with the following parameters:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_stemmer(text: str) -> list[str]:
    stop = stopwords.words('english')
    # stem every whitespace-separated token that is not a stop word
    return [porter.stem(word) for word in text.split() if word not in stop]

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    tokenizer=tokenizer_stemmer,
    # stop words are already filtered inside tokenizer_stemmer, so this is largely redundant
    stop_words=stopwords.words('english'),
)
  2. Then I call TfidfVectorizer.fit_transform on a column of a pandas DataFrame with ~20k rows that I previously loaded. This step takes about 25 seconds:

token_mat = tfidf.fit_transform(df_not_na['oracle_text'])

  3. Next, I convert token_mat into a numpy array (token_arr) with shape ~(20_000, 90_000) and calculate the Euclidean distance between the chosen card and every card in the array (this takes another 25 seconds). Finally, I print the names of the 5 "closest" cards:
import numpy as np
from tqdm import tqdm

token_arr = token_mat.toarray()

# Euclidean distance from the chosen card to every card, one card at a time
distances = []
for _card in tqdm(token_arr):
    distances.append(np.linalg.norm(_card - chosen_card_array))

nearest_5 = np.argpartition(distances, 5)[:5]
print(df_not_na.iloc[nearest_5][['name', 'oracle_text']])
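
For reference, the loop above can be collapsed into a single broadcasted NumPy call, which avoids the Python-level iteration; this assumes chosen_card_array is the dense TF-IDF vector of the chosen card:

# vectorized equivalent of the loop: all ~20k distances in one call
distances = np.linalg.norm(token_arr - chosen_card_array, axis=1)
nearest_5 = np.argpartition(distances, 5)[:5]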

My goal is to optimize this process and reduce the time spent creating the feature vectors and calculating the distances.

I tried using just bigrams instead of ngram_range=(1,2), but it made very little difference.

I also thought of using numba, but I read that sklearn/numpy already rely on compiled, vectorized routines internally, so it would not help much here.

Let me know of other suggestions as well! Thanks


Solution

  • I see two sources of inefficiency:

    1. TF-IDF produces very high-dimensional representations. Even with an n-gram length of 1, the dimensionality equals the size of your vocabulary.
    2. You are computing the distance to each of the ~20k cards, which is a simple but inefficient O(n) brute-force approach.

    Ways to improve this:

    1. Consider using word embeddings. For example, take a pretrained Word2Vec or GloVe model and average the embeddings of all words in a sentence to get a vector representation for that sentence. You could also use BERT to get sentence embeddings directly, although that might be overkill for your needs. If you want to stick with TF-IDF, you can apply dimensionality reduction with PCA, t-SNE, or UMAP to get shorter representations (see the first sketch after this list).
    2. To look up the closest k neighbours, especially if you have to do this repeatedly for many different query cards, consider an approximate nearest-neighbour library such as hnswlib: you index your card vectors ahead of time, and it can then return the top 10 closest vectors quickly using the HNSW algorithm (see the second sketch after this list).
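
Below is a minimal sketch of the dimensionality-reduction route, assuming you keep the existing tfidf/token_mat pipeline. It uses scikit-learn's TruncatedSVD, which works directly on the sparse TF-IDF matrix (unlike plain PCA) and stands in for the PCA/t-SNE/UMAP options above. n_components=300 and the chosen_idx row lookup are illustrative assumptions, not values from the question:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Reduce the ~90k-dimensional sparse TF-IDF matrix to a dense, low-dimensional one.
# n_components=300 is an arbitrary example value; tune it for your data.
svd = TruncatedSVD(n_components=300, random_state=42)
token_reduced = svd.fit_transform(token_mat)   # shape ~(20_000, 300)

# chosen_idx is assumed to be the row position of the chosen card in df_not_na.
chosen_vec = token_reduced[chosen_idx]

# One vectorized distance computation over all cards in the reduced space.
distances = np.linalg.norm(token_reduced - chosen_vec, axis=1)
nearest_5 = np.argpartition(distances, 5)[:5]
print(df_not_na.iloc[nearest_5][['name', 'oracle_text']])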
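
And a minimal sketch of the HNSW route with the hnswlib package, indexing the reduced vectors from the sketch above (indexing the raw TF-IDF vectors would also work, just with more memory and slower index construction). The neighbours it returns are approximate, and the ef/M parameters shown are common starting points rather than tuned values:

import hnswlib
import numpy as np

num_cards, dim = token_reduced.shape

# Build the index once, ahead of time.
index = hnswlib.Index(space='l2', dim=dim)            # (squared) Euclidean distance
index.init_index(max_elements=num_cards, ef_construction=200, M=16)
index.add_items(token_reduced, np.arange(num_cards))
index.set_ef(50)                                      # query-time speed/recall trade-off

# Query: the 10 approximate nearest neighbours of the chosen card.
labels, dists = index.knn_query(token_reduced[chosen_idx], k=10)
print(df_not_na.iloc[labels[0]][['name', 'oracle_text']])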