I'm working on a Python project that mimics https://www.mtgassist.com/. For those not too familiar: Magic is a trading card game whose collectible cards can be very expensive. The project should take the name of a card and list other cards that have similar mechanics (and are hopefully cheaper) based on several features, including the "oracle_text", which describes what the card does. To compare the oracle_text I build a TF-IDF matrix over all cards, using NLTK for stemming and stop words:
```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_stemmer(text: str) -> list[str]:
    # Split on whitespace, drop stop words, stem what remains
    stop = stopwords.words('english')
    return [porter.stem(word) for word in text.split() if word not in stop]

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    tokenizer=tokenizer_stemmer,
    stop_words=stopwords.words('english'),
)
token_mat = tfidf.fit_transform(df_not_na['oracle_text'])
```
I then put token_mat in a numpy array (token_arr) with shape ~(20_000, 90_000) and calculate the euclidean distance between the chosen card and all cards in the array (this takes an additional 25 seconds). Finally, I print the names of the top 5 "closest" cards:

```python
import numpy as np
from tqdm import tqdm

token_arr = token_mat.toarray()

distances = []
for _card in tqdm(token_arr):
    distances.append(np.linalg.norm(_card - chosen_card_array))

# Partial sort: indices of the 5 smallest distances
nearest_5 = np.argpartition(distances, 5)[:5]
print(df_not_na.iloc[nearest_5][['name', 'oracle_text']])
```
My goal is to optimize this process and reduce the time spent building the feature vectors and computing the distances.
I tried using only bigrams instead of ngram_range=(1,2), but it made very little difference.
I also considered numba, but I read that sklearn/numpy already rely on vectorized, compiled routines, so it would not help much.
Let me know of other suggestions as well! Thanks
I see two sources of inefficiency:

1. `token_mat.toarray()` turns a sparse TF-IDF matrix (which is almost entirely zeros) into a dense array; at ~(20_000, 90_000) in float64 that is on the order of 14 GB, so a lot of time and memory goes into copying zeros.
2. The distances are computed one row at a time in a Python loop. Each of the 20,000 iterations pays Python-level overhead for the subtraction and the `np.linalg.norm` call, instead of doing one vectorized pass (see the sketch below).
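For comparison, here is a minimal sketch of the vectorized equivalent of that loop, assuming `token_arr` and `chosen_card_array` are the same objects as in your question. It removes the per-row Python overhead, but it still needs the dense array (plus a temporary of the same size), so the sparse approach further down is the better fix:

```python
import numpy as np

# One broadcasted subtraction and one norm along axis 1 replace the 20k-iteration loop
distances = np.linalg.norm(token_arr - chosen_card_array, axis=1)
nearest_5 = np.argpartition(distances, 5)[:5]
```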
Ways to improve this:

- Don't convert to a dense array at all: scikit-learn's distance utilities and nearest-neighbour search work directly on the sparse `token_mat`, so the expensive `.toarray()` call can be dropped.
- Replace the Python loop with a single vectorized call, e.g. `sklearn.metrics.pairwise.euclidean_distances` or `sklearn.neighbors.NearestNeighbors` (sketch below).
- For the vectorization step, note that `tokenizer_stemmer` rebuilds the stop-word list on every document and then checks membership against a list; building the stop words once outside the function and storing them in a `set` should shave time off `fit_transform`.
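A minimal sketch of the sparse, vectorized lookup, reusing the names from your question; `chosen_idx` (the row index of the chosen card in `df_not_na`) is a hypothetical variable standing in for however you currently locate the card:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Works directly on the sparse TF-IDF matrix; no .toarray() needed
chosen_vec = token_mat[chosen_idx]            # 1 x n_features sparse row
distances = euclidean_distances(token_mat, chosen_vec).ravel()

# Indices of the 5 smallest distances (the chosen card itself will be among them)
nearest_5 = np.argpartition(distances, 5)[:5]
print(df_not_na.iloc[nearest_5][['name', 'oracle_text']])
```

`sklearn.neighbors.NearestNeighbors(n_neighbors=5).fit(token_mat)` does the same search and also accepts sparse input; its `kneighbors` method returns distances and indices in one call, which is convenient if you run many queries against the same matrix.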