Tags: python, nlp, sparse-matrix, tf-idf, tfidfvectorizer

How can I group words to reduce vocabulary in Python's tf-idf vectorizer?


I want to reduce the size of the sparse matrix that the tf-idf vectorizer outputs, since I am using it with cosine similarity and it takes a long time to go through each vector. I have about 44,000 sentences, so the vocabulary is also very large.

I was wondering if there is a way to combine a group of words so that they count as a single word; for example, teal, navy, and turquoise would all mean blue and share the same tf-idf value.

I am dealing with a dataset of clothing items, so things like colours, and similar clothing articles such as shirt, t-shirt, and sweatshirt, are what I want to group.

I know I can use stop words to remove certain words entirely, but is it possible to group words so that they share the same value?

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset and keep only the product display name column
dataset_2 = "/dataset_files/styles_2.csv"
df = pd.read_csv(dataset_2)
df = df.drop(['gender', 'masterCategory', 'subCategory', 'articleType',
              'baseColour', 'season', 'year', 'usage'], axis=1)

# Vectorize the product names and compute pairwise cosine similarity
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['ProductDisplayName'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


Solution

  • Unfortunately, we can't use the optional vocabulary argument to TfidfVectorizer to signal synonyms; I tried and got ValueError: Vocabulary contains repeated indices.
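
    As a minimal illustration of that failure, here is a toy repro (the vocabulary below is made up for the demonstration):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Attempt to point several synonyms at one shared column index
    shared_vocab = {'blue': 0, 'navy': 0, 'teal': 0, 'cat': 1}
    vectorizer = TfidfVectorizer(vocabulary=shared_vocab)
    # Raises ValueError: Vocabulary contains repeated indices.
    vectorizer.fit(['the navy cat', 'the teal cat'])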

    Instead, you can run the TfidfVectorizer once, then manually merge the columns that correspond to synonyms:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    ## DATA
    corpus = ['The grey cat eats the navy mouse.',
              'The ashen cat drives the red car.',
              'There is a mouse on the brown banquette of the crimson car.',
              'The teal car drove over the poor cat and tarnished its beautiful silver fur with scarlet blood.',
              'I bought a turquoise sapphire shaped like a cat and mounted on a rose gold ring.',
              'Mice and cats alike are drowning in the deep blue sea.']
    synonym_groups = [['grey', 'gray', 'ashen', 'silver'],
                      ['red', 'crimson', 'rose', 'scarlet'],
                      ['blue', 'navy', 'sapphire', 'teal', 'turquoise']]
    
    ## VECTORIZING A FIRST TIME TO GET vectorizer.vocabulary_
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(corpus)
    
    ## MERGING SYNONYM COLUMNS
    vocab = vectorizer.vocabulary_
    # The first word of each group is the representative the others fold into
    synonym_representants = {group[0] for group in synonym_groups}
    redundant_synonyms = {word: group[0] for group in synonym_groups for word in group[1:]}
    syns_dict = {group[0]: group for group in synonym_groups}
    # More robust alternative if a group's first word is absent from the vocabulary:
    # syns_dict = {next(word for word in group if word in vocab): group for group in synonym_groups}
    
    # Indices of the columns to keep (everything except the redundant synonyms)
    nonredundant_columns = sorted(v for k, v in vocab.items() if k not in redundant_synonyms)
    
    # Sum each group's tf-idf weights into its representative's column
    for rep in synonym_representants:
        X[:, vocab[rep]] = X[:, [vocab[syn] for syn in syns_dict[rep] if syn in vocab]].sum(axis=1)
    
    # Drop the redundant columns and rebuild the matching word list
    Y = X[:, nonredundant_columns]
    new_vocab = [w for w in sorted(vocab, key=vocab.get) if w not in redundant_synonyms]
    
    ## COSINE SIMILARITY
    cos_sim = cosine_similarity(Y, Y)
    
    ## RESULTS
    print(' ', ''.join('{:11.11}'.format(word) for word in new_vocab))
    print(Y.toarray())
    print()
    print('Cosine similarity')
    print(cos_sim)
    

    Output:

      alike      banquette  beautiful  blood      blue       bought     brown      car        cat        cats       deep       drives     drove      drowning   eats       fur        gold       grey       like       mice       mounted    mouse      poor       red        ring       sea        shaped     tarnished 
    [[0.         0.         0.         0.         0.49848319 0.         0.         0.         0.29572971 0.         0.         0.         0.         0.         0.49848319 0.         0.         0.49848319 0.         0.         0.         0.40876335 0.         0.         0.         0.         0.         0.        ]
     [0.         0.         0.         0.         0.         0.         0.         0.35369727 0.30309169 0.         0.         0.51089257 0.         0.         0.         0.         0.         0.51089257 0.         0.         0.         0.         0.         0.51089257 0.         0.         0.         0.        ]
     [0.         0.490779   0.         0.         0.         0.         0.490779   0.3397724  0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.4024458  0.         0.490779   0.         0.         0.         0.        ]
     [0.         0.         0.31893014 0.31893014 0.31893014 0.         0.         0.2207993  0.18920822 0.         0.         0.         0.31893014 0.         0.         0.31893014 0.         0.31893014 0.         0.         0.         0.         0.31893014 0.31893014 0.         0.         0.         0.31893014]
     [0.         0.         0.         0.         0.65400152 0.32700076 0.         0.         0.19399619 0.         0.         0.         0.         0.         0.         0.         0.32700076 0.         0.32700076 0.         0.32700076 0.         0.         0.32700076 0.32700076 0.         0.32700076 0.        ]
     [0.37796447 0.         0.         0.         0.37796447 0.         0.         0.         0.         0.37796447 0.37796447 0.         0.         0.37796447 0.         0.         0.         0.         0.         0.37796447 0.         0.         0.         0.         0.         0.37796447 0.         0.        ]]
    
    Cosine similarity
    [[1.         0.34430458 0.16450509 0.37391712 0.3479721  0.18840894]
     [0.34430458 1.         0.37091192 0.46132163 0.20500145 0.        ]
     [0.16450509 0.37091192 1.         0.23154573 0.14566346 0.        ]
     [0.37391712 0.46132163 0.23154573 1.         0.3172916  0.12054426]
     [0.3479721  0.20500145 0.14566346 0.3172916  1.         0.2243601 ]
     [0.18840894 0.         0.         0.12054426 0.2243601  1.        ]]
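
    Alternatively, if you prefer to avoid the post-hoc column merging, you can canonicalize the synonyms in the raw text before vectorizing, via the preprocessor argument of TfidfVectorizer. Below is a sketch of that idea, reusing corpus and synonym_groups from above (the canonicalize helper and its regex are illustrative, not a scikit-learn API). Note that the idf values will differ slightly from the column-merging approach, since each synonym group is counted as a single term from the start.

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Map every redundant synonym to its group's representative
    replacements = {word: group[0] for group in synonym_groups for word in group[1:]}
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b')

    def canonicalize(text):
        # Custom preprocessor: lowercase manually (a custom preprocessor
        # replaces the default lowercasing), then rewrite each synonym
        return pattern.sub(lambda m: replacements[m.group(1)], text.lower())

    vectorizer = TfidfVectorizer(stop_words='english', preprocessor=canonicalize)
    Z = vectorizer.fit_transform(corpus)
    cos_sim = cosine_similarity(Z, Z)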