Tags: python, nlp, sparse-matrix, tf-idf, tfidfvectorizer

How can I group words to reduce vocabulary in Python's tf-idf vectorizer?


I want to reduce the size of the sparse matrix that the tf-idf vectorizer outputs, since I am using it with cosine similarity and it takes a long time to go through each vector. I have about 44,000 sentences, so the vocabulary is also very large.

I was wondering if there is a way to combine a group of words so that they count as a single word; for example, teal, navy, and turquoise would all mean blue and share the same tf-idf value.

I am dealing with a dataset of clothing items, so things like colours, and similar clothing articles such as shirt, t-shirt, and sweatshirt, are what I want to group.

I know I can use stop words to remove certain words entirely, but is it possible to group words so that they share the same value?

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset and keep only the product display name column
dataset_2 = "/dataset_files/styles_2.csv"
df = pd.read_csv(dataset_2)
df = df.drop(['gender', 'masterCategory', 'subCategory', 'articleType',
              'baseColour', 'season', 'year', 'usage'], axis=1)

# Vectorize the product names and compute pairwise cosine similarity
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['ProductDisplayName'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


Solution

  • Unfortunately, we can't use the optional vocabulary argument to TfidfVectorizer to signal synonyms; I tried and got ValueError: Vocabulary contains repeated indices.
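
    As a minimal illustration of that failure, here is a toy repro (the vocabulary below is made up for the demonstration):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Attempt to point several synonyms at one shared column index
    shared_vocab = {'blue': 0, 'navy': 0, 'teal': 0, 'cat': 1}
    vectorizer = TfidfVectorizer(vocabulary=shared_vocab)
    # Raises ValueError: Vocabulary contains repeated indices.
    vectorizer.fit(['the navy cat', 'the teal cat'])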

    Instead, you can run the TfidfVectorizer once, then manually merge the columns that correspond to synonyms:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    ## DATA
    corpus = ['The grey cat eats the navy mouse.',
              'The ashen cat drives the red car.',
              'There is a mouse on the brown banquette of the crimson car.',
              'The teal car drove over the poor cat and tarnished its beautiful silver fur with scarlet blood.',
              'I bought a turquoise sapphire shaped like a cat and mounted on a rose gold ring.',
              'Mice and cats alike are drowning in the deep blue sea.']
    synonym_groups = [['grey', 'gray', 'ashen', 'silver'],
                      ['red', 'crimson', 'rose', 'scarlet'],
                      ['blue', 'navy', 'sapphire', 'teal', 'turquoise']]
    
    ## VECTORIZING A FIRST TIME TO GET vectorizer.vocabulary_
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(corpus)
    
    ## MERGING SYNONYM COLUMNS
    vocab = vectorizer.vocabulary_
    # The first word of each group is the representative the others fold into
    synonym_representants = {group[0] for group in synonym_groups}
    redundant_synonyms = {word: group[0] for group in synonym_groups for word in group[1:]}
    syns_dict = {group[0]: group for group in synonym_groups}
    # More robust alternative if a group's first word is absent from the vocabulary:
    # syns_dict = {next(word for word in group if word in vocab): group for group in synonym_groups}
    
    # Indices of the columns to keep (everything except the redundant synonyms)
    nonredundant_columns = sorted(v for k, v in vocab.items() if k not in redundant_synonyms)
    
    # Sum each group's tf-idf weights into its representative's column
    for rep in synonym_representants:
        X[:, vocab[rep]] = X[:, [vocab[syn] for syn in syns_dict[rep] if syn in vocab]].sum(axis=1)
    
    # Drop the redundant columns and rebuild the matching word list
    Y = X[:, nonredundant_columns]
    new_vocab = [w for w in sorted(vocab, key=vocab.get) if w not in redundant_synonyms]
    
    ## COSINE SIMILARITY
    cos_sim = cosine_similarity(Y, Y)
    
    ## RESULTS
    print(' ', ''.join('{:11.11}'.format(word) for word in new_vocab))
    print(Y.toarray())
    print()
    print('Cosine similarity')
    print(cos_sim)
    

    Output:

      alike      banquette  beautiful  blood      blue       bought     brown      car        cat        cats       deep       drives     drove      drowning   eats       fur        gold       grey       like       mice       mounted    mouse      poor       red        ring       sea        shaped     tarnished 
    [[0.         0.         0.         0.         0.49848319 0.         0.         0.         0.29572971 0.         0.         0.         0.         0.         0.49848319 0.         0.         0.49848319 0.         0.         0.         0.40876335 0.         0.         0.         0.         0.         0.        ]
     [0.         0.         0.         0.         0.         0.         0.         0.35369727 0.30309169 0.         0.         0.51089257 0.         0.         0.         0.         0.         0.51089257 0.         0.         0.         0.         0.         0.51089257 0.         0.         0.         0.        ]
     [0.         0.490779   0.         0.         0.         0.         0.490779   0.3397724  0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.         0.4024458  0.         0.490779   0.         0.         0.         0.        ]
     [0.         0.         0.31893014 0.31893014 0.31893014 0.         0.         0.2207993  0.18920822 0.         0.         0.         0.31893014 0.         0.         0.31893014 0.         0.31893014 0.         0.         0.         0.         0.31893014 0.31893014 0.         0.         0.         0.31893014]
     [0.         0.         0.         0.         0.65400152 0.32700076 0.         0.         0.19399619 0.         0.         0.         0.         0.         0.         0.         0.32700076 0.         0.32700076 0.         0.32700076 0.         0.         0.32700076 0.32700076 0.         0.32700076 0.        ]
     [0.37796447 0.         0.         0.         0.37796447 0.         0.         0.         0.         0.37796447 0.37796447 0.         0.         0.37796447 0.         0.         0.         0.         0.         0.37796447 0.         0.         0.         0.         0.         0.37796447 0.         0.        ]]
    
    Cosine similarity
    [[1.         0.34430458 0.16450509 0.37391712 0.3479721  0.18840894]
     [0.34430458 1.         0.37091192 0.46132163 0.20500145 0.        ]
     [0.16450509 0.37091192 1.         0.23154573 0.14566346 0.        ]
     [0.37391712 0.46132163 0.23154573 1.         0.3172916  0.12054426]
     [0.3479721  0.20500145 0.14566346 0.3172916  1.         0.2243601 ]
     [0.18840894 0.         0.         0.12054426 0.2243601  1.        ]]
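
    Alternatively, if you prefer to avoid the post-hoc column merging, you can canonicalize the synonyms in the raw text before vectorizing, via the preprocessor argument of TfidfVectorizer. Below is a sketch of that idea, reusing corpus and synonym_groups from above (the canonicalize helper and its regex are illustrative, not a scikit-learn API). Note that the idf values will differ slightly from the column-merging approach, since each synonym group is counted as a single term from the start.

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Map every redundant synonym to its group's representative
    replacements = {word: group[0] for group in synonym_groups for word in group[1:]}
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b')

    def canonicalize(text):
        # Custom preprocessor: lowercase manually (a custom preprocessor
        # replaces the default lowercasing), then rewrite each synonym
        return pattern.sub(lambda m: replacements[m.group(1)], text.lower())

    vectorizer = TfidfVectorizer(stop_words='english', preprocessor=canonicalize)
    Z = vectorizer.fit_transform(corpus)
    cos_sim = cosine_similarity(Z, Z)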