Tags: python, pandas, nlp, string-matching

How can I get back the string from its BoW vector?


I have generated a BoW (bag-of-words) matrix for a pandas dataframe column, tech_raw_data['Product lower'].

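from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
smer_counts = count_vect.fit_transform(tech_raw_data['Product lower'].values.astype('U'))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
smer_vocab = count_vect.get_feature_names_out()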

Next, to test string similarity against these BoW vectors, I created a BoW vector for a single entry of a column in another dataframe, toys['ITEM NAME'].

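import pandas as pd

toys = pd.read_csv('toy_data.csv', engine='python')
print('-'*80)
print(toys['ITEM NAME'].iloc[0])
print('-'*80)
inp = [toys['ITEM NAME'].iloc[0]]

# transform() reuses the vocabulary fitted above, so the columns line up with smer_counts
cust_counts = count_vect.transform(inp)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cust_vocab = count_vect.get_feature_names_out()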

Checking similarities:

from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a similarity score between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()

for x in cust_counts[0].toarray():
    for y in smer_counts.toarray():
        ratio = similar(x, y)
        #print(ratio)
        if ratio >= 0.85:
            pass  # should print the string corresponding to BoW vector y

Whenever the match ratio is at least 0.85, I need to print the string in tech_raw_data['Product lower'] that corresponds to the matching row of smer_counts.


Solution

for x in cust_counts[0].toarray():
    for i, y in enumerate(smer_counts.toarray()):
        ratio = similar(x, y)
        #print(ratio)
        if ratio >= 0.85:
            # i is the row position in smer_counts, so look the record up by position
            print(tech_raw_data['Product lower'].iloc[i])

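Enumerate the numpy array returned by smer_counts.toarray() and, whenever ratio >= 0.85, use the index to fetch the corresponding text from the tech_raw_data dataframe.

This is valid because len(smer_counts.toarray()) == len(tech_raw_data) and the order of records in the dataframe is preserved, so row i of the matrix always belongs to the i-th record. Note that i is a position, not a label: if the dataframe's index is not the default 0..n-1 range (for example after filtering rows), look the record up by position with .iloc rather than by label with .loc.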
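The index-based lookup can be tried without any data files. The sketch below is a minimal illustration, not the asker's actual pipeline: the function name matching_texts, the three product strings, and the hand-written count vectors are all assumptions made up for the example. The vectors stand in for smer_counts.toarray() over the alphabetically ordered vocabulary blue, car, puzzle, red, toy, train, wooden.

```python
from difflib import SequenceMatcher


def similar(a, b):
    # Same helper as in the question: similarity score between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()


def matching_texts(query_vec, count_rows, texts, threshold=0.85):
    """Return the texts whose BoW row is similar enough to query_vec.

    count_rows[i] must be the BoW vector built from texts[i]; that shared
    position is what lets us get from a vector back to its string.
    (Hypothetical helper, named for this example only.)
    """
    matches = []
    for i, row in enumerate(count_rows):
        if similar(query_vec, row) >= threshold:
            matches.append(texts[i])
    return matches


# Hand-written stand-ins for tech_raw_data['Product lower'] and smer_counts.toarray()
texts = ['red toy car', 'blue toy train', 'wooden puzzle']
count_rows = [
    [0, 1, 0, 1, 1, 0, 0],  # red toy car
    [1, 0, 0, 0, 1, 1, 0],  # blue toy train
    [0, 0, 1, 0, 0, 0, 1],  # wooden puzzle
]

# Stand-in for cust_counts[0]: the BoW vector of the query 'red toy car'
query_vec = [0, 1, 0, 1, 1, 0, 0]

print(matching_texts(query_vec, count_rows, texts))  # ['red toy car']
```

Keeping the texts and the count rows as two parallel sequences is the whole trick: the vector itself carries no word order, so the original string can only be recovered through the shared row position.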