Tags: python, list, nltk, tokenize

How do I check for specific words in a list of tokenized sentences and then mark them as one or zero?


I am trying to match words from several category lists against a list of tokenized sentences. If a sentence contains a word from a category, I want to append a 1 for that category and a 0 for each of the other categories. For example:

category_a=["stain","sweat","wet","burn"]
category_b=["love","bad","favorite"]
category_c=["packaging","delivery"]
tokenized_sentences=['this deodorant does not stain my clothes','i love this product','i sweat all day']
for i in category_a:
    for j in tokenized_sentences:
          if(i in nltk.word_tokenize(j)):
                 list_a.append(j)
                 tag_a,tag_b,tag_c=([],)*3
                 tag_a.append(1)
                 tag_b.append(0)
                 tag_c.append(0)
                 final=tag_a+tag_b+tag_c

Similarly for category_b and category_c

Expected output:this deodorant does not stain my clothes-->[1,0,0]
                i love this product-->[0,1,0]
                i sweat all day-->[1,0,0]
                great fragrance-->[0,0,0]

Instead, I am getting duplicate outputs for each sentence, such as i love this product-->[1,0,0] printed twice, and grouped outputs such as [i love this product,i sweat all day]-->[0,1,0].

Also, if a sentence has words from two different categories, e.g. 'this product does not stain and i love it', the expected output would be [1,1,0].

How do I get the output in the required format?


Solution

  • This should do the job:

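    category_a = ["stain", "sweat", "wet", "burn"]
    category_b = ["love", "bad", "favorite"]
    category_c = ["packaging", "delivery"]
    sentences = ['this deodorant does not stain my clothes', 'i love this product', 'i sweat all day']

    results = []

    for sentence in sentences:
        # Each flag starts at 0 and flips to 1 once a word from that category is seen.
        cat_a = 0
        cat_b = 0
        cat_c = 0
        for word in sentence.split():
            # Only test a category that has not matched yet, so a 1 is never reset to 0.
            if cat_a == 0:
                cat_a = 1 if word in category_a else 0
            if cat_b == 0:
                cat_b = 1 if word in category_b else 0
            if cat_c == 0:
                cat_c = 1 if word in category_c else 0

        # Store the sentence together with its [a, b, c] flags.
        results.append((sentence, [cat_a, cat_b, cat_c]))

    print(results)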
    

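    This code checks, for each sentence, whether it contains at least one word from each of the given categories, and stores the sentence and its result together as a tuple. Each tuple is appended to a list called results. Because every category is checked independently, a sentence with words from two categories gets a 1 for both.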

    Output:

    [('this deodorant does not stain my clothes', [1, 0, 0]), ('i love this product', [0, 1, 0]), ('i sweat all day', [1, 0, 0])]
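
If more categories are added later, the same idea can be written with sets instead of one variable per category. The following is only a minimal sketch, not the answerer's code: the `categories` dict and the `tag_sentence` helper are hypothetical names introduced here, and it assumes plain whitespace splitting is an adequate tokenizer (you could swap in `nltk.word_tokenize` if punctuation matters).

```python
# Hypothetical layout: one set of keywords per category, in output order.
categories = {
    "a": {"stain", "sweat", "wet", "burn"},
    "b": {"love", "bad", "favorite"},
    "c": {"packaging", "delivery"},
}


def tag_sentence(sentence, categories):
    """Return one 0/1 flag per category, in the dict's insertion order."""
    # Assumption: lowercasing plus whitespace splitting is enough tokenization.
    words = set(sentence.lower().split())
    # A category is flagged when it shares at least one word with the sentence.
    return [1 if words & keywords else 0 for keywords in categories.values()]


sentences = [
    "this deodorant does not stain my clothes",
    "i love this product",
    "i sweat all day",
    "great fragrance",
    "this product does not stain and i love it",
]

results = [(s, tag_sentence(s, categories)) for s in sentences]
for sentence, flags in results:
    print(sentence, "-->", flags)
```

Each sentence is processed exactly once, so there are no duplicate rows, and a sentence with no matching word simply gets all zeros. Set intersection also avoids scanning every category list for every word.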