python unicode unicode-string hindi

Python unicode search not giving correct answer


I am trying to search for Hindi words, stored one word per line in file-1, in the lines of file-2. I have to print each matching line number along with the number of words found on that line. This is the code:

import codecs

hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []

for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1

for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count

This finds some words but ignores others. The input files are:

File-1:

पौधा  
वनस्पति

File-2:

वनस्पति, पेड़-पौधा  
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग  
पादप_समूह, पेड़-पौधे, वनस्पति_समूह  
पेड़-पौधा

This gives output:

0 1  
3 1

Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?


Solution

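  • That's because you don't remove the "\n" character at the end of each line read by readlines(). So you are searching for "some_pattern\n", not "some_pattern", and a word only matches when it happens to be the last thing on a line. Use the strip() function to chop the newlines off, like this: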

    import codecs
    
    words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
    hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
    count_arr = []
    
    for line in hypernyms:
        count_arr.append(0)
        for word in words:
            count_arr[-1] += (word in line)
    
    # enumerate() yields (index, value) pairs, so unpack both
    for iterator, count in enumerate(count_arr):
        if count:
            print iterator, ' ', count
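
To see the effect of the trailing newline without touching the real files, here is a minimal Python 3 sketch (the code above is Python 2). The helper name count_matches is hypothetical, and io.StringIO objects stand in for the two files, filled with the sample data from the question. It also skips blank entries, on the assumption that a word file may contain empty lines, since an empty string is "in" every line.

```python
import io


def count_matches(lines, words):
    """Return (line_index, count) pairs for lines containing at least one word."""
    # Strip whitespace/newlines and drop blank entries ('' is contained in every string).
    cleaned = [w.strip() for w in words if w.strip()]
    results = []
    for index, line in enumerate(lines):
        count = sum(1 for w in cleaned if w in line)
        if count > 0:
            results.append((index, count))
    return results


# Sample data copied from the question; StringIO stands in for the real files.
words_file = io.StringIO("पौधा\nवनस्पति\n")
lines_file = io.StringIO(
    "वनस्पति, पेड़-पौधा\n"
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n"
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n"
    "पेड़-पौधा\n"
)

words = words_file.readlines()   # each entry still ends with "\n"
lines = lines_file.readlines()

# Unstripped search: "वनस्पति\n" can only match where the word ends a line.
raw_hits = [(i, sum(1 for w in words if w in line)) for i, line in enumerate(lines)]

# Stripped search: words match anywhere in the line.
for index, count in count_matches(lines, words):
    print(index, count)
```

With real files, the same function should work on lists read via io.open(path, encoding="utf-8"), which is available in both Python 2 and 3. Note that this is still a plain substring test, so a word also counts when it appears inside a longer token such as वनस्पति_समूह.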