I am trying to take Hindi words, listed one per line in file-1, and search for them in the lines of file-2. I have to print each matching line number along with the number of words found on that line. This is the code:
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >= 0:
            count_arr[counter] += 1
for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count
This finds some words but ignores others. The input files are:
File-1:
पौधा
वनस्पति
File-2:
वनस्पति, पेड़-पौधा
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग
पादप_समूह, पेड़-पौधे, वनस्पति_समूह
पेड़-पौधा
This gives output:
0 1
3 1
Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?
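That's because you don't remove the "\n" character at the end of each line read from the word file. So you are searching for "some_pattern\n", not "some_pattern", and a word only matches when it happens to sit at the very end of a line. Use the strip() function to chop the newline off, like this: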
import codecs
# Strip each search word so the trailing "\n" is not part of the pattern.
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []
for line in hypernyms:
    count_arr.append(0)
    for word in words:
        # (word in line) is a bool, which adds as 0 or 1.
        count_arr[-1] += (word in line)
for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count
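If you want to check the behaviour without touching the files, here is a minimal self-contained sketch using the sample data from the question. It assumes Python 3 (where print is a function and strings are Unicode by default), and the helper name count_matches is made up for illustration; the newlines are written out explicitly to mimic what readlines() returns.

```python
def count_matches(words, lines):
    """Return (line_index, count) pairs for lines containing at least one word."""
    # Strip the trailing "\n" from each word and skip blank entries,
    # because an empty string is "in" every line.
    cleaned = [w.strip() for w in words if w.strip()]
    results = []
    for index, line in enumerate(lines):
        count = sum(1 for w in cleaned if w in line)
        if count:
            results.append((index, count))
    return results


# Sample data from the question, with the newlines that readlines() keeps.
words = ["पौधा\n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा\n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n",
    "पेड़-पौधा\n",
]

# Unstripped, the word is followed by "\n", but in the line it is followed by ",".
print("वनस्पति\n" in lines[0])  # False
print("वनस्पति" in lines[0])    # True

for index, count in count_matches(words, lines):
    print(index, count)
```

With the words stripped, this reports lines 0, 2 and 3 with counts 2, 1 and 1. Note that this is plain substring matching, so पौधा does not match पौधे on line 2; only वनस्पति is counted there. Skipping blank entries matters once you strip: a blank line in the word file becomes an empty string, which would otherwise match every line.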