I have a file containing lines like this:
ted 1-1 1.0 politicians do not have permission to do what needs to be
done.
राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.
I have to write a program that reads the file line by line and writes only the Hindi part to an output file. The first word on each line indicates the source of the two sentences that follow, and those two sentences are translations of each other. Basically, I am trying to create a parallel corpus out of this file.
You can do this by checking each character's Unicode code point: Devanagari, the script used for Hindi, occupies the range U+0900 to U+097F.
def detect_language(character):
    # Devanagari (the script used for Hindi) occupies U+0900 to U+097F.
    if u'\u0900' <= character <= u'\u097f':
        return 'hindi'
    return 'other'

# Python 3: the built-in open() decodes UTF-8 directly.
with open('letter.txt', encoding='utf-8') as f:
    text = f.read()

for ch in text:
    lang = detect_language(ch)
    if lang == 'hindi':
        # Hindi character
        # add this to another file
        print(ch, end="\t")
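If you need whole Hindi sentences rather than individual characters, the same Devanagari range check can be applied line by line. Here is a rough sketch, assuming the Hindi sentence either follows the English one on the same line or sits on a line of its own, and runs to the end of that line. The names `is_devanagari`, `split_pair`, `extract_hindi` and the output file `hindi.txt` are just made up for illustration.

```python
def is_devanagari(ch):
    # Devanagari block: U+0900 to U+097F (includes the danda, U+0964).
    return '\u0900' <= ch <= '\u097f'


def split_pair(line):
    """Split a line into (prefix, hindi) at the first Devanagari character.

    Assumption: everything from the first Devanagari character to the end
    of the line is the Hindi sentence. Lines without Devanagari give ''.
    """
    for idx, ch in enumerate(line):
        if is_devanagari(ch):
            return line[:idx].strip(), line[idx:].strip()
    return line.strip(), ''


def extract_hindi(in_path, out_path):
    """Write the Hindi part of every input line to out_path, one per line."""
    with open(in_path, encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            _, hindi = split_pair(line)
            if hindi:
                dst.write(hindi + '\n')


if __name__ == '__main__':
    # 'hindi.txt' is a hypothetical output name.
    extract_hindi('letter.txt', 'hindi.txt')
```

`split_pair` also returns the part before the Hindi text (source tag plus English sentence), so you could write that to a second file to get the other side of the parallel corpus. Note that any Latin characters or digits embedded after the first Devanagari character are kept as part of the Hindi sentence.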
Hope this helps