Extract only text in Hindi from a file containing both Hindi and English

I have a file containing lines like this:

 ted    1-1 1.0 politicians do not have permission to do what needs to be 
 done.  

 राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.

I have to write a program that reads the file line by line and writes only the Hindi part to an output file. Here the first word indicates the source of the last two segments, and the last two sentences are translations of each other. Basically, I am trying to create a parallel corpus out of this file.

Solution

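You can do this by checking whether each character falls inside the Devanagari Unicode block (U+0900 to U+097F), which is the script used to write Hindi.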

import codecs

def detect_language(character):
    # Devanagari (the script used for Hindi) occupies U+0900 to U+097F
    if u'\u0900' <= character <= u'\u097f':
        return 'hindi'
    return 'other'

with codecs.open('letter.txt', encoding='utf-8') as f:
    text = f.read()
    # Walk the file one character at a time
    for ch in text:
        lang = detect_language(ch)
        if lang == 'hindi':
            # Hindi character: add it to another file here
            print(ch, end="\t")
            print(lang)
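
Since the question asks for line-by-line processing, here is a rough sketch of the same Unicode-range check applied to whole lines instead of single characters. It assumes that Hindi and English sit on separate lines, as in the sample above, and keeps any line that contains at least one Devanagari character. The names extract_hindi_lines and extract_hindi_file and the output file name are made up for illustration.

```python
import re

# Matches any single character in the Devanagari block (U+0900 to U+097F).
DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def extract_hindi_lines(lines):
    """Return the stripped lines that contain at least one Devanagari character."""
    return [line.strip() for line in lines if DEVANAGARI.search(line)]

def extract_hindi_file(src_path, dst_path):
    """Copy only the Hindi lines from src_path to dst_path (both UTF-8)."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in extract_hindi_lines(src):
            dst.write(line + '\n')
```

Calling extract_hindi_file('letter.txt', 'hindi_only.txt') would then write one Hindi sentence per line. A line that mixes both scripts is kept whole, so if your real data has such lines you would need to split them further.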

Hope this helps