
Extract only text in Hindi from a file containing both Hindi and English


I have a file containing lines like

 ted    1-1 1.0 politicians do not have permission to do what needs to be 
 done.  

 राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.

I have to write a program that reads the file line by line and writes only the Hindi part to an output file. The first word of each record indicates the source of the two segments that follow, and those two sentences are translations of each other. Basically, I am trying to create a parallel corpus out of this file.


Solution

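  • You can do this by checking whether each character falls in the Devanagari Unicode block (U+0900 to U+097F), which is the script used to write Hindi.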

    def detect_language(character):
        # Devanagari (the script used for Hindi) occupies U+0900 to U+097F.
        if '\u0900' <= character <= '\u097f':
            return 'hindi'
        return 'other'

    # Built-in open() handles the decoding in Python 3.
    with open('letter.txt', encoding='utf-8') as f:
        text = f.read()  # avoid naming this 'input', which shadows the built-in
        for ch in text:
            language = detect_language(ch)
            if language == 'hindi':
                # Hindi character: add this to another file if needed
                print(ch, end="\t")
                print(language)
    

    Hope this helps
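
Since the question asks for whole Hindi lines in an output file rather than single characters, the same range check can be applied per line. The following is only a minimal sketch under some assumptions: the helper names (`is_devanagari`, `is_hindi_line`, `extract_hindi_lines`) and the 0.5 threshold are made up for illustration, and a line is treated as Hindi when most of its letters fall in the Devanagari block.

```python
def is_devanagari(ch):
    """Return True if ch lies in the Devanagari block (U+0900 to U+097F)."""
    return '\u0900' <= ch <= '\u097f'


def is_hindi_line(line, threshold=0.5):
    """Guess whether a line is Hindi.

    Assumption: a line counts as Hindi when more than `threshold` of its
    letters are Devanagari. Vowel signs are not 'alphabetic' for
    str.isalpha(), so Devanagari characters are counted explicitly.
    """
    letters = [ch for ch in line if ch.isalpha() or is_devanagari(ch)]
    if not letters:
        return False
    hindi = sum(1 for ch in letters if is_devanagari(ch))
    return hindi / len(letters) > threshold


def extract_hindi_lines(in_path, out_path):
    """Copy only the Hindi lines of in_path to out_path; return how many were written."""
    written = 0
    with open(in_path, encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if is_hindi_line(line):
                dst.write(line.strip() + '\n')
                written += 1
    return written
```

It could be called as `extract_hindi_lines('letter.txt', 'hindi.txt')`, where `hindi.txt` is a hypothetical output name. Using a ratio instead of "any Devanagari character" keeps an English line that merely quotes a Hindi word from being misclassified, though the threshold may need tuning for your data.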