python data-cleaning

Editing data encapsulated in flags from text file


I am currently cleaning data from text files that contain transcriptions of speech from daily conversations. Some of the files are multilingual; a few examples of multilingual portions look like this:

around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine

A single file can contain several such other languages.

Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping just the English pronunciation of the word. I have tried replacing <tamil> with "", but I am quite unsure how to approach removing the Tamil words themselves.

The expected output would be:

around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine

How would I go about doing so?


Solution

  • Please try this:

    content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"

    def is_markup(s):  # tag fragments ("<tamil") and HTML entities ("&amp;")
        s = s.strip()
        return s.startswith('<') or (s.startswith('&') and s.endswith(';'))

    # Pad the brackets so tags become separate tokens, then drop those tokens.
    chunks = [c.strip() for c in content.replace('<', ' <').replace('>', '> ').split('>') if not is_markup(c)]
    ft = ' '.join(w for c in chunks for w in c.split() if not is_markup(w))
    # Dropping non-ASCII characters removes the Tamil script but leaves the ':'.
    outputs = ft.encode('ascii', 'ignore')

    print(outputs.decode('utf-8'))

    Output:

    around that area, :ammaa would have cooked too
    

    This is not the complete output. The final string still contains leftovers such as the ":" and some punctuation, so please clean those up yourself with a regex. The code above covers most of the work.
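
One way to do that regex cleanup in a single pass is sketched below. It is only a sketch under some assumptions: every tagged span looks like <lang>native:romanized</lang> with an ASCII colon, as in the two examples from the question, and a space is added when a tag directly follows other text (that is what turns "area,<tamil>" into "area, ammaa" in the expected output). The names TAGGED and strip_language_tags are made up for illustration.

```python
import re

# <lang>native:romanized</lang>; \1 requires the closing tag to match the opening one.
TAGGED = re.compile(r"<(\w+)>[^<>:]*:([^<>]*)</\1>")


def _romanized(match):
    """Return only the part after the colon, padded with a space if needed."""
    word = match.group(2).strip()
    start = match.start()
    # Assumption: insert a space when the tag is glued to the preceding text.
    if start > 0 and not match.string[start - 1].isspace():
        return " " + word
    return word


def strip_language_tags(text):
    """Replace every tagged span in text with its romanized part."""
    return TAGGED.sub(_romanized, text)


lines = [
    "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too",
    "so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine",
]
for line in lines:
    print(strip_language_tags(line))
```

Because the pattern captures the tag name instead of hard-coding it, the same call should handle several languages in one file. Spans that have no colon, or that use a different separator, are left untouched, so it is worth checking a sample of the real data before relying on it.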