
Regex: remove English text from mixed Chinese-English text using Notepad++ and Excel?


I work with Notepad++ and Excel. I have data that contains text in English and Chinese.

The data structure is as follows:

<p> chinese text</p>
<p> english text</p>
<p> chinese text</p>
<p> english text</p>
<p> chinese text</p>
<p> english text</p>

How can I delete all the English text, including symbols, between <p> and </p>?

That is, only the Chinese text between <p> and </p> should remain.

The result should look like this:

<p> chinese text</p>
<p> chinese text</p>
<p> chinese text</p>

I tried to delete the English text by removing ASCII characters with a regex, but some English text was missed.

See this pic: PIC

Please help me, thanks.


Solution

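  • You should be able to do this in Notepad++ with three Replace operations:

    • replace <p>[a-zA-Z"].*$ with an empty string (regex replace mode); this blanks every paragraph line whose content starts with a Latin letter or a double quote
    • replace \n\n with \n (extended replace mode) to remove the empty lines left behind
    • replace <p>|</p> with an empty string (regex replace mode); skip this step if the <p> tags should stay, as in the desired result in the question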
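
For anyone who would rather script this than run the replacements by hand, the same three steps can be sketched in Python. This is only a rough equivalent, not part of the original answer: the function name `drop_english_paragraphs` and the `strip_tags` flag are made up for illustration, and the `[ \t]*` after `<p>` is an added assumption so that the pattern tolerates the space shown after `<p>` in the question's sample data.

```python
import re


def drop_english_paragraphs(text: str, strip_tags: bool = False) -> str:
    """Remove <p> lines that start with a Latin letter or a double quote.

    Hypothetical helper mirroring the three Notepad++ replacements.
    """
    # Step 1: blank out each line whose <p> content begins with a-z, A-Z or ".
    # [ \t]* is an assumption: it allows optional spaces/tabs after <p>.
    text = re.sub(r'<p>[ \t]*[a-zA-Z"].*$', '', text, flags=re.MULTILINE)

    # Step 2: collapse the empty lines left behind by step 1.
    text = re.sub(r'\n{2,}', '\n', text)

    # Step 3 (optional): strip the <p> and </p> tags themselves.
    if strip_tags:
        text = re.sub(r'</?p>', '', text)

    # Drop any leading or trailing newline left over from removed lines.
    return text.strip('\n')


sample = (
    '<p> 中文文本</p>\n'
    '<p> english text</p>\n'
    '<p> 更多中文</p>\n'
    '<p> "quoted" text</p>'
)

print(drop_english_paragraphs(sample))
```

Like the Notepad++ version, this only removes paragraphs that begin with English; a paragraph that starts with Chinese and contains English words further along is left untouched.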