perl zsh cjk one-liner

Issue matching Chinese characters in Perl one liner using \p{script=Han}


I'm really stumped trying to match Chinese characters with a Perl one-liner in zsh. I cannot get \p{script=Han} to match Chinese characters, yet \P{script=Han} does match them.

Task: I need to change this:

一  
<lb/> 二

to this:

<tag ref="一二">一
<lb/> 二</tag>

There could be a variable number of tags, newlines, whitespace, tabs, alphanumeric characters, etc. between the two Chinese characters. I believe the most efficient and robust way to do this would be to look for anything that is not a Chinese character.

My attempted solution:

perl -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g' file

This has the desired effect when applied to the example above.

Problem: The issue I am having is that \P{script=Han} (or \p{^script=Han}) matches Chinese characters as well.

When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters. When trying to match \P{script=Han}, the regex matches every character in the file.

I don't know why.

This is a problem because, with the following input, the output is not what I want:

一
<lb/> 三二

becomes

<tag ref="一二">一
<lb/> 三二</tag>

I don't want this to be matched at all. I only want to match instances where 一 and 二 are separated solely by characters that are not Chinese characters.

Can anyone tell me what I'm doing wrong? Or suggest a workaround? Thanks!


Solution

  • When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters.

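    The problem is that both your script and your input file are UTF-8 encoded, but you do not tell perl so. Unless told otherwise, perl treats both as sequences of single bytes, so each Chinese character reaches the regex as three separate one-byte characters. None of those is a Han character, which is why \p{script=Han} matches nothing and \P{script=Han} matches everything.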

    To declare that your script is UTF-8 encoded, use the utf8 pragma. To tell perl that all files you open are UTF-8 encoded, use the -CD command line option. So the following one-liner should solve your problem:

    perl -Mutf8 -CD -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g' file
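
If you want to convince yourself before touching real data, here is a rough sketch that reproduces both the symptom and the fix on throwaway files. The scratch directory, the file names (han.txt, sample1.xml, sample2.xml) and the shell variables are made up for the illustration. It also assumes the replacement should write $1$2$3 inside the tag, so that the two characters are kept as in the desired output from the question.

```shell
# Scratch directory so no real file is touched (hypothetical names throughout).
tmpdir=$(mktemp -d)

# 1. Count \p{script=Han} matches in a file holding three Han characters,
#    first with the input read as raw bytes, then with -CD decoding it as UTF-8.
printf '一二三' > "$tmpdir/han.txt"
count='my $n = () = /\p{script=Han}/g; print $n'
bytes_count=$(perl -ne "$count" "$tmpdir/han.txt")      # 9 undecoded bytes
chars_count=$(perl -CD -ne "$count" "$tmpdir/han.txt")  # 3 decoded characters
echo "without -CD: $bytes_count, with -CD: $chars_count"

# 2. Run the substitution on the two inputs from the question.
#    sample1: only non-Han characters between 一 and 二 (should be tagged).
#    sample2: the Han character 三 sits in between (should stay untouched).
printf '一\n<lb/> 二\n'   > "$tmpdir/sample1.xml"
printf '一\n<lb/> 三二\n' > "$tmpdir/sample2.xml"

perl -Mutf8 -CD -0777 -pi -e \
  's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g' \
  "$tmpdir/sample1.xml" "$tmpdir/sample2.xml"

result1=$(cat "$tmpdir/sample1.xml")
result2=$(cat "$tmpdir/sample2.xml")
printf '%s\n---\n%s\n' "$result1" "$result2"
```

The first part should print 0 matches for the undecoded read and 3 for the decoded one. In the second part, sample1.xml ends up wrapped in the tag while sample2.xml is left exactly as it was, because 三 cannot be consumed by \P{script=Han}.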