regex linux perl ksh

Perl regular expression substitution with groups


I have the following JSON input

... "somefield":"somevalue", "time":"timevalue", "anotherfield":"value" ...

inside my KornShell (ksh) script, and I wish to replace timevalue with my own value. So I created this regular expression using groups, which works just fine:

data=`cat somefile.json`
echo $data | perl -pe "s|(.*time\"\s*\:\s*\").*?(\".*)|\1%TIME%\2|g" | another-script.sh

... "somefield":"somevalue", "time":"%TIME%", "anotherfield":"value" ...

However, I cannot use a value that starts with a digit as the substitution, because the digits run straight into the group number, so this one obviously doesn't work:

perl -pe "s|(.*time\"\s*\:\s*\").*?(\".*)|\120:00:00\2|g"

I can overcome this by doing a two-step substitution:

perl -pe "s|(.*time\"\s*\:\s*\").*?(\".*)|\1%TIME%\2|g" | perl -pe "s|%TIME%|20:00:00|"

... "somefield":"somevalue", "time":"20:00:00", "anotherfield":"value" ...

but I am sure there is a better and more elegant way to do it.


Solution

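  • In Perl, \1 is a backreference meant for use inside the pattern; in the replacement you should write $1. (Perl tolerates \1 there for sed compatibility, but if you had enabled warnings, e.g. with perl -w, it would have told you that \1 is better written as $1.) $1 can be disambiguated from surrounding digits by adding { }: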

    perl -pe 's|(.*time"\s*:\s*").*?(".*)|${1}20:00:00$2|g'
    

    (I also removed all the redundant backslashes from the regex and switched to single quotes, so the shell passes ${1} and $2 through to Perl untouched.)

    On another note, what's the point of matching .* if you're just going to replace it by itself? Couldn't it just be

    perl -pe 's|(time"\s*:\s*").*?(")|${1}20:00:00$2|g'
    

    ?

    I'm not a big fan of .* or .*?. If you're trying to match the inside of a quoted string, it would be better to be specific:

    perl -pe 's|(time"\s*:\s*")[^"]*(")|${1}20:00:00$2|g'
    

    We're not trying to validate the input string, so now there's really no reason to match that final " (and replace it by itself) either:

    perl -pe 's|(time"\s*:\s*")[^"]*|${1}20:00:00|g'
    

    If your Perl is not ancient (5.10+), you can use \K to "keep" the leading part of the string, i.e. exclude it from the text that gets replaced:

    perl -pe 's|time"\s*:\s*"\K[^"]*|20:00:00|g'
    

    Now only the [^"]* part will be substituted, saving us from having to do any capturing.