SED only returns the second value with two matches. Any thoughts?


I was testing a sed query that looks like this:

sed -nE 's/.*(49 [0-9]{3} [0-9]{7}).*/\1/gp' infile.in > outfile3.dat

The file infile.in has the following garbage string in it on a single line.

serialised subrules pouring urometer iscose rehandled dialyzate 49 378 7647137 predicative infausting syncarida oyers voicers dioc hartal gulonic overearly crescentwise apostatic 49 046 6728421 agglutinates skilletfishes quantummechanical rumaging recommend cryptorchism sympathize songlet .

The resulting outfile3.dat contains only:

49 046 6728421

That is the second number on the line. I expected the g flag to put both phone numbers into the file, but that is not the case.

Any thoughts? I need to use sed for this example. I can do it with awk or Python, but I really want to understand sed better.

I want to find all instances of phone numbers in a file, but if more than a single match occurs on a given line, the first is dropped despite the "g" flag in the command. I am not sure why.


Solution

  • The reason your script fails is that you use .* which does greedy matching:

    • The first .* gobbles up as much of the line as possible, while still allowing the rest of the regex to match. So all except the final number are lost.
    • If the first .* were not present, the second .* would gobble up the tail of the line and so all except the first number would be lost.
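
    Both cases can be reproduced on a shorter line. The sketch below uses a made-up sample string (not the line from the question). It also shows why the g flag does not help: a single match already spans the rest of the line, so there is nothing left for a second match.

```shell
#!/bin/sh
# Hypothetical short stand-in for the question's line: two numbers with filler words.
line='aa 49 111 2222222 bb 49 333 4444444 cc'

# Leading and trailing .*: the leading .* is greedy and only gives back enough
# characters for the LAST number to match, so the earlier number is discarded.
last_only=$(printf '%s\n' "$line" | sed -nE 's/.*(49 [0-9]{3} [0-9]{7}).*/\1/gp')

# Trailing .* only: the match starts at the FIRST number and .* swallows the
# rest of the line; the unmatched prefix "aa " stays where it was.
first_only=$(printf '%s\n' "$line" | sed -nE 's/(49 [0-9]{3} [0-9]{7}).*/\1/gp')

printf '%s\n' "$last_only"    # 49 333 4444444
printf '%s\n' "$first_only"   # aa 49 111 2222222
```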

    If the script only needs to find and print the numbers, it will be much simpler to use something other than sed.

    For example, by using a grep that has -o, or with Perl:

    grep -Eo '49 [0-9]{3} [0-9]{7}' infile >outfile
    
    perl -nE 'say for /49 \d{3} \d{7}/g' infile >outfile
    

    Note: Without extra checks, the regex matches part of strings like:

    1234549 123 123456789
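
    One possible form of such an extra check, as a sketch only: require a non-digit (or the line edge) on each side of the number, then strip that boundary character with a second grep. The sample input is invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample: line 1 only contains the pattern inside longer digit runs.
input='x 1234549 123 123456789 y
call 49 378 7647137 now
49 046 6728421'

# Unanchored pattern: also reports the fragment embedded in line 1.
loose=$(printf '%s\n' "$input" | grep -Eo '49 [0-9]{3} [0-9]{7}')

# Stage 1 keeps only numbers bounded by a non-digit or a line edge;
# stage 2 removes the boundary characters that stage 1 included.
strict=$(printf '%s\n' "$input" |
    grep -Eo '(^|[^0-9])49 [0-9]{3} [0-9]{7}($|[^0-9])' |
    grep -Eo '49 [0-9]{3} [0-9]{7}')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

    A limitation of this sketch: stage 1 consumes the boundary character, so when two numbers are separated by exactly one character, the second one is missed. A regex engine with lookarounds would avoid that.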
    

    If you really must use sed, it is possible, but much more complicated.

    It involves a multi-stage process:

    • split line into numbers and non-numbers
    • delete non-numbers
    • print result

    The complicated part is deciding what sections are "non-number".

    One method is to flag them during the split. For example:

    sed -nE '
        /49 [0-9]{3} [0-9]{7}/ {        # skip if no numbers
            s//!\n&\n!/g                # delimit and flag
            s/^[^\n]*!\n|\n![^\n]*//g   # delete flagged
            p                           # print
        }
    ' infile >outfile
    

    The first substitution splits the line, placing each number onto a new separate line and flagging the other new lines with a character that doesn't appear in numbers (!):

    serialised subrules pouring urometer iscose rehandled dialyzate !
    49 378 7647137
    ! predicative infausting syncarida oyers voicers dioc hartal gulonic overearly crescentwise apostatic !
    49 046 6728421
    ! agglutinates skilletfishes quantummechanical rumaging recommend cryptorchism sympathize songlet .
    

    Then the second substitution deletes all lines that contain the flag character.

    The same idea in two passes:

    sed -nE 's/49 [0-9]{3} [0-9]{7}/!\n&\n!/gp' infile |
    sed '/!/d' >outfile
    

    Note: Using \n in the replacement is not portable.

    For portability, escaped literal newlines are required:

    sed -nE '
        /49 [0-9]{3} [0-9]{7}/ {
            s//!\
    &\
    !/g
            s/^[^\n]*!\n|\n![^\n]*//g
            p
        }
    ' infile >outfile
    
    sed -nE 's/49 [0-9]{3} [0-9]{7}/!\
    &\
    !/gp' infile |
    sed '/!/d' >outfile
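
    As a quick sanity check, the sed variants can be compared against grep -o on a shortened version of the question's line. This is a sketch: it assumes a sed that accepts -E and treats \n inside a bracket expression as a newline (GNU sed does), and the second input line is made up to show that lines without numbers produce no output.

```shell
#!/bin/sh
# Shortened sample line from the question, plus one line with no numbers.
input='dialyzate 49 378 7647137 predicative apostatic 49 046 6728421 songlet .
no numbers on this line'

# Reference result.
expected=$(printf '%s\n' "$input" | grep -Eo '49 [0-9]{3} [0-9]{7}')

# One pass: flag the non-number pieces, then delete them in the same script.
one_pass=$(printf '%s\n' "$input" | sed -nE '
    /49 [0-9]{3} [0-9]{7}/ {
        s//!\
&\
!/g
        s/^[^\n]*!\n|\n![^\n]*//g
        p
    }
')

# Two passes: split and flag in the first sed, drop flagged lines in the second.
two_pass=$(printf '%s\n' "$input" | sed -nE 's/49 [0-9]{3} [0-9]{7}/!\
&\
!/gp' | sed '/!/d')

printf 'grep:\n%s\n' "$expected"
printf 'one pass:\n%s\n' "$one_pass"
printf 'two passes:\n%s\n' "$two_pass"
```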