perl sed awk grep nawk

Extract a specific pattern from lines with sed, awk or perl


Can I use sed to extract text enclosed by a specific pair of delimiters, if it exists in a line?

Suppose I have a file with the following lines:

There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

In both cases I have to scan the line for the first occurrence of the opening delimiter, i.e. '[/' or '/*' respectively, and store the text that follows up to the closing delimiter, i.e. '/]' or '*/' respectively.

In short, I need fear and answer. If possible, can this be extended to multiple lines, in the sense that the closing delimiter occurs on a different line than the opening one?

Any help in the form of suggestions or algorithms is welcome. Thanks in advance for the replies.


Solution

  • use strict;
    use warnings;
    
    while (<DATA>) {
        while (m#/(\*?)(.*?)\1/#g) {
            print "$2\n";
        }
    }
    
    
    __DATA__
    There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
    Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
    

    As a one-liner:

    perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt
    

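    The inner while loop iterates over every match on the line, which the /g modifier makes possible. The backreference \1 ensures that the closing delimiter mirrors the opening one: if the opening slash was followed by *, the closing slash must be preceded by *.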
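
    The loop above prints each capture as it is found. If the captures are wanted as a list, and the padding inside /* answer */ should be dropped, something along these lines might work. This is only a sketch: the helper name extract_tags is made up for illustration and is not part of the answer.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (name invented for this example): return every
    # /.../ or /*...*/ capture found in one line, with padding trimmed.
    sub extract_tags {
        my ($line) = @_;
        my @found;
        while ($line =~ m#/(\*?)(.*?)\1/#g) {
            my $text = $2;              # copy before running another regex
            $text =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
            push @found, $text;
        }
        return @found;
    }

    print "$_\n" for extract_tags('know the /* answer */ but [/fear/] too');
    ```

    Copying $2 into a lexical before the substitution matters, because any later successful match resets the capture variables.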

    If you need to match blocks that extend over multiple lines, you need to slurp the input:

    use strict;
    use warnings;
    
    $/ = undef;
    while (<DATA>) {
        while (m#/(\*?)(.*?)\1/#sg) {
            print "$2\n";
        }
    }
    
    __DATA__
        There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */ 
        Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
    foo bar /
    baz 
    baaz / fooz
    

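    One-liner:

    perl -l -0777 -nwe 'while (m#/(\*?)(.*?)\1/#sg) { print $2 }' input.txt
    

    The -0777 switch and $/ = undef both cause file slurping, meaning the whole file is read into a single scalar. The /s modifier is added so that the wildcard . also matches newlines. In the one-liner, -l comes before -0777 because -l copies the current value of $/ into the output record separator $\ at the moment the switch is processed; placed after -0777 it would copy undef, and print would add no newline.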
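
    When the input comes from a filehandle other than DATA, the slurp can be confined to a single block with local, so that $/ is restored afterwards and later reads stay line-based. A minimal sketch, assuming a helper named slurp_matches (hypothetical, not from the answer) and using an in-memory filehandle so the example is self-contained:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read everything from a filehandle in one go and
    # return all captures, including ones that span several lines.
    sub slurp_matches {
        my ($fh) = @_;
        my $content = do {
            local $/;    # undef only inside this block, restored on exit
            <$fh>;
        };
        return () unless defined $content;
        my @found;
        while ($content =~ m#/(\*?)(.*?)\1/#sg) {
            push @found, $2;
        }
        return @found;
    }

    # Demo input held in a scalar; a handle from open() on a real file
    # is read the same way.
    my $text = "for [/fear/] of\nfoo bar /\nbaz\nbaaz / fooz\n";
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @matches = slurp_matches($fh);
    close $fh;
    print scalar(@matches), " matches\n";
    ```

    The second capture here contains embedded newlines, since the two lone slashes sit on different lines.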

    Explanation for the regex: m#/(\*?)(.*?)\1/#sg

    m#              # a simple m//, but with # as delimiter instead of slash
        /(\*?)      # slash followed by optional *
            (.*?)   # shortest possible string of wildcard characters
        \1/         # backref to optional *, followed by slash
    #sg             # s modifier to make . match \n, and g modifier to find every match
    

    The "magic" here is that the backreference \1 requires a star * before the closing slash only when one was found after the opening slash.
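
    One thing to keep in mind is that the regex above does not look at the brackets at all, so any two slashes delimit a match: in a/b/c it would capture b. If only the two delimiter pairs from the question should count, a stricter alternation is one possible variant. This is a different pattern from the one in the answer, and the helper name extract_strict is invented for the example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: accept only "[/ ... /]" and "/* ... */",
    # instead of any pair of slashes.
    sub extract_strict {
        my ($text) = @_;
        my @found;
        while ($text =~ m{\[/(.*?)/\]|/\*(.*?)\*/}sg) {
            # Only one of the two groups takes part in each match;
            # the other one is undef.
            push @found, defined $1 ? $1 : $2;
        }
        return @found;
    }

    print join('|', extract_strict('path a/b/c, [/fear/], /* answer */')), "\n";
    ```

    With this variant the path-like a/b/c is skipped, while the padding inside /* answer */ is still part of the capture and would need trimming separately.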