
grep/perl regex for finding a header and a matching line


Let's say I have a file called courses.txt with contents like those below. The file has sections (a course provider and the email I used), each followed by various courses. For example: edX ([email protected]), then the course names, each preceded by a serial number.

udemy ([email protected])  
"=========================="-  
1) foo bar
2) java programming language
3) redis stephen grider
4) javascript
5) react with typescript
6) kotlin
7) Etherium and Solidity : the Complete Developer's Guide
8) reactive programming with spring  


coursera ([email protected])  
"==========================-"  
1) python
2) typescript
3) java concurrency
4) C#

edX ([email protected])  
"==========================-"  
1) excel
2) scala
3) risk management
4) stock
5) oracle
6) mysql  
7) java  
==========================-    

Question: I want to grep for a course, say "java". I want a match that shows the matching line(s) (for example, "java") and the corresponding section name (say, "edX ([email protected])").

If I want to search for "java", what regex will give me the following matches? (I use grep/perl on Windows.)

udemy ([email protected])    
2) java programming language  

coursera ([email protected])  
3) java concurrency

edX ([email protected])    
7) java    

I tried lookbehind/lookahead but couldn't figure out how to print the course provider name with email and the course name.

Thoughts?


Solution

  • If you process in paragraphs (chunks of text separated by blank lines), then within each paragraph it is fairly straightforward to match the needed pattern: the header (followed by a line of ='s) and a line with java in it

    perl -00 -wnE'say "$1\n$2" 
        if /(.+?) \n "=+.+? \n .+? \n ([^\n]+\sjava\s[^\n]+)/sx' file
    

    (Tested on Linux; read on for Windows. Broken into lines for easier reading. See below for explanation of the pattern.)

    At the end of the line with =s I use .+? instead of the specific characters that follow =s in your input because your sample input isn't consistent; it has both -" and "-, in different paragraphs. Adjust as suitable.

    Since this is on Windows, where you may have to use " delimiters for the one-liner (I don't know what shell you use), you may need to replace the literal " inside the pattern with \x22 (hex for "), or your other favorite sequence.

    Hopefully good for Windows (can't test on Windows right now)

    perl -00 -wnE "say qq($1\n$2) 
        if /(.+?)\n \x22=+.+? \n .+? \n ([^\n]+\sjava\s[^\n]+) /sx
    " file
    

    The -00 switch makes it read in paragraphs. With the /x modifier, spaces inside the pattern are ignored, so we can use them to space things out for readability. With the /s modifier, the . matches a newline as well. This is important so that the middle .+? can match multiple lines, up to the one with java (surrounded by whitespace).
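
To see the three pieces (paragraph reading, /x, /s) working together, here is a minimal sketch that applies the same pattern to an in-memory copy of the data. The sample text, the placeholder e-mail address, and the name find_in_paragraphs are made up for illustration; setting $/ to the empty string is what the -00 switch does.

```perl
use strict;
use warnings;

# Hypothetical sample data; the e-mail address is only a placeholder.
my $text = <<'END';
udemy (me@example.com)
"=========================="-
1) foo bar
2) java programming language
3) redis stephen grider


coursera (me@example.com)
"==========================-"
1) python
2) typescript
3) java concurrency
4) C#
END

# find_in_paragraphs is a name invented for this sketch.
sub find_in_paragraphs {
    my ($input) = @_;
    my @found;
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
    local $/ = "";    # paragraph mode, the same setting the -00 switch makes
    while (my $para = <$fh>) {
        # /x: whitespace in the pattern is ignored
        # /s: . also matches "\n", so the middle .+? can span several lines
        if ($para =~ /(.+?) \n "=+.+? \n .+? \n ([^\n]+\sjava\s[^\n]+)/sx) {
            push @found, "$1\n$2";
        }
    }
    return @found;
}

print "$_\n\n" for find_in_paragraphs($text);
```

Run as a script, this should print each header followed by its java line, one pair per section.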

    If you don't mind having a script instead of a one-liner, which I recommend, then, for example

    use warnings;
    use strict;
    use feature 'say';
    
    local $/ = "\n\n";
    
    while (<>) { 
        say "$1\n$2" 
            if /(.+?) \n "=+.+? \n .+? \n ([^\n]+ \sjava\s [^\n]+)/sx;
    }
    

    The <> operator reads the files given on the command line, line by line, but the notion of a "line" has been set to a paragraph beforehand with local $/ = "\n\n". The local is there in case this is part of a larger program where you don't want to change the $/ variable for the whole program.
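
A script also makes it easy to report every matching course in a section, since a single regex match with one capture group returns only one line per paragraph. The following is a sketch of a different technique (split plus grep rather than one regex), not the method used above. The helper name courses_matching, the sample text, and the e-mail placeholder are assumptions made for illustration. It sets $/ to "" (paragraph mode, as -00 does), which also treats a run of several blank lines as one separator.

```perl
use strict;
use warnings;

# courses_matching is a hypothetical helper name used only in this sketch.
sub courses_matching {
    my ($text, $word) = @_;
    my @out;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "";    # paragraph mode; runs of blank lines count as one separator
    while (my $para = <$fh>) {
        # First line is the section header, the rest are the ='s line and courses
        my ($header, @lines) = split /\n/, $para;
        # Keep numbered course lines that contain $word as a whole word
        my @hits = grep { /^\d+\)/ && /\b\Q$word\E\b/ } @lines;
        push @out, join("\n", $header, @hits) if @hits;
    }
    return @out;
}

# Made-up sample data; the e-mail address is a placeholder.
my $sample = <<'END';
edX (me@example.com)
"==========================-"
1) excel
2) java
3) advanced java

coursera (me@example.com)
"==========================-"
1) python
2) javascript
END

print "$_\n\n" for courses_matching($sample, 'java');
```

Because of the \b word boundaries, a line such as "2) javascript" is not reported, so only the edX section is printed here, with both of its java lines.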


    Or, instead of using /s that makes . match newlines, use a pattern for multiple lines

    perl -00 -wnE'say "$1\n$2" 
        if /(.+) \n "=+.+ \n (?:.+\n)* (.+\sjava\s.+)/x' file
    

    Or, if you need "..." on Windows, like

    perl -00 -wnE "say qq($1\n$2) 
        if /(.+) \n \x22=+.+ \n (?:.+\n)* (.+\sjava\s.+)/x" file
    

    (Again, I can't test on Windows right now.)

    Note that now we don't have to make all those .+ non-greedy with the added ? (.+?) like in the patterns with /s above -- now that .+ stops at a newline, just as needed here.
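
The difference is easy to check in isolation. A small sketch on a made-up two-line string:

```perl
use strict;
use warnings;

# Made-up two-line string for illustration
my $text = "1) python\n2) java concurrency\n";

my ($plain)  = $text =~ /(.+)/;        # no /s: . stops before the first "\n"
my ($greedy) = $text =~ /(.+)/s;       # /s, greedy: runs to the end of the string
my ($lazy)   = $text =~ /(.+?)\n/s;    # /s, lazy: stops at the first "\n" it can

print "plain:  [$plain]\n";
print "lazy:   [$lazy]\n";
print "greedy: [$greedy]\n";
```

Without /s the greedy .+ already behaves like a "rest of this line" match, which is why the ? can be dropped in those patterns.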

    Or, use the /s modifier dynamically, via extended patterns

    perl -00 -wnE "say qq($1\n$2) 
        if /(.+) \n \x22=+.+ \n (?s).+?(?-s) (.+\sjava\s.+)/x
    " file
    

    Here (?s) "turns on" the /s modifier, which would be in effect until the end of the enclosing group (the rest of the pattern in this case), but (?-s) turns it off.
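
The effect of switching /s back off can be seen by comparing the pattern with and without (?-s). This is a sketch on a made-up paragraph; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Made-up paragraph: header, ='s line, three courses
my $para = "hdr\n\"====\"\n1) python\n2) java concurrency\n3) C#\n";

# (?s) lets the middle .+? cross lines; (?-s) restores the default,
# so the final (.+ ... .+) has to stay on a single line.
my ($scoped) = $para =~ /"=+.+ \n (?s).+?(?-s) (.+\sjava\s.+)/x;

# Without (?-s) the /s behavior stays on for the rest of the pattern,
# so the capture is free to run across newlines.
my ($unscoped) = $para =~ /"=+.+ \n (?s).+? (.+\sjava\s.+)/x;

print "scoped:   [$scoped]\n";
print "unscoped: [$unscoped]\n";
```

With (?-s) in place the capture is the single java line; without it the capture spills over onto neighboring lines.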