Let's say I have a file called courses.txt with contents like those below. The file has sections (a course provider and the email I used), each followed by a list of courses. For example: edX ([email protected]), then the course names, each preceded by a serial number.
udemy ([email protected])
"=========================="-
1) foo bar
2) java programming language
3) redis stephen grider
4) javascript
5) react with typescript
6) kotlin
7) Etherium and Solidity : the Complete Developer's Guide
8) reactive programming with spring
coursera ([email protected])
"==========================-"
1) python
2) typescript
3) java concurrency
4) C#
edX ([email protected])
"==========================-"
1) excel
2) scala
3) risk management
4) stock
5) oracle
6) mysql
7) java
==========================-
Question: I want to grep for a course, say "java", and get the matching line(s) together with the corresponding section name (for example, "edX ([email protected])").
If I search for "java", what regex will give me the following matches? (I use grep/perl on Windows.)
udemy ([email protected])
2) java programming language
coursera ([email protected])
3) java concurrency
edX ([email protected])
7) java
I tried lookbehind/lookahead but couldn't figure out how to print the course provider name with email and the course name.
Thoughts?
If you process the file in paragraphs (chunks of text separated by blank lines), then within each paragraph it is fairly straightforward to match the needed pattern: the header (followed by a line of ='s) and a line with java in it
perl -00 -wnE'say "$1\n$2"
if /(.+?) \n "=+.+? \n .+? \n ([^\n]+\sjava\s[^\n]+)/sx' file
(Tested on Linux; read on for Windows. Broken into lines for easier reading. See below for explanation of the pattern.)
At the end of the line with ='s I use .+? instead of the specific characters that follow the ='s in your input, because your sample input isn't consistent: it has both -" and "- in different paragraphs. Adjust as suitable.
Since this is on Windows, where you may have to use " delimiters for the one-liner (I don't know what shell you use), you may need to replace the literal " inside the pattern with \x22 (hex for "), or your other favorite sequence.
Hopefully good for Windows (can't test on Windows right now)
perl -00 -wnE "say qq($1\n$2)
if /(.+?)\n \x22=+.+? \n .+? \n ([^\n]+\sjava\s[^\n]+) /sx
" file
The -00 switch makes it read in paragraphs. With the /x modifier, spaces inside the pattern are ignored, so we can use them to space things out for readability. With the /s modifier the . matches a newline as well. This is important for the middle .+? to match multiple lines, up to the one with java (surrounded by spaces).†
If you don't mind having a script instead of a one-liner, which is what I recommend, then for example
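use warnings;
use strict;
use feature 'say';

# Read in paragraphs: a record now ends at two consecutive newlines
local $/ = "\n\n";

while (<>) {
    # $1 is the section header, $2 the line containing java
    say "$1\n$2"
        if /(.+?) \n "=+.+? \n .+? \n ([^\n]+ \sjava\s [^\n]+)/sx;
}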
The <> operator reads the files given on the command line, line by line, but the notion of a "line" has already been set to a paragraph with local $/ = "\n\n". That local is there in case this is a part of a larger program, where you don't want to change the $/ variable for the whole program!
† Or, instead of using /s, which makes . match newlines, use a pattern for multiple lines
perl -00 -wnE'say "$1\n$2"
if /(.+) \n "=+.+ \n (?:.+\n)* (.+\sjava\s.+)/x' file
Or, if you need "..." on Windows, like
perl -00 -wnE "say qq($1\n$2)
if /(.+) \n \x22=+.+ \n (?:.+\n)* (.+\sjava\s.+)/x" file
(Again, I can't test on Windows right now.)
Note that now we don't have to make all those .+ non-greedy with the added ? (.+?) like in the patterns with /s above, now that .+ stops at a newline, just as needed here.
Or, use the /s modifier dynamically, via extended patterns
perl -00 -wnE "say qq($1\n$2)
if /(.+) \n \x22=+.+ \n (?s).+?(?-s) (.+\sjava\s.+)/x
" file
Here (?s) "turns on" the /s modifier, which would stay in effect until the end of the enclosing group (the rest of the pattern in this case), but (?-s) turns it off.