I am trying to use JREPL.bat to match URLs containing a specific term in a txt file (and then write the result back to the file).
This is what I have so far. Unfortunately, it does not return the expected result; the result is always NULL:
JREPL.bat "href=""(\w[^""]+/pdf4v/\w[^""]+)" "" /match /f html.txt /o -
The html.txt file looks as follows (in reality the file is much more complex; additional content is represented by [...]):
[...]
<ul>
<li><a href="#" id="fav" onclick="return favoritesadd(8094,'fav.png','removefav.png');"><img id="fav8094" src="fav.png" alt="" border="0" /> <span id="fav8094">ADD TO WISHLIST</span></a></li>
<li class="sixcol right"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf/pdf8094.zip?exp=1567791065&hsh=5a49e7d4828603beddbfb058a1535f5e&dl=att&filename=pdf-00008094-16.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />16<br /><span class="small">download pdf</span></a></li>
<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>
<div class="clear"></div>
<li><a href="/details.php?id=8094&num=1&ss=1" onclick="$.open();return false;"><img src="/images/details.png" class="center" />Details</a></li>
</ul>
[...]
The expected outcome is:
https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf
Can anyone help? I am not sure why this isn't working.
Thanks in advance for your help!
The following single command line could be used in the batch file under the following preconditions:
jrepl.bat must be in the directory of the batch file containing this line.
html.txt must be in the current directory on execution of this batch file.
html.txt must not contain multiple URLs with /pdf4v/ in one line.
html.txt must not contain /pdf4v/ outside a URL.
The batch file command line:
@%SystemRoot%\System32\findstr.exe /R "href=.*/pdf4v/" html.txt | call "%~dp0jrepl.bat" "^.*href=\x22([^\x22]*?/pdf4v/[^\x22]*)\x22.*$" "$1" /O html.txt
FINDSTR supports only a very limited form of regular expressions and always outputs the entire line containing a matched string. So the case-sensitive regular expression search string href=.*/pdf4v/ finds all lines containing href= followed somewhere later on the same line by /pdf4v/.
FINDSTR writes those lines to handle STDOUT, which the Windows command processor redirects to handle STDIN of JREPL.BAT.
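The line filtering done by FINDSTR can be approximated outside of cmd.exe. The following is only a sketch in JavaScript (run under Node.js, not FINDSTR itself); it assumes that this particular search string uses nothing but . and *, which have the same meaning in FINDSTR and in JavaScript regular expressions. The variable names are illustrative, and the lines are copied from the html.txt in the question.

```javascript
// Three lines copied from html.txt in the question.
const lines = [
  '<li class="sixcol right"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf/pdf8094.zip?exp=1567791065&hsh=5a49e7d4828603beddbfb058a1535f5e&dl=att&filename=pdf-00008094-16.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />16<br /><span class="small">download pdf</span></a></li>',
  '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>',
  '<li><a href="/details.php?id=8094&num=1&ss=1" onclick="$.open();return false;"><img src="/images/details.png" class="center" />Details</a></li>'
];

// Same pattern as the FINDSTR search string: href= followed later by /pdf4v/.
const filter = /href=.*\/pdf4v\//;

// Like FINDSTR, keep the entire line whenever the pattern is found in it.
const kept = lines.filter((line) => filter.test(line));

console.log(kept.length); // 1, only the line with the /pdf4v/ URL remains
```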
JREPL.BAT runs a much more powerful JScript regular expression replace. It matches the entire line, which is known to contain href= and /pdf4v/, captures the URL containing /pdf4v/ in a marking group, and replaces the line with just the captured URL.
The search expression ^.*href="([^"]*?/pdf4v/[^"]*)".*$ is written in the batch file with \x22 for each " because cmd.exe interprets a double quote as the beginning or end of an argument string.
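The effect of this search and replace can be checked outside of cmd.exe. The following is a minimal sketch in JavaScript, run under Node.js as a stand-in for the JScript engine used by JREPL.BAT, so small engine differences are possible. The variable names are illustrative, and the sample line is copied from the html.txt in the question.

```javascript
// The line with the /pdf4v/ URL, copied from html.txt in the question.
const line = '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>';

// Same search expression as in the batch file. \x22 is the hexadecimal
// escape for a double quote, so no literal " is needed in the pattern.
const search = /^.*href=\x22([^\x22]*?\/pdf4v\/[^\x22]*)\x22.*$/;

// "$1" replaces the whole matched line with the text of the marking group.
const url = line.replace(search, "$1");

console.log(url);
```

Because the expression is anchored with ^ and $, the replacement removes everything on the line except the captured URL.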
There is an even better solution using the JREPL.BAT option /MATCH:
@call "%~dp0jrepl.bat" "[^\x22]*?/pdf4v/[^\x22]*" "" /MATCH /F html.txt /O -
The search expression [^"]*?/pdf4v/[^"]* simply matches every string that consists of 0 or more characters other than a double quote (matched non-greedily), followed by /pdf4v/ and 0 or more further characters other than a double quote. A match cannot contain a newline character either, because JREPL.BAT processes the file line by line unless option /M is used. That is very simple and can result in false positives, but works for the provided example.
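Both the expected match and a false positive can be reproduced with a short sketch in JavaScript (run under Node.js as an approximation of the JScript engine). The first two lines are copied from the html.txt in the question; the third line is hypothetical and was added only to show a false positive. The global flag is an assumption used here to collect every match on a line.

```javascript
const lines = [
  // Copied from html.txt: the URL contains /pdf/, not /pdf4v/.
  '<li class="sixcol right"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf/pdf8094.zip?exp=1567791065&hsh=5a49e7d4828603beddbfb058a1535f5e&dl=att&filename=pdf-00008094-16.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />16<br /><span class="small">download pdf</span></a></li>',
  // Copied from html.txt: the URL contains /pdf4v/.
  '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>',
  // Hypothetical line, not from the question: /pdf4v/ outside any URL.
  '<p>See the /pdf4v/ folder</p>'
];

// Same search expression as used with /MATCH.
const search = /[^\x22]*?\/pdf4v\/[^\x22]*/g;

// Process line by line and collect each matched string.
const matches = [];
for (const line of lines) {
  for (const m of line.match(search) || []) {
    matches.push(m);
  }
}

console.log(matches.length); // 2: the wanted URL and the hypothetical false positive
```

The second match is the whole hypothetical paragraph line, because it contains /pdf4v/ and no double quote at all.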
Unfortunately, the JScript regular expression engine supports neither look-behind nor other enhanced features of modern regular expression engines that could restrict the search to href values. But some false positives can be avoided using:
@call "%~dp0jrepl.bat" "[^\x22]*?/pdf4v/[^\x22]*" "" /INC "/href=\x22[^\x22]*?\/pdf4v\//" /MATCH /F html.txt /O -
Lines not containing a match for href="[^"]*?/pdf4v/ are filtered out by this include filter before the simple search expression is applied. That is still not perfect, but perhaps good enough for this task.
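The combination of include filter and match expression can be sketched in JavaScript as well (run under Node.js as an approximation of what JREPL.BAT does with /INC and /MATCH, not its actual implementation). The first line is copied from the html.txt in the question; the other two lines are hypothetical and exist only to show what the filter rejects.

```javascript
const lines = [
  // Copied from html.txt in the question.
  '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>',
  // Hypothetical: /pdf4v/ appears, but there is no href at all.
  '<p>See the /pdf4v/ folder</p>',
  // Hypothetical: /pdf4v/ appears in another attribute, not in the href value.
  '<a href="/index.html" title="/pdf4v/ archive">x</a>'
];

// Include filter as passed to /INC: href value must contain /pdf4v/.
const include = /href=\x22[^\x22]*?\/pdf4v\//;

// Simple search expression applied only to lines that pass the filter.
const search = /[^\x22]*?\/pdf4v\/[^\x22]*/g;

const matches = lines
  .filter((line) => include.test(line))
  .flatMap((line) => line.match(search) || []);

console.log(matches.length); // 1, both hypothetical lines are filtered out
```

A line that passes the filter is still searched as a whole, so a second quoted string containing /pdf4v/ on the same line as a matching href would still be reported.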