
JREPL to match URL that contains keyword


I am trying to use JREPL.bat to match URLs containing a specific term in a txt file (and then write the result back to the file).

This is what I have so far, but it does not return the expected result. The result is always NULL:

JREPL.bat "href=""(\w[^""]+/pdf4v/\w[^""]+)" "" /match /f html.txt /o -

The html.txt looks as follows (in reality the file is much more complex; additional content represented by [...]):

[...]

<ul>
<li><a href="#" id="fav" onclick="return favoritesadd(8094,'fav.png','removefav.png');"><img id="fav8094" src="fav.png" alt="" border="0" /> <span id="fav8094">ADD TO WISHLIST</span></a></li>
<li class="sixcol right"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf/pdf8094.zip?exp=1567791065&hsh=5a49e7d4828603beddbfb058a1535f5e&dl=att&filename=pdf-00008094-16.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />16<br /><span class="small">download pdf</span></a></li>
<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon"  align="left" />40<br /><span class="small">download pdf</span></a></li>
<div class="clear"></div>
<li><a href="/details.php?id=8094&num=1&ss=1" onclick="$.open();return false;"><img src="/images/details.png" class="center" />Details</a></li>
</ul>

[...]

The expected outcome is:

https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf

Can anyone help? I am not sure why this isn't working.

Thanks in advance for your help!


Solution

  • The following single command line can be used in a batch file if these preconditions are met:

    1. jrepl.bat must be in the same directory as the batch file containing this line.
    2. The file html.txt must be in the current directory when the batch file is executed.
    3. The file html.txt must not contain multiple URLs with /pdf4v/ in one line.
    4. The file html.txt must not contain /pdf4v/ outside a URL.

    The batch file command line:

    @%SystemRoot%\System32\findstr.exe /R "href=.*/pdf4v/" html.txt | call "%~dp0jrepl.bat" "^.*href=\x22([^\x22]*?/pdf4v/[^\x22]*)\x22.*$" "$1" /O html.txt
    

    FINDSTR has only very limited regular expression support and always outputs the entire line containing a matched string. So the case-sensitive regular expression search string href=.*/pdf4v/ finds all lines containing href= followed somewhere later by /pdf4v/.

    FINDSTR writes those lines to handle STDOUT, which the Windows command processor redirects to handle STDIN of JREPL.BAT.

    JREPL.BAT then runs a much more powerful JScript regular expression replace. The search expression matches the entire line, which is now known to contain href= and /pdf4v/, captures the URL containing /pdf4v/ in a marking group, and replaces the whole line with just the captured URL.

    The search expression ^.*href="([^"]*?/pdf4v/[^"]*)".*$ is written in the batch file with \x22 in place of each " because cmd.exe interprets a double quote as the beginning or end of an argument string.
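
The filter-then-replace idea can be tried outside of cmd.exe. The following is only a rough sketch in modern JavaScript (Node.js), not what jrepl.bat executes internally: the names sampleLines, lineFilter, searchExpr and extractUrls are made up for illustration, the first regular expression merely stands in for the FINDSTR step, and the two sample lines are copied from the html.txt excerpt in the question.

```javascript
// Two lines copied from the html.txt excerpt in the question.
const sampleLines = [
  '<li><a href="/details.php?id=8094&num=1&ss=1" onclick="$.open();return false;"><img src="/images/details.png" class="center" />Details</a></li>',
  '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon"  align="left" />40<br /><span class="small">download pdf</span></a></li>'
];

// Stand-in for FINDSTR /R "href=.*/pdf4v/" (an assumption for this sketch):
// keep only lines containing href= followed later by /pdf4v/.
const lineFilter = /href=.*\/pdf4v\//;

// Same search expression as passed to jrepl.bat; \x22 is the hex escape for ".
const searchExpr = /^.*href=\x22([^\x22]*?\/pdf4v\/[^\x22]*)\x22.*$/;

// Filter first, then replace each remaining line with the captured URL ($1).
function extractUrls(lines) {
  return lines
    .filter((line) => lineFilter.test(line))
    .map((line) => line.replace(searchExpr, "$1"));
}

console.log(extractUrls(sampleLines).join("\n"));
```

For the two sample lines this prints only the /pdf4v/ URL given as the expected outcome in the question. A line that does not match the search expression would pass through the replace unchanged, which is presumably why the lines are filtered before the replace is applied.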

    There is an even better solution using JREPL.BAT option /MATCH:

    @call "%~dp0jrepl.bat" "[^\x22]*?/pdf4v/[^\x22]*" "" /MATCH /F html.txt /O -
    

    The search expression [^"]*?/pdf4v/[^"]* simply matches every string consisting of zero or more characters other than a double quote (matched non-greedily), followed by /pdf4v/, followed by zero or more characters other than a double quote. A match cannot span lines because JREPL.BAT processes the file line by line by default. This is very simple and can result in false positives, but it works for the provided example.
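
To see both the match and a possible false positive, here is a small sketch in modern JavaScript (Node.js). It only approximates what /MATCH does by collecting every match on a line. The names matchExpr and findMatches are invented for this illustration, and the second test line is a hypothetical line that does not occur in the question's html.txt.

```javascript
// Same expression as in the /MATCH command line; the g flag collects all matches on a line.
const matchExpr = /[^\x22]*?\/pdf4v\/[^\x22]*/g;

// Return every match found on one line, or an empty array if there is none.
function findMatches(line) {
  return line.match(matchExpr) || [];
}

// Line copied from the question: the only match is the wanted URL.
const urlLine = '<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon"  align="left" />40<br /><span class="small">download pdf</span></a></li>';
console.log(findMatches(urlLine));

// Hypothetical line with /pdf4v/ outside a URL: the whole line is matched.
const textLine = '<p>Files are in /pdf4v/ now</p>';
console.log(findMatches(textLine));
```

The second call illustrates the false positive mentioned above: without any double quote on the line, nothing stops the match from covering the entire line.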

    Unfortunately, the JScript regular expression engine does not support look-behind or other enhanced features of modern regular expression engines that could limit the search to href values. But some false positives can be avoided using:

    @call "%~dp0jrepl.bat" "[^\x22]*?/pdf4v/[^\x22]*" "" /INC "/href=\x22[^\x22]*?\/pdf4v\//" /MATCH /F html.txt /O -
    

    Lines not matching href="[^"]*?/pdf4v/ are filtered out by this include filter before the simple search expression is applied. That is still not perfect, but perhaps good enough for this task.
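
The effect of the include filter can be sketched the same way in modern JavaScript (Node.js). This is an approximation under assumptions, not the actual JREPL.BAT implementation: hrefFilter, urlExpr and extractWithInclude are hypothetical names, and both test lines are invented for the illustration.

```javascript
// Include filter from the /INC option: the line must contain href="..../pdf4v/
const hrefFilter = /href=\x22[^\x22]*?\/pdf4v\//;

// Simple search expression from the /MATCH command line.
const urlExpr = /[^\x22]*?\/pdf4v\/[^\x22]*/g;

// Skip lines rejected by the include filter, then collect all matches on the rest.
function extractWithInclude(lines) {
  const found = [];
  for (const line of lines) {
    if (!hrefFilter.test(line)) continue;
    found.push(...(line.match(urlExpr) || []));
  }
  return found;
}

// Hypothetical lines: the first is now filtered out, the second still yields two matches.
const plainText = '<p>Files are in /pdf4v/ now</p>';
const twoQuoted = '<a href="x/pdf4v/a.zip" title="/pdf4v/ copy">';
console.log(extractWithInclude([plainText, twoQuoted]));
```

The line without an href is dropped, so that false positive is gone. The second line shows what "still not perfect" can look like: it passes the filter, and the title value containing /pdf4v/ is returned in addition to the URL.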