Tags: html, regex, bash, sed, pcre

How do I write a regex that matches only a specific selection between two tags?


Dozens of similar questions have been asked, but mine is about a specific selection between the tags. I don't want the entire selection from <a href to </a>; I only need to target the "> that sits between them.

I am trying to convert a href links into wikilinks. For example, if the sample text has:

<a href="./light.html">Light</a> is light.

<div class="reasons">

I want to edit the file itself and change <a href="link.html">Link</a> into [[link.html|Link]]. The basic idea that I have right now uses 3 sed edits as follows:

  1. <a href="link.html">Link</a> -> <a href="link.html|Link</a>
  2. <a href="link.html|Link</a> -> [[link.html|Link</a>
  3. [[link.html|Link</a> -> [[link.html|Link]]

My problem lies with the first step; I can't find the regex that only targets "> between <a href and </a>.

I understand that the basic idea would be to put the search target between a lookbehind and a lookahead, but my attempt failed when I tried it on regexr. I also tried a conditional regex. I can't find the syntax I used, but it either returned an error or it worked but also captured the div class.

Edit: I'm on Ubuntu, and I'm doing the text manipulation with sed in a bash script.


Solution

  • The basic idea that I have right now uses 3 sed edits

    Assuming you've also read the answers underneath those dozens of similar questions, you should know that it's a bad idea to parse HTML with sed (regex).

    With an HTML parser like xidel this would be as simple as:

    $ xidel -s '<a href="link.html">Link</a>' -e 'concat("[[",//a/@href,"|",//a,"]]")'
    $ xidel -s '<a href="link.html">Link</a>' -e '"[["||//a/@href||"|"||//a||"]]"'
    $ xidel -s '<a href="link.html">Link</a>' -e 'x"[[{//a/@href}|{//a}]]"'
    [[link.html|Link]]
    

    These are three equivalent queries that concatenate strings, and each one prints the output shown. The 1st query uses the XPath concat() function, the 2nd query uses the XPath || operator and the 3rd uses xidel's extended string syntax.
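
If sed is a hard requirement, the three edits from the question can probably be collapsed into a single substitution with two capture groups. This is only a minimal sketch, not a robust HTML conversion: it assumes GNU sed (as on Ubuntu), that each <a> element sits on one line, that href is its only attribute, and that the link text contains no nested tags. The function name to_wikilinks is made up for illustration.

```shell
#!/usr/bin/env bash

# to_wikilinks: hypothetical helper, reads HTML on stdin, writes to stdout.
# Group 1 captures the href value, group 2 captures the link text.
# '#' is used as the s-command delimiter so the '/' in '</a>' needs no escaping.
to_wikilinks() {
  sed -E 's#<a href="([^"]*)">([^<]*)</a>#[[\1|\2]]#g'
}

# The sample lines from the question; the <div> has no <a href>, so it is left alone.
printf '%s\n' '<a href="./light.html">Light</a> is light.' '<div class="reasons">' | to_wikilinks
```

Because the pattern requires the literal <a href=" prefix, the "> in <div class="reasons"> is never touched. To edit a file in place, the same expression can be passed to sed -i -E instead of reading from stdin. Anchors that carry extra attributes, span several lines, or wrap other tags will not match, which is the usual argument for using a parser.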