The short version of the question is: why do these regexes behave differently? That is,
href=(['"]).+?\1
vs
href=(['"]).+?['"]
or href=(['"]).+?(['"])
I am practicing regex on this site and I am trying to solve this level
http://play.inginf.units.it/#/level/6
I am posting the entire content here in case the site goes down in the future.
<tr>
<a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">
<A href="/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606" class="bodyCopy">PDF</A>(3141 KB)
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
<td width="33%" ><div align="right"> <a href="/xplorehelp/Help_start.html#Help_searchresults.html" class="subNavLinks" target="blank">Help</a> <a href="/xpl/contactus.jsp" class="subNavLinks">Contact
Kimya ile ilgili çeþitli temel referans
<a href="http://search.epnet.com/login.asp?profile=web&defaultdb=geh"
<a href="http://iimpft.chadwyck.com/" target="_parent">International
<a href="standartlar.html#tse" target="_parent">NFPA Standartlarý</a>
<a href="http://www.gutenberg.org/" target="_parent">Project Gutenberg</a>
<a href="http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek"
<a href="http://www.scitation.org" target="_parent">Scitation</a>
dergilerin listesini görmek için <a href="/online/aip.html">bu yolu</a>
<a href="http://www3.interscience.wiley.com/journalfinder.html"
<td width="46%"><a href="/xpl/periodicals.jsp" class="dropDownNav" accesskey="j">Journals & Magazines
<td><a href="http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf" class="dropDownNav">IEEE Xplore Demo</a></td>
| <a href="/xpl/tocalerts_signup.jsp" class="topUnderlineLinks">Alerts</a>
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
<a href="/search/srchabstract.jsp?arnumber=1554748&isnumber=33079&punumber=10417&k2dockey=1554748@ieeecnfs&query=%28+grammatical+evolution%3Cin%3Eti+%29&pos=9" class="bodyCopy">Abstract</a>
<td><a href="history.jsp">View Session History</a></td>
<td><a href="advsearch.jsp">New Search</a></td>
<a href="http://web5s.silverplatter.com/webspirs/start.ws?customer=kaynak"
<a href="standartlar.html#tse">Türk Standartlarý</a>
<a href="http://isiknowledge.com" target="_parent">Web of Science</a>
<a href='deneme.html#bg'>Butler Group </a>veritabanýna 31 Mart 2007 tarihine kadar deneme eriþimi alýnmýþtýr. <span class="tarih">(19.03.2007)</span>
<a href='deneme.html#ps'>Productscan</a> veritabanýna 31 Mart 2007 tarihine kadar deneme eriþimi alýnmýþtýr. <span class="tarih">(19.03.2007)</span>
I am supposed to match text like this
href="history.jsp"
That is, I need to match every href attribute in the text above.
According to the site's solutions, the answer is href=(['"]).+?\1
But if I drop that last backreference and repeat the group instead (I believe the parenthesized part is called a group; correct me if I am wrong), why do I get different results? That is, this gives me wrong results: href=(['"]).+?['"]
and so does href=(['"]).+?(['"])
The backreference has to match the same thing that the capture group matched. So the first regexp will match
"abcd"
or
'abcd'
The versions without the back-reference (with or without the second pair of parentheses) don't link the two ends of the match, so they will match the following as well:
"abcd'
or
'abcd"
So the version with the back-reference only matches a string whose opening and closing quotes are the same character.
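As a quick way to check this yourself, here is a minimal sketch in Python (the question doesn't name a language, so Python's `re` module is an assumption; the variable names are only illustrative). It tests the four strings above against both patterns, leaving out the `href=` prefix:

```python
import re

# With a backreference: the closing quote must be the same character
# that group 1 captured as the opening quote.
with_backref = re.compile(r"""(['"]).+?\1""")

# Without a backreference: either quote character may close the match.
without_backref = re.compile(r"""(['"]).+?['"]""")

samples = ['"abcd"', "'abcd'", '"abcd\'', '\'abcd"']

# Map each sample to (matches with backref, matches without backref).
results = {
    s: (bool(with_backref.fullmatch(s)), bool(without_backref.fullmatch(s)))
    for s in samples
}

for s, (strict, loose) in results.items():
    print(f"{s:8} with backref: {strict!s:5}  without: {loose}")
```

Only the two strings with matching quotes pass the back-reference pattern, while all four pass the pattern without it.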
This difference is important if you have embedded quotes in a string, e.g.
some text "<div id='foo'>" more text
The version with the back-reference will match "<div id='foo'>", but the version without the back-reference will match "<div id='.
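The same thing happens in your exercise, because the first link in your sample text has single quotes nested inside a double-quoted href. A small sketch, again assuming Python's `re` module and using illustrative variable names, runs both patterns on the example above and on that link:

```python
import re

text = """some text "<div id='foo'>" more text"""
# The first link from the sample data in the question.
link = """<a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">"""

quoted_strict = re.compile(r"""(['"]).+?\1""")    # closing quote must equal opening quote
quoted_loose = re.compile(r"""(['"]).+?['"]""")   # any quote character closes the match

href_strict = re.compile(r"""href=(['"]).+?\1""")
href_loose = re.compile(r"""href=(['"]).+?['"]""")

strict_text = quoted_strict.search(text).group(0)
loose_text = quoted_loose.search(text).group(0)
strict_href = href_strict.search(link).group(0)
loose_href = href_loose.search(link).group(0)

print(strict_text)  # "<div id='foo'>"
print(loose_text)   # "<div id='
print(strict_href)  # href="javascript:openurl('/Xplore/accessinfo.jsp')"
print(loose_href)   # href="javascript:openurl('
```

The pattern without the back-reference stops at the first quote of either kind, so it cuts the href short as soon as the value contains the other quote character.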