I'm trying to extract a link URL with the iMacros for Firefox plugin.
The page to be scraped contains the following HTML, a link URL and a description:
<div class="subcl">
<a href="http://www.url.com/someurl.html" target="_blank">description</a>
</div>
Desired output from iMacros: just the link URL
http://www.url.com/someurl.html
Since there are further links on the page, class="subcl" should be part of the selection. Maybe there is a way to express a nested structure? If possible, I would prefer a non-JavaScript solution, since I don't code in it myself.
The following macro code didn't work:
VERSION BUILD=8300326 RECORDER=FX
TAB T=1
'Open the website
URL GOTO=http://www.url.com/pagetobescraped.html
'Extract the link url on the page
TAG POS=1 TYPE=DIV ATTR=CLASS:subcl* EXTRACT=HREF
The macro returns #EANF# (Extraction Anchor Not Found, i.e. no match). When I replace EXTRACT=HREF with EXTRACT=TXT it returns "description", but I need the URL.
Edit, to clarify in response to symbiotech's answer: the input HTML is preceded by an <h1> element as well as a <p> element. Altogether it looks like this:
<h1>Title of the page</h1><p class="intro"></p>
<div class="subcl">
<a href="http://www.url.com/someurl.html" target="_blank">description</a>
</div>
You need to extract the href from an a element, not from the div itself. Also, since you say there are other links on the page, you need to take each "subcl" div as a reference point, hence the POS=R1:
TAG POS=1 TYPE=DIV ATTR=CLASS:subcl*
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
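As an alternative sketch of the nested structure asked about in the question, the TAG command also accepts an XPATH selector, which can name the div and the link inside it in one step. This assumes your iMacros build supports TAG XPATH together with EXTRACT, and that the class attribute is exactly "subcl"; treat it as something to try rather than a guaranteed fix.

```
' Assumption: TAG XPATH is available in this iMacros build
' Select the first <a> that is a direct child of <div class="subcl">
TAG XPATH="//div[@class='subcl']/a" EXTRACT=HREF
```

If the div carries additional classes, the exact-match test on @class would fail and the relative-positioning version above is the safer choice.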
If you need multiple links extracted, run this with the "Play (Loop)" button instead:
TAG POS={{!LOOP}} TYPE=DIV ATTR=CLASS:subcl*
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
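When looping, you will probably want to store each extracted URL instead of clicking away the extraction popup. A minimal sketch, assuming the same loop macro as above; the file name links.csv is only a placeholder, and FOLDER=* means the default iMacros downloads folder:

```
' Suppress the extraction dialog so the loop runs unattended
SET !EXTRACT_TEST_POPUP NO
TAG POS={{!LOOP}} TYPE=DIV ATTR=CLASS:subcl*
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
' Append the extracted value as a new line (links.csv is a hypothetical name)
SAVEAS TYPE=EXTRACT FOLDER=* FILE=links.csv
```

Set the loop maximum to the number of "subcl" divs you expect; once POS runs past the last div, the TAG command no longer finds a match.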
EDIT for your specific case:
You need to position yourself above the elements you want to extract, but on the same tree level, in order to use relative positioning properly. That empty p element seems a good enough anchor, or you could use the h1 element if its text doesn't change too much:
TAG POS=1 TYPE=P ATTR=CLASS:intro
TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=P ATTR=CLASS:intro
TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:* EXTRACT=HREF