Tags: firefox-addon, web-scraping, imacros

iOpus iMacros script for simple URL scrape


I am trying to extract a link URL with the iMacros plugin for Firefox.

The following HTML is on the website to be scraped: a link URL and a description.

<div class="subcl">
    <a href="http://www.url.com/someurl.html" target="_blank">description</a>
</div>

Desired output from iMacros: simply the link url

http://www.url.com/someurl.html

Since there are other links on the page, the macro should target class="subcl" specifically. Maybe there is a way to express a nested structure? I would prefer a non-JavaScript solution if possible, since I don't write JavaScript myself.

The following macro didn't work:

VERSION BUILD=8300326 RECORDER=FX
TAB T=1

'Open the website
URL GOTO=http://www.url.com/pagetobescraped.html

'Extract the link url on the page
TAG POS=1 TYPE=DIV ATTR=CLASS:subcl* EXTRACT=HREF

The macro returns #EANF# (Extraction Anchor Not Found, i.e. no matching element). When I replace EXTRACT=HREF with EXTRACT=TXT it returns "description", but I need the URL.


Edit

To clarify in response to symbiotech's answer: the input HTML is preceded by an <h1> element and a <p> element. All together it looks like this:

<h1>Title of the page</h1><p class="intro"></p>

<div class="subcl">
    <a href="http://www.url.com/someurl.html" target="_blank">description</a>
</div>

Solution

  • You need to extract the href from the a element, not from the div itself. Also, since you say there are other links on the page, you need to use each "subcl" div as the reference point, hence the POS=R1:

    TAG POS=1 TYPE=DIV ATTR=CLASS:subcl*
    TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
    

    If you need multiple links extracted, use the "Play Loop" button with this instead:

    TAG POS={{!LOOP}} TYPE=DIV ATTR=CLASS:subcl*
    TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
    

    EDIT for your specific case: You need to place the anchor above the elements you want to extract, but on the same tree level, in order to use relative positioning properly. That empty p element seems a good enough anchor, or you could use the h1 element if its text doesn't change too much:

    TAG POS=1 TYPE=P ATTR=CLASS:intro
    TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:* EXTRACT=TXT
    TAG POS=1 TYPE=P ATTR=CLASS:intro
    TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:* EXTRACT=HREF
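
A possible follow-up to the "Play Loop" variant, as a sketch only: each pass could write the extracted URL to a CSV file instead of showing the extraction dialog. The file name links.csv is an assumption for illustration, and the macro assumes the page to be scraped is already open in the first tab.

```
TAB T=1
'Suppress the extraction pop-up so the loop can run unattended
SET !EXTRACT_TEST_POPUP NO
'Anchor on the Nth div with class subcl (N = current loop count)
TAG POS={{!LOOP}} TYPE=DIV ATTR=CLASS:subcl*
'Take the first link after that anchor and extract its URL
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
'Append the extracted value to a CSV file in the iMacros default folder
'(links.csv is a hypothetical file name)
SAVEAS TYPE=EXTRACT FOLDER=* FILE=links.csv
```

When playing this with "Play Loop", the maximum loop count would be set to the number of subcl blocks expected on the page.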