
How can I save all href values from list items to a text file with iMacros?


I am new to iMacros and have version 12.0.501.6698 installed on Windows 10 Pro 20H2 (OS Build 19042.1110).

I am trying to extract all the href values located in multiple list items from the HTML of a page, and then save those URLs to a text file.

The number of list items can vary, so I cannot use a loop with a known number of iterations; I have to grab all the list items and extract their href attribute values.

Example of the format of the HTML code:

<ul class="bullet-list columns-2 columns--regular">
<li><a href="/search/agents/results.htm?location=ampthill" >Estate Agents in Ampthill</a></li>
<li><a href="/search/agents/results.htm?location=barton_le_clay" >Estate Agents in Barton-Le-Clay</a></li>
<li><a href="/search/agents/results.htm?location=bedford" >Estate Agents in Bedford</a></li>
<li><a href="/search/agents/results.htm?location=biggleswade" >Estate Agents in Biggleswade</a></li>
<li><a href="/search/agents/results.htm?location=bromham" >Estate Agents in Bromham</a></li>
<li><a href="/search/agents/results.htm?location=clapham_beds" >Estate Agents in Clapham</a></li>
</ul>

I have looked at the code in similar questions, such as "iMacros: Extract ID attribute from a ul li list".

This is the code I have tried in iMacros:

VERSION BUILD=12.0.501.6698
TAB T=1
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAB CLOSEALLOTHERS
'SET !PLAYBACKDELAY 0.00
URL GOTO=https://www.home.co.uk/search/agents/?county=beds

TAG POS=1 TYPE=UL ATTR=ID:bullet-list EXTRACT=LI
TAG POS=R{{!LOOP}} TYPE=A ATTR=ID:* EXTRACT=HREF
SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt

I get an error box.

I have also tried various permutations of the TYPE values and of what to EXTRACT.

The text file that should store the URLs from the href attributes in this line:

<li><a href="/search/agents/results.htm?location=clapham_beds" >Estate Agents in Clapham</a></li>

just contains the line #EANF#, not "/search/agents/results.htm?location=clapham_beds".


Solution

  • Parallel Thread on the iMacros Forum (Opened by me..., as I don't really trust this Site for "Continuity"...):

    [Time spent writing this Answer: About [10] Hours...!]
    => SO Posting: ~2h approx, writing Scripts and Testing: ... the rest, ah-ah...!
    (And first time ever I spend so much time on an Answer, annoying to have to constantly "fight" against the "Design" of the Site...)


    I'll address your different Questions more or less in reverse order and post 3 different Solutions as "minimalistic" Scripts. Along the way I will mention several Concepts/Techniques that I won't explain in depth, or I'd need to quote half of the Wiki and/or of the iMacros Forum...
    (Terms I enclose between Single Quotes ('') or Backticks (``) are such Terms...)


    Warning Popup about "Loop" and "Play":
    Well, read the Msg on that Popup, it looks pretty clear and self-explanatory to me...
    (Well, apart from the ugly Typo in it, of course...!)


    Getting #EANF# in the EXTRACT and SAVEAS:
    => Yep, normal. That's because you are using the UL Element as 'Anchor' for 'Relative Positioning', but the UL Element is actually the 'Container' for all the LI + A Elements, so you would need to use 'Double Relative Positioning' in this Case...
    (More Explanation on the Forum, where I've explained the Concept/Technique many times already...)


    The number of list items can be different so I cannot use a loop of a known number of iterations...
    (Emphasis mine...)

    Hum, well..., this is not really true. Looping would actually be the "easiest" Implementation in my Opinion: SAVEAS then takes care of saving each Link on a separate/new Row on each Loop (otherwise you'd need to implement a Mechanism for that yourself), and you can simply let iMacros abort your Script "naturally" when an Element is "not found", i.e. when there is no further Link to extract...
    ... And I will use 2 different Mechanisms for that part...

    (All Scripts written and tested in iMacros for FF v8.8.2, PM v26.3.3, Win10_PRO_x64_21H1.)


    Implementation 1: Looping + Abort on Not Found:

    And that will give something like:

    VERSION BUILD=8820413 RECORDER=FX
    TAB T=1
    
    SET Search_Keyword "Estate Agents"
    
    'Debug:
    'SET !LOOP 15
    
    'URL GOTO=https://www.home.co.uk/search/agents/?county=beds
    
    'Extract Links using 'Relative Positioning':
    'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
    TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
    TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=HREF
    '>
    'Debug:
    'PROMPT {{!EXTRACT}}
    
    'Save Link to '.CSV' (or '.TXT'):
    SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
    'SAVEAS TYPE=EXTRACT FOLDER=* FILE=SOF_MSB.txt
    
    'Abort Script if no more Link(s) to extract:
    SET !TIMEOUT_STEP 1
    TAG POS=R1 TYPE=LI ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
    

    Yep, OK, this one works already...
    There are 21 Links on the Page with the URL provided; I looped the Script 30 times, and it aborts by itself at the end of Loop=21...!

    • Notice that I don't use !ERRORIGNORE, and the Abort Func actually relies on that...
    • While I extract and loop on the "Links" (the A Elements), I "switched back" to an LI Element for the 2nd R-POS that aborts the Script. Using the next Link there would not work: with EXTRACT, the Command never aborts a Script (by Design) and simply returns #EANF# if the Element is not found, and without EXTRACT, the TAG Command would click on and follow the Link on every Loop before the last one.
    • And !EXTRACT_TEST_POPUP can be omitted when looping a Script...
    • Works "best" with the Page already loaded once "manually", as reloading the Page on every Loop would slow down the execution... If the Page "really" needs to be loaded from the Script, it's possible to add a Mechanism for a 'Conditional URL GOTO' (another "Concept/Technique" to search the iMacros Forum for, ah-ah...!) that loads the Page only for Loop=1...

    Implementation 2: Looping + Abort with MacroError() + Report:

    Alright..., and this one would be my "Favorite"...!:
    Same as Script_1, but the Abort Check can now be applied to an A Element, and using MacroError() allows displaying a mini-Report in the iMacros Side-Panel, for example:

    VERSION BUILD=8820413 RECORDER=FX
    TAB T=1
    
    SET Search_Keyword "Estate Agents"
    
    'Debug:
    'SET !LOOP 15
    
    'URL GOTO=https://www.home.co.uk/search/agents/?county=beds
    
    'Extract Links using 'Relative Positioning':
    'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
    TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=TXT
    SET Title {{!EXTRACT}}
    SET !EXTRACT NULL
    TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=HREF
    '>
    'Debug:
    'PROMPT {{!EXTRACT}}
    
    'Save Link to '.CSV' (or '.TXT'):
    'SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
    SAVEAS TYPE=EXTRACT FOLDER=* FILE=SOF_MSB.txt
    
    'Abort Script if no more Link(s) to extract:
    SET !TIMEOUT_STEP 1
    SET !EXTRACT NULL
    'TAG POS=R1 TYPE=LI ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
    TAG POS=R1 TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=TXT
    
    'Prepare mini-Report:
    SET Report {{!LOOP}}<SP>Links<SP>extracted<SP>for:<BR>{{Title}}
    SET Summary (No<SP>Error...!!)<SP>({{!NOW:yyyy-mm-dd<SP>hhhnn}})<BR><BR>{{Report}}<BR><BR>
    
    SET !ERRORIGNORE NO
    SET Abort_Report EVAL("var s='{{!EXTRACT}}'; if(s=='#EANF#'){MacroError(\"{{Summary}}\");}")
    SET !ERRORIGNORE YES
    

    Like Script_1, I looped it 30 or 50 times, and it will display:

    MacroError: (No Error...!!) (2021-07-22 15h57)
    
    21 Links extracted for:
    Estate Agents in Bedfordshire
    
    , line 36 (Error code: -1340)
    

    (The Content of the mini-Report can be customized of course..., and can also be saved to some separate '.log' File...)


    Implementation 3: Extract all LI Elements with 1 EXTRACT from the Containing UL Element:

    This one is a "quick and dirty" Demo, as I find it a bit of a cumbersome Implementation, but here you go...:

    VERSION BUILD=8820413 RECORDER=FX
    TAB T=1
    
    SET Search_Keyword "Estate Agents"
    
    URL GOTO=https://www.home.co.uk/search/agents/?county=beds
    
    'TAG POS=1 TYPE=LI ATTR=TXT:Estate<SP>Agents<SP>in<SP>Ampthill
    'TAG POS=1 TYPE=LI ATTR=TXT:Estate<SP>Agents<SP>in<SP>Barton-Le-Clay
    'TAG POS=1 TYPE=DIV ATTR=TXT:Estate<SP>agent<SP>listings<SP>are<SP>available<SP>for<SP>th* EXTRACT=HTM
    
    'TAG POS=1 TYPE=P ATTR=TXT:Estate<SP>agent<SP>listings<SP>are<SP>available*
    'TAG POS=R1 TYPE=UL ATTR=* EXTRACT=HTM
    
    'Hum, can better use the 'H1' Element as Anchor...:
    'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
    TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
    TAG POS=R1 TYPE=UL ATTR=* EXTRACT=HTM
    
    SET Results_HREF EVAL("var s='{{!EXTRACT}}'; var w,x,y,z; w=s.split('regular\">')[1]; x=w.split('\"'); y=x[1]+','+x[3]+','+x[5]; z=y.split(',').join('\\r\\n'); z;")
    '>
    'Debug:
    PROMPT Results:<BR><BR>_{{Results_HREF}}_
    
    'Not really finished... (Quick and dirty Demo...)
    
    'Save Links to '.CSV' (or '.TXT'):
    'SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
    
    '>>>
    
    'Extracted:
    '<ul style="outline: 1px solid blue;" class="bullet-list columns-2 columns--regular"> 
    '<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> 
    '<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=barton_le_clay">Estate Agents in Barton-Le-Clay</a></li> 
    '<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> 
    '<li><a href="/search/agents/results.htm?location=biggleswade">Estate Agents in Biggleswade</a></li> 
    '<li><a href="/search/agents/results.htm?location=bromham">Estate Agents in Bromham</a></li> 
    '<li><a href="/search/agents/results.htm?location=clapham_beds">Estate Agents in Clapham</a></li> <li><a href="/search/agents/results.htm?location=dunstable">Estate Agents in Dunstable</a></li> <li><a href="/search/agents/results.htm?location=flitwick">Estate Agents in Flitwick</a></li> <li><a href="/search/agents/results.htm?location=harlington">Estate Agents in Harlington</a></li> <li><a href="/search/agents/results.htm?location=henlow">Estate Agents in Henlow</a></li> <li><a href="/search/agents/results.htm?location=houghton_regis">Estate Agents in Houghton Regis</a></li> <li><a href="/search/agents/results.htm?location=kempston">Estate Agents in Kempston</a></li> <li><a href="/search/agents/results.htm?location=langford">Estate Agents in Langford</a></li> <li><a href="/search/agents/results.htm?location=leighton_buzzard">Estate Agents in Leighton Buzzard</a></li> <li><a href="/search/agents/results.htm?location=linslade">Estate Agents in Linslade</a></li> <li><a href="/search/agents/results.htm?location=luton">Estate Agents in Luton</a></li> <li><a href="/search/agents/results.htm?location=potton">Estate Agents in Potton</a></li> <li><a href="/search/agents/results.htm?location=sandy">Estate Agents in Sandy</a></li> <li><a href="/search/agents/results.htm?location=shefford">Estate Agents in Shefford</a></li> <li><a href="/search/agents/results.htm?location=stotfold">Estate Agents in Stotfold</a></li> 
    '<li><a href="/search/agents/results.htm?location=toddington">Estate Agents in Toddington</a></li> </ul>
    

    About the Script: the y part is the "quick and dirty" bit, as it only handles the first 3 Links...
    Neater would be a for Loop over the x Array (Incr=2) with Array.push(); otherwise, building the y String would need to be "hard-coded" 30 or 50 times...
    As written, this Technique can be useful for a "short" list (<10), or for a known fixed number of items to extract (on one Run or per Loop).

    => See Content of the Debug PROMPT...

    (And the Script needs to be run only once, => with the 'Play' Button, not with the 'Loop' Button.)
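
    The for Loop mentioned above could look roughly like the following plain JavaScript. This is only a sketch and not one of the tested Scripts: the function name hrefsBySplit and the shortened sample string are made up for illustration, and it assumes "clean" HTML in which the only Double Quotes after the opening UL Tag are the ones around the HREF Values. Embedding it in an EVAL() would also need the same escaping of Quotes as in the Script above.

```javascript
// Hypothetical helper (not part of the tested Scripts): collect every href
// value by splitting on the double quote, like the EVAL() in Implementation 3.
function hrefsBySplit(html) {
  // Drop everything up to the end of the opening <ul ...> tag.
  var body = html.split('regular">')[1];
  var parts = body.split('"');
  var links = [];
  // Assuming no other quoted attributes, href values sit at indexes 1, 3, 5, ...
  for (var i = 1; i < parts.length; i += 2) {
    links.push(parts[i]);
  }
  // One link per row, as in the original y/z part.
  return links.join('\r\n');
}

// Shortened sample in the same format as the extracted HTML.
var sample = '<ul class="bullet-list columns-2 columns--regular"> ' +
  '<li><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> ' +
  '<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> </ul>';

console.log(hrefsBySplit(sample));
```

    The Loop makes the number of Links irrelevant, so nothing has to be "hard-coded" per Link.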

    Hum, and I should mention that for this Implementation it is actually "recommended" to "fresh"-load (or reload) the Page (=> with the URL GOTO that I have (re)activated in this Script), and certainly not to "play" with iMacros on that Page before the Script runs. Otherwise iMacros (Recording or Replay) will inject some Styling into the HTML Structure of the Page, as visible in my "Extracted:" Section with style="outline: 1px solid blue;" on the UL Element and the first 2 LI Elements...
    The "Problem" is that this extra style Attribute contains 2 Double Quotes ("...") each time, and I based one of the split() Statements in the EVAL() on this very Double Quote Char (") to isolate the HREF Values. With the extra Quotes, the x[1]/x[3]/x[5]/etc. Indexes shift to higher Values, and the Increment changes as well...
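
    One possible way around the shifting Indexes is to match the href Attributes directly instead of counting Double Quotes. The following is a hedged sketch in plain JavaScript, not something tested inside iMacros: the function name hrefsByRegex and the sample string are hypothetical, and the pattern assumes the HREF Values are always wrapped in Double Quotes, as in the extracted HTML shown above.

```javascript
// Hypothetical alternative (an assumption, not the original approach):
// match href="..." inside <a ...> tags, so that injected style="..."
// attributes on UL/LI elements do not shift any array indexes.
function hrefsByRegex(html) {
  var re = /<a\s[^>]*?href="([^"]*)"/g;
  var links = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]); // capture group 1 holds the href value
  }
  return links;
}

// Sample with the styling that iMacros may inject during Recording/Replay.
var styledSample = '<ul style="outline: 1px solid blue;" class="bullet-list columns-2 columns--regular"> ' +
  '<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> ' +
  '<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> </ul>';

console.log(hrefsByRegex(styledSample).join('\r\n'));
```

    Used inside an EVAL(), the Backslashes and Double Quotes in the pattern would need additional escaping, which is left out of this sketch.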