How to exclude some data using XPath in Web-Harvest

I am using Web-Harvest with XQuery to get data from a website.

I have two XQuery variables. $text holds the following data:


<p> <strong>Psoria-Shield Inc.</strong> (<a href=""></a><a href="/Tracker?data=gB90UgQvS9bs99znBBkklh-mudx4NTcPFIy_wiP7zUJ-qBXYABNid0GYgW4g7qVsjn3_dv2FPGzaYgKnhq_Ujg%3D%3D" target="_top"></a>) is a Tampa FL based company specializing in design, manufacturing, and distribution of medical devices to domestic and international
                  markets. PSI employs full-time engineering, production, sales staff, and manufactures within an ISO 13485 certified quality
                  system. PSI's flagship product, Psoria-Light&#174;, is FDA-cleared and CE marked and delivers targeted UV phototherapy for
                  the treatment of certain skin disorders. Psoria-Shield Inc., was acquired by Wellness Center USA Inc. ("WCUI") in August 2012,
                  and is now a wholly-owned subsidiary.
               <p> <strong>AminoFactory</strong> (<a href=""></a><a href="/Tracker?data=O0xbFRJiVuWDzRDq7SVwVR9xAPYLIGQyBw4mDziUrH4KB3DIYUasiO_O78eteJsv2doAGtg4kRhAqmnvkQ-9LA%3D%3D" target="_top"></a>), a division of Wellness Center USA, Inc., is an online supplement store that markets and sells a wide range of high-quality
                  nutritional vitamins and supplements. By utilizing AminoFactory's online catalog, bodybuilders, athletes, and health conscious
                  consumers can choose and purchase the highest quality nutritional products from a wide array of offerings in just a few clicks.
                <pre>At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top"></a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top"></a></pre> </span><span class="dt-green">

and $contact:

At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top"></a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top"></a>

(The text above is just an example.)

What I want to do is remove the content of $contact from $text. So far I have come up with the following code:

    for $x in $text
        return if(matches($contact, '')) then $x
            else if(matches($contact, $x)) then  '' else $x 

It is not working, and I don't know where I am going wrong. Please let me know the right way of doing this.


  • Do not use matches(...) for exact string comparison. It is made for regular expressions, so you would need to escape a number of special characters (the parentheses in the phone number, for example).

    If the HTML subtree is the exact same, use this:

    $text[not(deep-equal(., <pre>{ $contact }</pre>))]

    If you only want to compare its contents, use data(...):

    $text[not(data(.) = string-join(data($contact), ''))]

    But given the data you posted, you'd be fine just removing all <pre/> nodes:

    $text[local-name() != 'pre']
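
    To see why the matches(...) version in the question never removes anything, here is a minimal sketch. The values bound to $contact and $text are shortened, hypothetical stand-ins for the real data, and the query assumes a plain XQuery 1.0 processor:

    ```xquery
    (: Hypothetical, shortened stand-ins for the question's variables. :)
    let $contact := "Tel: (847) 925-1885"
    let $text := (
      <p>Psoria-Shield Inc. is a Tampa FL based company.</p>,
      <pre>Tel: (847) 925-1885</pre>
    )
    return (
      (: The empty pattern matches any string, so this test is always true
         and the first branch of the original "if" always returns $x. :)
      matches($contact, ''),

      (: Unescaped parentheses form a regex group, so the pattern looks for
         "Tel: 847 925-1885" and does not match the literal "(847)". :)
      matches($contact, $contact),

      (: Plain string comparison on the string value of each node:
         keeps the <p> element and drops the <pre> element. :)
      $text[not(string(.) = $contact)]
    )
    ```

    The first two items show the two problems with the original loop; the last item is the comparison-based filter applied to the sample nodes.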