Tags: python, web-scraping, lxml

Get data between two tags in Python


<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

Using Python, I want to get the text inside the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set

I tried using lxml:

from io import StringIO
from lxml import etree

# html holds the markup shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

and got the following output:

['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

Solution

  • You could use the text_content method:

    import lxml.html as LH
    
    html = '''<h3>
    <a href="article.jsp?tp=&arnumber=16">
    Granular computing based
    <span class="snippet">data</span>
    <span class="snippet">mining</span>
    in the views of rough set and fuzzy set
    </a>
    </h3>'''
    
    root = LH.fromstring(html)
    for elt in root.xpath('//a'):
        print(elt.text_content())
    

    yields

    Granular computing based
    data
    mining
    in the views of rough set and fuzzy set
    

    or, to remove whitespace, you could use

    print(' '.join(elt.text_content().split()))
    

    to obtain

    Granular computing based data mining in the views of rough set and fuzzy set
    

    Here is another option which you might find useful:

    print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))
    

    yields

    Granular computing based data  mining in the views of rough set and fuzzy set
    

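    (Note, however, that this leaves an extra space between data and mining: the whitespace-only text node between the two spans strips to an empty string, and join still puts a space on either side of it.)

    '//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It selects every text node under the a element at any depth (children, grandchildren, and so on), not just the text directly inside a or inside its span children.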
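
The extra space noted above can be avoided in at least two ways. The following is a minimal sketch, not part of the original answer: the variable names filtered and normalized are illustrative. It either drops the text nodes that are empty after stripping, or uses the standard XPath 1.0 function normalize-space() to collapse the whitespace.

```python
import lxml.html as LH

# Same markup as in the question.
html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)

# Option 1: strip every text node, then drop the ones that end up empty,
# so the whitespace-only node between the two spans adds no extra space.
parts = (t.strip() for t in root.xpath('//a/descendant-or-self::text()'))
filtered = ' '.join(p for p in parts if p)

# Option 2: let XPath do the work. normalize-space() takes the string value
# of the first matching <a>, trims it, and collapses whitespace runs.
normalized = root.xpath('normalize-space(//a)')

print(filtered)
print(normalized)
```

Both lines print the text with single spaces throughout. Option 2 only considers the first a element in the document, so for pages with several links the loop over root.xpath('//a') from the answer is still needed.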