Tags: web-scraping, xpath, scrapy, css-selectors

How to Extract the Name 'Terence Crawford' from an HTML Segment, Excluding the Span Element?


I am currently facing difficulty retrieving the name 'Terence Crawford' from an HTML segment. The challenge lies in excluding the span element, which is present within the same parent element.

<td colspan="3" style="position:relative;" class="defaultTitleAlign">
<h1 style="display:inline-block;margin-right:5px;line-height:30px;">
                        <span style="font-weight:bold;"><i class="fas fa-crown" style="color:#f6b501 !important;"></i></span>
                    "Terence Crawford"
    </h1>
<div style="width:100%;position:relative;margin-top:5px;">
</div>
</td>

I tried selecting the h1 both through its parent's class attribute 'defaultTitleAlign' and through its own style attribute 'display:inline-block;margin-right:5px;line-height:30px;', but either way I only get '\n' followed by whitespace instead of the actual name. Even when targeting the entire content of the h1 element, the name is not returned.

In [9]: response.xpath("//td[@class='defaultTitleAlign']/h1/text()").get()
Out[9]: '\n                        '

Solution

  • get() returns only the first matching text node, which here is the whitespace between the opening h1 tag and the span. Use getall() instead to collect every text() node under the selector, then pick the one you are looking for from the returned list.

    For example:

    In [1]: from scrapy.selector import Selector
    
    In [2]: html = """<td colspan="3" style="position:relative;" class="defaultTitleAlign">
       ...: <h1 style="display:inline-block;margin-right:5px;line-height:30px;">
       ...:                         <span style="font-weight:bold;"><i class="fas fa-crown" style="color:#f6b501 !important;"></i></span>
       ...:                     "Terence Crawford"
       ...:     </h1>
       ...: <div style="width:100%;position:relative;margin-top:5px;">
       ...: </div>
       ...: </td>"""
    
    In [4]: response = Selector(text=html)
    
    In [5]: text_list = response.xpath("//td[@class='defaultTitleAlign']/h1//text()").getall()
    
    In [6]: text = text_list[1].strip()
    
    In [7]: text
    Out[7]: '"Terence Crawford"'
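
Indexing with text_list[1] depends on the exact position of the name among the text nodes, so it may break if the markup changes slightly. A minimal sketch of a less position-dependent approach, assuming the name is the first text node that is not pure whitespace: the helper first_non_blank is a hypothetical name introduced here for illustration, and text_list is hard-coded to mirror what getall() returned in the session above.

```python
def first_non_blank(text_nodes):
    """Return the first text node that is not just whitespace, stripped.

    Hypothetical helper for illustration; returns None if every node is blank.
    """
    for node in text_nodes:
        stripped = node.strip()
        if stripped:
            return stripped
    return None


# Hard-coded stand-in for the list returned by .getall() in the session above.
text_list = [
    "\n                        ",
    '\n                    "Terence Crawford"\n    ',
]

name = first_non_blank(text_list)

# The quotes are literally part of the HTML snippet; drop them if unwanted.
bare_name = name.strip('"') if name is not None else None
```

In a spider, the same helper could be applied directly to the result of the getall() call instead of the hard-coded list.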