python, web-scraping, python-requests, lxml, python-requests-html

LXML/Requests-HTML: Get element after plain text, both children of same element?


I'm new to web scraping, and I'm currently using requests, requests-html, and lxml.

I'm having trouble figuring out how to target the "DESIRED INFO" from the specific <span> element in the following circumstance (a lot of info is thrown into a single <td>):

Note: no attributes differentiate the span elements, so I need to go by the plain text within the <td> that occurs right before the span.

Note 2: there are no wrapper elements (e.g. <p>, <b>, etc.) around that plain text (OTHER TEXT, CONSISTENT TEXT:, etc.); it is just plain HTML text, an immediate "child" of the <td>.

<td>
  OTHER TEXT
  <span>...</span>
  ...
  ...
  OTHER TEXT 2
  <span>...</span>
  ...
  ...
  CONSISTENT TEXT:
  <span>DESIRED INFO</span>
  ...
  ...
  OTHER TEXT 3
  <span>...</span>
  ...
  ...
  OTHER TEXT 4
  <span>...</span>
  ...
  ...
</td>

My current approach is to search for every value that could appear in the DESIRED INFO spot and grab the <span> element by its contents. That is insufficient, because the <span> elements that follow the OTHER TEXT labels can contain the same values.

What I'm currently doing (insufficient):

  spanDesiredInfoList = []
  for el in tree.xpath('//span[text()="POSSIBILITY 1"]'):
    spanDesiredInfoList.append(el)
  for el in tree.xpath('//span[text()="POSSIBILITY 2"]'):
    spanDesiredInfoList.append(el)
  ...
  ...
  # attempt to handle final list and get the correct span (basically impossible)

Thank you for your help!


Solution

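  • In this markup, the text node at index i in the <td>'s list of text nodes is the text immediately preceding the <span> at index i, so the index of the text node containing the desired substring is also the index of the desired <span>. This only holds when every <span> has non-empty text, no other elements sit between the text and the spans, and the page has a single such <td>; otherwise the two lists drift out of step: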

    from lxml import etree
    
    text = '''
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        </head>
        <body>
            <table>
                <tr>
                    <td>
                        OTHER TEXT
                        OTHER TEXT
                        <span>uuu</span>
                        OTHER TEXT 2
                        OTHER TEXT 2
                        <span>ttt</span>
                        OTHER TEXT
                        CONSISTENT TEXT:
                        OTHER TEXT
                        <span>DESIRED INFO</span>
                        OTHER TEXT 3
                        <span>uuu</span>
                        OTHER TEXT 4
                        <span>ttt</span>
                    </td>
                </tr>
            </table>
        </body>
    </html>
    '''
    
    
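    html = etree.HTML(text)
    # Collect the <td>'s text nodes and each <span>'s text, both in document order.
    td_texts = html.xpath('//td/text()')
    span_texts = html.xpath('//span/text()')
    # Find the <td> text node containing the marker; the <span> text at the
    # same index is the one that follows it.
    for i, td_text in enumerate(td_texts):
        if 'CONSISTENT TEXT:' in td_text:
            print(span_texts[i])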
    
  Output:

    DESIRED INFO
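
  • As an alternative that does not depend on the two lists lining up, the text node can be selected directly in XPath and the nearest following <span> sibling taken from there. The following is a minimal sketch, not a definitive implementation: the sample markup and the helper name span_after_label are made up for illustration, and it assumes the label and the <span> are siblings inside the same <td>, as in the question. The sample deliberately includes an empty <span>, which would shift the indices in the list-based approach.

```python
from lxml import etree

# Hypothetical sample modeled on the question; the real page will differ.
# The empty <span> is included on purpose: it has no text node, so it would
# throw off an index-based lookup, but it does not affect the XPath below.
SAMPLE = """
<table><tr><td>
  OTHER TEXT
  <span></span>
  CONSISTENT TEXT:
  <span>DESIRED INFO</span>
  OTHER TEXT 3
  <span>uuu</span>
</td></tr></table>
"""


def span_after_label(tree, label):
    """Return the first <span> that follows the <td> text node containing
    `label`, or None if there is no such text node or span.
    (Illustrative helper, not part of lxml.)"""
    matches = tree.xpath(
        # text()[...] selects the matching text node of the <td>;
        # following-sibling::span[1] is the nearest <span> after it.
        '//td/text()[contains(., $label)]/following-sibling::span[1]',
        label=label,  # passed as an XPath variable, so no quoting issues
    )
    return matches[0] if matches else None


tree = etree.HTML(SAMPLE)
span = span_after_label(tree, "CONSISTENT TEXT:")
print(span.text if span is not None else "not found")  # prints: DESIRED INFO
```

    Because the helper returns the element rather than its text, the caller can also read attributes or child nodes of the matched <span> if needed.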