python jenkins attributeerror jenkins-cli html-parser

Parse HTML file using Python without external module


I am trying to parse an HTML file using Python without any external module. The reason is that I trigger a Jenkins job that runs into import errors with lxml and BeautifulSoup. I tried resolving them, but I suspect I am over-engineering what should be a simple task.

Input:

    <tr class="test">
    <td class="test">
      <a href="a.html">BA</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="o.html">Aa</a>
    </td>
    <td class="duration">
      0.000s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="test">
    <td class="test">
      <a href="g.html">VideoAds</a>
    </td>
    <td class="duration">
      0.390s
    </td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

        <td class="zero number">0</td>

    <td class="passRate">
            N/A
          </td>
  </tr>

  <tr class="suite">
    <td colspan="2" class="totalLabel">Total</td>

        <td class="zero number">271</td>

        <td class="zero number">0</td>

        <td class="zero number">3</td>

    <td class="passRate suite">
            98%
          </td>

  </tr>

Output:

I want to take the tr block with the class "suite" (at the end of the input), pull the values of its three "zero number" cells and its "passRate suite" cell, and print them. E.g.:

    Zero number = 271 ...
    Pass rate = 98%

Here is what I tried with lxml:

    from lxml.html import parse  # assumed import; the original snippet omitted it

    tree = parse(HTML_FILE)
    tds = tree.xpath("//tr[@class='suite']//td/text()")
    val = map(str.strip, tds)

This works locally, but I really want a solution without any external dependencies. Should I use strip(), or open the file after checking it with os.path.isfile()? I may be off track, so please advise or walk me through a way to do this.


Solution

  • For a single element you could use the re module or even plain string functions.

    data = '''<tr class="test">
    <td class="test">
    <a href="no.html">track</a></td>
    <td class="duration">0.390s</td>
    <td class="zero number">0</td>
    <td class="zero number">0</td>
    <td class="zero number">0</td>
    <td class="passRate">N/A</td></tr>
    
    <tr class="suite">
    <td colspan="2" class="totalLabel">Total</td>
    <td class="passed number">271</td>
    <td class="zero number">0</td>
    <td class="failed number">3</td>
    <td class="passRate suite">98%</td>
    </tr>'''
    
    # re module
    
    import re
    
    print(re.search(r'suite">(\d+)%', data).group(1))
    
    # string functions
    
    before = 'passRate suite">'
    after  = '%'
    start = data.find(before) + len(before)
    stop  = data.find(after, start)
    
    print(data[start:stop])
    

    EDIT: to get the other values with re

    import re

    print('passed:', re.search(r'passed number">(\d+)', data).group(1))
    # note: this matches the first "zero number" cell in data (the test row)
    print('zero:', re.search(r'zero number">(\d+)', data).group(1))
    print('failed:', re.search(r'failed number">(\d+)', data).group(1))
    print('Rate:', re.search(r'suite">(\d+)', data).group(1))

    passed: 271
    zero: 0
    failed: 3
    Rate: 98
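
The regexes above depend on the distinct class names (passed number, failed number) in the answer's sample. In the question's input all three cells carry the class "zero number", so they can only be told apart by position inside the "suite" row. A minimal sketch using the standard library's html.parser, which needs no external dependency, might look like the following. The names SuiteRowParser and suite_totals are illustrative (not from the original post), and the sample string is a shortened copy of the question's input.

```python
from html.parser import HTMLParser


class SuiteRowParser(HTMLParser):
    """Collect the text of every <td> inside <tr class="suite">."""

    def __init__(self):
        super().__init__()
        self.in_suite_row = False  # currently inside <tr class="suite">?
        self.in_cell = False       # currently inside a <td> of that row?
        self.cells = []            # raw text of each <td>, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr":
            self.in_suite_row = "suite" in classes
        elif tag == "td" and self.in_suite_row:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            self.in_suite_row = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def suite_totals(html_text):
    """Return the stripped cell texts of the suite row (hypothetical helper)."""
    parser = SuiteRowParser()
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Shortened copy of the question's input: every number cell is "zero number".
sample = """
<tr class="test">
  <td class="test"><a href="g.html">VideoAds</a></td>
  <td class="duration">0.390s</td>
  <td class="zero number">0</td>
  <td class="zero number">0</td>
  <td class="zero number">0</td>
  <td class="passRate">N/A</td>
</tr>
<tr class="suite">
  <td colspan="2" class="totalLabel">Total</td>
  <td class="zero number">271</td>
  <td class="zero number">0</td>
  <td class="zero number">3</td>
  <td class="passRate suite">
        98%
  </td>
</tr>
"""

label, first, second, third, pass_rate = suite_totals(sample)
print(first, second, third, pass_rate)
```

To run this against the report on disk, read the file with the built-in open() and pass its text to suite_totals. Because the cells are identified by position, this sketch assumes the suite row always lists the label first, then the three numbers, then the pass rate.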