html python-3.x pandas beautifulsoup html-parsing

Sorting out HTML Table in Python (counting rows correctly)

I want to count how often each value occurs in the columns Change, Status, and Req._Type of an HTML table. For example, NEW OBJECT occurs twice.

The Change column has values: NEW OBJECT, OBJECT DELETED, Attribute "Object Text" Changed, Attribute "Object Heading" Changed

The Status column has values: In Review (or alternative made up values)

The Req._Type column has values: functional Req., Info., Überschrift (or alternative made-up values)

What I have tried so far (repl.it has a good online IDE):

#!/usr/bin/python

import re

from bs4 import BeautifulSoup

with open('Test2.html', 'rb') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')

    strings = soup.find_all(string=re.compile('NEW OBJECT'))
    strings2 = soup.find_all(string=re.compile('OBJECT DELETED'))
    strings3 = soup.find_all(string=re.compile('Attribute "Object Text" Changed'))
    strings4 = soup.find_all(string=re.compile('Attribute "Object Heading" Changed'))
    strings5 = soup.find_all(string=re.compile("Info."))
    strings6 = soup.find_all(string=re.compile("functional Req."))
    strings7 = soup.find_all(string=re.compile('Überschrift'))
    strings8 = soup.find_all(string=re.compile('In Review'))
    print(len(strings))
    print(len(strings2))
    print(len(strings3))
    print(len(strings4))
    print(len(strings5))
    print(len(strings6))
    print(len(strings7))
    print(len(strings8))
    #strings3 = soup.find_all(string=re.compile('Changed'))
    #print(strings3)

    #for txt in strings3:
        #print(' '.join(txt.split()))

    #for tag in soup.find_all('th'):
    #    print(f'{tag.name}: {tag.text}')
    
    for tag in soup.find_all('td'):
        # Checks every cell, regardless of which column it belongs to
        if 'Info.' in tag.text:
            print("Found!")
            #print(soup.select('b:nth-of-type(3)'))
        else:
            print("Not found!")

Corresponding output:

2
1
1
0
3
1
0
3
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Found!
Not found!
Not found!
Found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!

My attempt is not dynamic and ignores the column structure: it uses find_all to match hard-coded expressions anywhere in the document and counts the matches, so a value is also counted when it appears outside the intended column.

a) How can I make this dynamic, so that only the three columns mentioned are considered and I get a counter for every distinct value in each of them? In the example above, "Info." is found three times, although the correct count is two. The same needs to work for every value of the three columns.

b) How can I output the counters for combined filters, for example NEW OBJECT & functional Req. (=0), NEW OBJECT & Info. (=1), OBJECT DELETED & functional Req. (=1)? I tried several approaches but couldn't get any of them working.

c) Optional question: The Status or, alternatively, the Req._Type column can hold different values depending on how the table is defined, so the values are not fixed. Can the unique values be extracted first (into an array or list) and then counted, so that I know how often each unique value occurs in the affected column?

Solution

I'm not sure I understand your second and third questions (and, as SO policy requires, you should post each question separately anyway), but here is how to approach the first one; it may also help you with the rest.

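from io import StringIO

import pandas as pd

ht = """[your html]"""
targets = ['Change', 'Status', 'Req._Type']
# read_html returns one DataFrame per <table>; index 1 selects the second table.
# Passing a raw HTML string is deprecated in recent pandas, hence the StringIO wrapper.
df = pd.read_html(StringIO(ht))[1]

for target in targets:
    # value_counts() tallies every distinct value found in the column
    print(df[target].value_counts())
    print('---')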

Output:

NEW OBJECT                         2
Attribute "Object Text" Changed    1
OBJECT DELETED                     1
Name: Change, dtype: int64
---
In Review    3
Name: Status, dtype: int64
---
Info.              2
functional Req.    1
Name: Req._Type, dtype: int64
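
For the combined filters in (b), one possible approach is a boolean mask over two columns, with pd.crosstab giving the overview of all pairs at once. The sketch below is only an illustration: since the actual HTML is not shown, the DataFrame is a made-up stand-in whose rows were chosen to agree with the counts quoted in the question, and pair_count is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table (the real HTML is not shown);
# the rows are made up to be consistent with the counts quoted in the question.
df = pd.DataFrame({
    "Change": ["NEW OBJECT", "NEW OBJECT", "OBJECT DELETED",
               'Attribute "Object Text" Changed'],
    "Status": ["In Review", None, "In Review", "In Review"],
    "Req._Type": ["Info.", None, "functional Req.", "Info."],
})


def pair_count(frame, change, req_type):
    """Count the rows where both columns hold the given values."""
    mask = (frame["Change"] == change) & (frame["Req._Type"] == req_type)
    return int(mask.sum())


# Question (b): counters for specific combinations
print(pair_count(df, "NEW OBJECT", "functional Req."))
print(pair_count(df, "NEW OBJECT", "Info."))
print(pair_count(df, "OBJECT DELETED", "functional Req."))

# Overview of all Change / Req._Type combinations (rows with missing values are skipped)
pairs = pd.crosstab(df["Change"], df["Req._Type"])
print(pairs)

# Question (c): the distinct values come from the data, nothing is hard-coded
status_counts = df["Status"].value_counts().to_dict()
print(status_counts)
```

On this sample data the three pair counts come out as 0, 1 and 1, matching the numbers expected in the question. As for (c), value_counts() already discovers the distinct values on its own, so the loop in the answer keeps working when the Status or Req._Type values change.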