soup: extract all paragraphs with a specific class excluding those that are in tables

I have a messy old Word document of multiple-choice questions (MCQs) that I converted to HTML so that I can extract the questions cleanly and use them to create a Microsoft Form.

The question sets that I want to extract the MCQs from can be obtained here.

What I want is to convert this file into something that looks like this (here).

I wrote the following code to extract the paragraphs I need, but it also extracts the paragraphs inside the tables, which gets in the way of building one list of questions and one list of candidate answers for each question. My code so far is as follows:

from bs4 import BeautifulSoup
import os
from nltk.tokenize import RegexpTokenizer

# Pick the first .htm/.html file in the CWD
file = [x for x in os.listdir() if '.htm' in x][0]

# Create a soup to parse information
with open(file) as f:
    soup = BeautifulSoup(f, "html.parser")

# Find all paragraph elements that contain the required information
results = soup.find_all("p", class_="MsoNormal")

# Tokenizer used to count the words in each paragraph
tokenizer = RegexpTokenizer(r'\w+')

# Extract questions: keep paragraphs with more than one word
Extract_questions = [x.text for x in results if len(tokenizer.tokenize(x.text)) > 1]

Could you please help me create the .docx file that I want? I really do not know where to start.

Solution

This is by no means complete code, but it can give you a start:

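from io import StringIO
from textwrap import wrap

import pandas as pd
from bs4 import BeautifulSoup


with open("page.html", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

# Select only top-level paragraphs and tables; paragraphs nested inside
# tables are not direct children of the div, so they are skipped
results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")

# Start a new group at every paragraph that begins with "Question "
groups = [group := []]
for r in results:
    if r.text.startswith("Question "):
        groups.append(group := [r])
    else:
        group.append(r)

for g in groups:
    for p in g:
        if p["class"] == ["MsoNormalTable"]:
            # StringIO avoids the warning recent pandas versions emit
            # when read_html is given a literal HTML string
            df = pd.read_html(StringIO(str(p)))[0].fillna("")
            print()
            print(df.to_csv(index=False, header=None, sep="\t"))
        else:
            t = p.get_text(strip=True).replace("\n", " ").strip()
            # Skip empty lines, question headers and lesson/reference labels
            if (
                t
                and "Question " not in t
                and "L1EC" not in t
                and "Lesson " not in t
            ):
                print("\n".join(wrap(t, 70)))
    print("-" * 80)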

Prints:

--------------------------------------------------------------------------------
The price of ABC Financial News is increased from $2.00 to $2.50; this
leads to an increase in the sales of a competing financial
magazine, XYZ Finance, which now sells 120,000 copies a week, up from
100,000 copies a week. The cross-price elasticity of demand is closest
to:

        0.8
        1.22
        1.25

--------------------------------------------------------------------------------
The following table lists the market shares of three major firms in an
industry. The industry's three-firm Herfindahl-Hirschman Index
is closest to:

Firms   Market Share
X       20%
Y       30%
Z       10%


        0.14
        0.33
        0.6

--------------------------------------------------------------------------------
Over a period of 1 year, a country’s real GDP increases from $168
billion to $179 billion, and the GDP deflator increases from 115 to
122.
The increase in the country’s nominal GDP over the year is closest to:

        6.55%
        13.03%
        4.34%

--------------------------------------------------------------------------------
Consider the following statements:
Statement 1: A government is said to have a trade deficit if its
expenditure exceeds net taxes.
Statement 2: An economy must finance a trade deficit by borrowing from
the rest of the world.
Which of the following is most likely?

        Only Statement 1 is incorrect.
        Only Statement 2 is incorrect.
        Both statements are correct.

--------------------------------------------------------------------------------
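
To address the title question directly: in the code above it is the child combinator in `body > div > .MsoNormal` that keeps the table paragraphs out, because a `<p>` inside a table cell is not a direct child of the `<div>`. If your HTML is nested differently, a more general approach is to drop every paragraph that has a `<table>` ancestor. The following is only a minimal sketch: the HTML is a made-up sample (not your file), and `paragraphs_outside_tables` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics Word's HTML export (an assumption, not the real file)
SAMPLE_HTML = """
<html><body><div class="WordSection1">
<p class="MsoNormal">Question 1</p>
<p class="MsoNormal">Which firm has the largest share?</p>
<table class="MsoNormalTable">
  <tr><td><p class="MsoNormal">X</p></td><td><p class="MsoNormal">20%</p></td></tr>
</table>
<p class="MsoNormal">Answer A</p>
</div></body></html>
"""


def paragraphs_outside_tables(soup, class_name="MsoNormal"):
    """Return the text of <p class=class_name> tags that have no <table> ancestor."""
    return [
        p.get_text(strip=True)
        for p in soup.find_all("p", class_=class_name)
        if p.find_parent("table") is None  # None means the tag is not inside a table
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outside = paragraphs_outside_tables(soup)
print(outside)
```

On the sample, the two paragraphs inside the table cells are filtered out and only the three top-level paragraphs remain. The same filter could replace the `find_all` line in the question's code.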