Tags: python-3.x, beautifulsoup, tokenize

BeautifulSoup: extract all paragraphs with a specific class, excluding those that are in tables


I have a messy old Word document of multiple-choice questions (MCQs) that I converted to HTML so that I can extract the questions cleanly and use them to create a Microsoft Form.

The question sets that I want to extract the MCQs from can be obtained here.

What I want is to convert this file into something that looks like this (here).


I wrote the following code to extract the paragraphs I need, but it also extracts the paragraphs inside the tables, which gets in the way of building one list of questions and one list of candidate answers for each question. My code so far is as follows:

from bs4 import BeautifulSoup
import os
from nltk.tokenize import RegexpTokenizer

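# Pick the first .htm/.html file in the current working directory
file = [x for x in os.listdir() if '.htm' in x][0]

# Create a soup to parse information (the file is closed once parsing is done)
with open(file) as f_in:
    soup = BeautifulSoup(f_in, "html.parser")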

# Find all paragraph elements that contain the required information
results = soup.find_all("p", class_="MsoNormal")

# Tokenizer used to count the words in each paragraph
tokenizer = RegexpTokenizer(r'\w+')

# Extract questions: keep only paragraphs with more than one word
Extract_questions = [x.text for x in results if len(tokenizer.tokenize(x.text)) > 1]

Could you please help me create the .docx file that I want? I really do not know where to start.


Solution

  • This is by no means complete code, but it can give you a start:

    import pandas as pd
    from io import StringIO
    from bs4 import BeautifulSoup
    from textwrap import wrap


    with open("page.html", "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")

    # Select only top-level paragraphs and tables: the child combinator (>)
    # skips the MsoNormal paragraphs that are nested inside table cells
    results = soup.select("body > div > .MsoNormal, body > div > .MsoNormalTable")

    # Start a new group each time a "Question ..." paragraph is seen
    groups = [group := []]
    for r in results:
        if r.text.startswith("Question "):
            groups.append(group := [r])
        else:
            group.append(r)

    for g in groups:
        for p in g:
            if p["class"] == ["MsoNormalTable"]:
                # Wrap the markup in StringIO: newer pandas versions
                # deprecate passing literal HTML strings to read_html
                df = pd.read_html(StringIO(str(p)))[0].fillna("")
                print()
                print(df.to_csv(index=False, header=None, sep="\t"))
            else:
                t = p.get_text(strip=True).replace("\n", " ").strip()
                # Skip empty paragraphs and header lines
                if (
                    t
                    and "Question " not in t
                    and "L1EC" not in t
                    and "Lesson " not in t
                ):
                    print("\n".join(wrap(t, 70)))
        print("-" * 80)
    

    Prints:

    --------------------------------------------------------------------------------
    The price of ABC Financial News is increased from $2.00 to $2.50; this
    leads to an increase in the sales of a competing financial
    magazine, XYZ Finance, which now sells 120,000 copies a week, up from
    100,000 copies a week. The cross-price elasticity of demand is closest
    to:
    
            0.8
            1.22
            1.25
    
    --------------------------------------------------------------------------------
    The following table lists the market shares of three major firms in an
    industry. The industry's three-firm Herfindahl-Hirschman Index
    is closest to:
    
    Firms   Market Share
    X       20%
    Y       30%
    Z       10%
    
    
            0.14
            0.33
            0.6
    
    --------------------------------------------------------------------------------
    Over a period of 1 year, a country’s real GDP increases from $168
    billion to $179 billion, and the GDP deflator increases from 115 to
    122.
    The increase in the country’s nominal GDP over the year is closest to:
    
            6.55%
            13.03%
            4.34%
    
    --------------------------------------------------------------------------------
    Consider the following statements:
    Statement 1: A government is said to have a trade deficit if its
    expenditure exceeds net taxes.
    Statement 2: An economy must finance a trade deficit by borrowing from
    the rest of the world.
    Which of the following is most likely?
    
            Only Statement 1 is incorrect.
            Only Statement 2 is incorrect.
            Both statements are correct.
    
    --------------------------------------------------------------------------------
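
If only the filtering step from the question is needed, a minimal sketch is to drop every paragraph that has a `<table>` ancestor. This assumes the exported HTML marks question text as `p.MsoNormal` and wraps the answer options in tables, as the code above does. `SAMPLE_HTML` and `paragraphs_outside_tables` are hypothetical names used for illustration only; the sample markup is made up and stands in for the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Word-exported HTML.
SAMPLE_HTML = """
<html><body><div>
<p class="MsoNormal">Question 1</p>
<p class="MsoNormal">What is 2 + 2?</p>
<table class="MsoNormalTable">
  <tr><td><p class="MsoNormal">3</p></td></tr>
  <tr><td><p class="MsoNormal">4</p></td></tr>
</table>
</div></body></html>
"""


def paragraphs_outside_tables(html):
    """Return the text of MsoNormal paragraphs that are not nested in a table."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p", class_="MsoNormal")
        # find_parent returns None when there is no enclosing <table>
        if p.find_parent("table") is None
    ]


print(paragraphs_outside_tables(SAMPLE_HTML))
```

Unlike the `body > div > ...` selector, this check does not depend on how deeply the paragraphs are nested, so it may be easier to adapt if the document structure differs from the one assumed here.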