python web-scraping beautifulsoup html-parsing

Select multiple elements with BeautifulSoup and manage them individually


I am using BeautifulSoup to parse a webpage of poetry. Each poem title is an h3 element, and each line of a poem has the class .line. I can select both kinds of element and add them to a list, but I want to convert each h3 to uppercase and follow it with a line break before it goes into the list of lines.

    linesArr = []
    for lines in full_text:
        booktitles = lines.select('h3')
        for booktitle in booktitles:
            linesArr.append(booktitle.text.upper())
            linesArr.append('')
        for line in lines.select('h3, .line'):
            linesArr.append(line.text)

This code appends all the book titles to the beginning of the list, then goes on to append the h3 and .line items in order. I have also tried checking each element inside the loop, like this:

    linesArr = []
    for lines in full_text:
        for line in lines.select('h3, .line'):
            if line.find('h3'):
                linesArr.append(line.text.upper())
                linesArr.append('')
            else:
                linesArr.append(line.text)

Solution

  • I'm not sure exactly what you are trying to do, but this way you get a list with the title in upper case followed by every line. Note that soup.find('h3') only returns the first h3 on the page:

    #!/usr/bin/python3
    # coding: utf8
    
    from bs4 import BeautifulSoup
    import requests
    
    page = requests.get("https://quod.lib.umich.edu/c/cme/CT/1:1?rgn=div2;view=fulltext")
    soup = BeautifulSoup(page.text, 'html.parser')
    
    title = soup.find('h3')
    full_lines = soup.find_all('div',{'class':'line'})
    
    linesArr = []
    linesArr.append(title.get_text().upper())
    for line in full_lines:
        linesArr.append(line.get_text())
    
    # Print full array with the title and text
    print(linesArr)
    
    # Print each entry; print() already ends with a newline, so the extra
    # '\n' leaves a blank line between entries
    for linea in linesArr:
        print(linea + '\n')
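
If the page has several titles and each one should stay in place among its lines, one possible variation is to keep the combined 'h3, .line' selector from the question and test the tag name of each match. In the question's second attempt, line.find('h3') looks for an h3 nested inside the element, so it returns None when the element is itself the h3. The sketch below compares element.name instead. The SAMPLE_HTML markup and the collect_lines helper are made up for illustration and stand in for the real page, so the structure of the actual site is an assumption here:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption,
# not copied from the site in the question).
SAMPLE_HTML = """
<div class="text">
  <h3>First Poem</h3>
  <div class="line">Line one</div>
  <div class="line">Line two</div>
  <h3>Second Poem</h3>
  <div class="line">Line three</div>
</div>
"""


def collect_lines(html):
    """Return titles (uppercased, followed by '') and lines in page order."""
    soup = BeautifulSoup(html, "html.parser")
    lines_arr = []
    # select() with a comma-separated selector yields matches in document
    # order, so titles and lines come back interleaved as they appear.
    for element in soup.select("h3, .line"):
        if element.name == "h3":
            # The element itself is the title: uppercase it and add an
            # empty entry to mark the line break.
            lines_arr.append(element.get_text(strip=True).upper())
            lines_arr.append("")
        else:
            lines_arr.append(element.get_text(strip=True))
    return lines_arr


result = collect_lines(SAMPLE_HTML)
print(result)
```

With the real page, the same loop could run over each item of full_text (lines.select('h3, .line')) in place of the soup object used here.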