Search code examples
pythonpython-3.xbeautifulsouppython-requestspython-requests-html

Python Request HTML for BeautifulSoup Parsing


I have no idea it's working only when I right-click, copy the entire body as HTML into a file and parse it, but when I access it directly from the link via request, I get 0 results.

Ex.

Sample HTML

bsoup.py

When I use these codes

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')

for card in cards:
    for strong in card.findChildren('strong', recursive=True):
        print('================================================')
        print(strong.next_sibling)

Result: ✅Working, I can see 3 cover letters

➜  web_scrap git:(master) ✗ python3 bsoup.py
================================================

    hi!
I have experience with sending commands to Alexa.
The thing is that you'll need to send audio file (not text) to Alexa service.
Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
But it is possible

================================================

    Answer my questions please

================================================

    I am an Expert in Alexa development

bsoupv2.py

I've tried to use a request to a link instead.

URL Link updated

from bs4 import BeautifulSoup
import requests

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

url = "https://www.bunlongheng.com/raw/NGIwMTUzOTQtMTFlNS00NjYzLTk3MjEtYmU0Yzg5OGU1Yzc4"
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, 'html.parser')

soup = soup.find(['body'])
cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')

for card in cards:
    for strong in card.findChildren('strong', recursive=True):
        print('================================================')
        print(strong.next_sibling)

Result: ❌ Not Working, I can see 0 cover letter, nothing!

➜  web_scrap git:(master) ✗ python3 bsoupv2.py
➜  web_scrap git:(master) ✗

Solution

  • Try to html.unescape the response from the www.bulongheng.com:

    import html
    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Methods": "GET",
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Max-Age": "3600",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    }
    
    
    url = "https://www.bunlongheng.com/raw/NGIwMTUzOTQtMTFlNS00NjYzLTk3MjEtYmU0Yzg5OGU1Yzc4"
    req = requests.get(url, headers)
    soup = BeautifulSoup(html.unescape(req.text), "html.parser")
    
    cards = soup.select('#__nuxt [data-test="UpCLineClampV2"]')
    
    for card in cards:
        for strong in card.findChildren("strong", recursive=True):
            print("================================================")
            print(strong.next_sibling)
    

    Prints:

    ================================================
    
        hi!
    I have experience with sending commands to Alexa.
    The thing is that you'll need to send audio file (not text) to Alexa service.
    Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
    But it is possible
      
    ================================================
    
        Answer my questions please
      
    ================================================
    
        I am an Expert in Alexa development
      
    ================================================
    
        hi!
    I have experience with sending commands to Alexa.
    The thing is that you'll need to send audio file (not text) to Alexa service.
    Next thing is to decide what to do with response, which can have a lot of options: audio, playlist, audio question, etc.
    But it is possible
      
    ================================================
    
        Answer my questions please
      
    ================================================
    
        I am an Expert in Alexa development