Tags: python, web-scraping, beautifulsoup, data-extraction

Comments are visible on the webpage, but the HTML object returned by BeautifulSoup does not contain them


I am trying to extract the text of the comments on a web page from its URL, using BeautifulSoup for the scraping. The comments are visible when I open the URL in a browser, but the HTML object returned by BeautifulSoup does not contain the corresponding tags or text.

I used BeautifulSoup with 'html.parser'. I could extract the number of likes, views, and comments for the video on the page, but the content of the comment section was not in the parsed HTML. My browser is Chrome and my system is Ubuntu 18.04.1 LTS.

This is the code I used (in Python):

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage_link = "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"

try:
    page = urlopen(webpage_link)
except HTTPError as err:  # e.g. the webpage cannot be found
    print("ERROR! %s (%s)" % (webpage_link, err))
    raise  # without a response there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')

I expected the soup object to contain everything that is visible on the webpage, in particular the text of the comments (such as "Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for such documentary." and "WOOOW... amazing..."). However, I could not find the corresponding nodes in the soup object. Any help would be appreciated!


Solution

  • The comments are generated by JavaScript via an AJAX request, so they are not part of the HTML that urlopen downloads. You can send the same request yourself and read the comments from the JSON response. You can find the request in the Network tab of the browser's developer tools.

    from urllib.request import urlopen
    import json

    # Endpoint the page itself calls to load the comments (found in the Network tab)
    webpage_link = "https://www.airvuz.com/api/comments/video/59b2a56141ab4823e61ea901?page=1&limit=20"
    page = urlopen(webpage_link).read()
    comments_json = json.loads(page)
    for comment_info in comments_json['data']:
        print(comment_info['comment'].strip())

    Output

    Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for  such documentary.
    WOOOW... amazing...
    I've been photographing polar bears for years, but to see this footage from a drones perspective was epic! Well done and congratz on the Nominee! Well deserved.
    You are da man Florian!
    Absolutely outstanding!
    This is incredible
    jaw dropping
    This is wow amazing, love it.
    So cool! Did the bears react to the drone at all?
    Congratulations! It's awesome! I am watching in tears....
    Awesome!
    perfect video awesome
    It is very, very beautiful !!! Sincere congratulations
    Made my day, exquisite, thank you
    Wow
    Super!
    Marvelous!
    Man this is incredible!
    Material is good, but  edi is bad. This history about  beer's family...
    Muy bueno!
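
The request above only asks for the first 20 comments (page=1&limit=20). If you need more, something like the following might work. It is only a sketch: the function names (build_comments_url, extract_comments, fetch_all_comments) are made up for illustration, and it assumes that the page parameter simply advances through the results and that a page past the end returns an empty 'data' list, which I have not verified against the site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint prefix taken from the request shown above.
API_BASE = "https://www.airvuz.com/api/comments/video/"


def build_comments_url(video_id, page=1, limit=20):
    """Build the comments URL for one page of results."""
    return API_BASE + video_id + "?" + urlencode({"page": page, "limit": limit})


def extract_comments(payload):
    """Return the stripped comment texts from a decoded JSON payload."""
    return [item["comment"].strip() for item in payload.get("data", [])]


def fetch_all_comments(video_id, limit=20, max_pages=50):
    """Request successive pages until one comes back empty.

    Stopping on an empty page is an assumption about the API, and
    max_pages is a safety cap in case that assumption is wrong.
    """
    comments = []
    for page in range(1, max_pages + 1):
        with urlopen(build_comments_url(video_id, page, limit)) as response:
            payload = json.loads(response.read().decode("utf-8"))
        batch = extract_comments(payload)
        if not batch:
            break
        comments.extend(batch)
    return comments
```

Calling fetch_all_comments("59b2a56141ab4823e61ea901") would then collect the comment texts for the video in the question. Keeping extract_comments separate from the network call makes it easy to check the parsing against a saved JSON response without hitting the site.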