
Gigya API can get hidden comments but not the visible ones


I ran into a very weird problem while trying to parse data from a JavaScript-rendered website. Maybe it's because I am not an expert in web development.

Here is what happened:

I am trying to get all the comments data from The Globe and Mail. If you check its source code, there is no way to parse the comments out of it with Python; everything is rendered with JavaScript.

However, there is a handy tool, the Gigya API, which can return all the comments from a JavaScript-rendered website: Gigya getComments method.

When I used these lines of code in a Python Scrapy spider, it returned all the comments:

data = {"categoryID": self.categoryID,
                "streamID": streamId,
                "APIKey": self.apikey,
                "callback": "foo",
                "threadLimit": 1000   # assume all the articles have no more then 1000 comments
                }
r =   urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
comments_lst = loads(r.read().decode("utf-8"))["comments"]
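
As an aside, because the request sets a callback parameter, the endpoint may wrap the response as JSONP, i.e. foo({...}), instead of plain JSON; whether Gigya actually does this for POST requests is an assumption worth checking only if loads() ever fails. A minimal sketch with a hypothetical helper parse_maybe_jsonp:

import json
import re

def parse_maybe_jsonp(body: str) -> dict:
    # Hypothetical helper: strip a foo(...) JSONP wrapper if one is
    # present, then parse the remaining JSON. Assumes the wrapper,
    # if any, looks like callbackName(<json>);
    match = re.match(r"^\s*\w+\((.*)\)\s*;?\s*$", body, re.DOTALL)
    if match:
        body = match.group(1)
    return json.loads(body)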

However, The Globe and Mail is updating its website, and all the comments posted before Nov. 28 have been hidden from the web for now. That's why on the sample URL I am showing here, you can only see 2 comments: they were posted after Nov. 28, and these 2 new comments have the new feature, the "React" button.

The weird thing is, right now when I run my code, I can get all those hundreds of hidden comments published before Nov. 28, but I cannot get the new comments that are visible on the website now.

I have tried all the Gigya comment-related methods and none of them worked; the other Gigya methods do not look helpful either...

Is there any way to solve this problem?

Or at least, do you know why I can get all the hidden comments but cannot get the visible new comments that have the new feature?


Solution

  • Finally, I solved the problem with the Python Selenium library; it's free and it's super cool.

    So, it seems that although we cannot see the content in the source code of a JavaScript-rendered website, the browser does build an HTML page from which we can parse the content.

    1. First of all, I installed Firebug on Firefox. With this add-on, I'm able to see the HTML page behind the URL, and it makes it very easy to locate the content: just search for keywords in Firebug.

    2. Then I wrote the code like this:

      from selenium import webdriver
      import time

      def main():
          comment_urls = [
              "http://www.theglobeandmail.com/opinion/a-fascists-win-americas-moral-loss/article32753320/comments/"
          ]

          for comment_url in comment_urls:
              driver = webdriver.Firefox()
              driver.get(comment_url)
              time.sleep(5)  # give the JavaScript time to render the page
              htmlSource = driver.page_source  # full rendered HTML, if you want to parse it separately
              # Click the reaction element so the counts get added to the DOM
              clk = driver.find_element_by_css_selector('div.c3qHyJD')
              clk.click()
              reaction_counts = driver.find_elements_by_class_name('c2oytXt')
              for rc in reaction_counts:
                  print(rc.text)
              driver.quit()  # close the browser before moving to the next URL

      if __name__ == "__main__":
          main()


    The data I am parsing here is content that cannot be found in the HTML page until you click the reaction image on the website. What makes Selenium super cool is that click() method: after you find an element you can click, just use this method, and the generated elements will appear in the HTML and become parsable. Super cool!
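
    As a small refinement of the sketch above, an explicit wait can replace the fixed time.sleep(5), blocking only until the elements actually exist. Selenium's WebDriverWait and expected_conditions are standard library features of the package, but the class names here are the same site-specific, obfuscated selectors as above and may change at any time, so treat this as a sketch under those assumptions:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC

      driver = webdriver.Firefox()
      driver.get(comment_url)  # comment_url as defined above
      wait = WebDriverWait(driver, 15)  # wait at most 15 seconds

      # Wait until the reaction toggle is clickable instead of sleeping blindly
      clk = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.c3qHyJD')))
      clk.click()

      # Wait for the counts that the click injects into the DOM
      reaction_counts = wait.until(
          EC.presence_of_all_elements_located((By.CLASS_NAME, 'c2oytXt')))
      for rc in reaction_counts:
          print(rc.text)
      driver.quit()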