Tags: python-3.x, web-scraping, scrapy, scrapy-splash

Scrapyjs + Splash does not retrieve dynamically loaded content from XHR Requests


I am trying to scrape the comment section of this article: https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya

However, it is dynamically loaded with JavaScript through an XHR request. I have pinpointed the request with Chrome DevTools:

https://newcomment.detik.com/graphql?query={ search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "510762" } , {name: "news.site", terms: "cnn"} ]) { paging sorting counter counterparent profile hits { posisi hasAds results { id author content like prokontra status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } } }

It's bloated, sorry, but I have also found that the key to getting the comment section of a specific article on every request is this query string parameter:

terms: "510762"

Unfortunately, I have not found a way to scrape the required "terms" parameter from the page so that I can simulate the request for many different pages.

That is why I am opting for Scrapyjs & Splash. I have followed the accepted solution at this link: How can Scrapy deal with Javascript

However, the response that I get from Scrapy's SplashRequest still does not contain the JavaScript-loaded content (the comment section)! I have set up settings.py, run Splash in a Docker container as instructed, and modified my Scrapy spider to yield this way:

            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

Is there some step that I'm missing, or should I just give up and use Selenium for this? Thank you in advance.


Solution

  • You can get the article id by parsing the URL directly:

    import re
    
    url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"
    # Use a raw string so the backslashes in the pattern are not treated as escapes
    articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
    print(f"request for article {articleid}")
    

    Note that the last numeric group is the article id, here 510762.
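    If you need this for many different pages, the regex can be wrapped in a small helper with a guard for URLs that don't match. This is a sketch; `article_id_from_url` is a name introduced here, not part of the original code:

```python
import re

def article_id_from_url(url):
    """Return the article id from a CNN Indonesia article URL.

    The path segment looks like <timestamp>-<channel>-<articleid>,
    e.g. 20200607164937-20-510762, so the id is the third number.
    """
    match = re.search(r"(\d+)-(\d+)-(\d+)", url)
    if match is None:
        raise ValueError(f"no article id found in {url!r}")
    return match.group(3)

url = ("https://www.cnnindonesia.com/nasional/20200607164937-20-510762/"
       "risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")
print(article_id_from_url(url))  # 510762
```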

    You can also get it from the meta tag named articleid:

    from bs4 import BeautifulSoup
    import requests
    
    r = requests.get("https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")
    soup = BeautifulSoup(r.text, "html.parser")
    # The page exposes the id in <meta name="articleid" content="...">
    print(soup.find("meta", {"name": "articleid"})["content"])
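    If you'd rather not add a BeautifulSoup dependency, the standard library's html.parser can read the same tag. A minimal sketch; in practice you would feed it r.text from the requests call above instead of the short sample used here:

```python
from html.parser import HTMLParser

class MetaArticleId(HTMLParser):
    """Collects the content attribute of the <meta name="articleid"> tag."""

    def __init__(self):
        super().__init__()
        self.articleid = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") == "articleid":
                self.articleid = attrs.get("content")

parser = MetaArticleId()
# A short sample keeps the sketch self-contained; feed the real page HTML here.
parser.feed('<html><head><meta name="articleid" content="510762"></head></html>')
print(parser.articleid)  # 510762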
    

    If you go with the first solution, you don't need to scrape the page at all when you know the URL; you can query the GraphQL endpoint directly. Here is an example that gets the comments:

    import requests
    import re
    
    url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"
    
    # Use a raw string so the backslashes in the pattern are not treated as escapes
    articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
    print(f"request for article {articleid}")
    
    query = """
    { 
      search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "%s" } , {name: "news.site", terms: "cnn"} ]) { 
        paging 
        sorting 
        counter 
        counterparent 
        profile 
        hits { 
          posisi 
          hasAds 
          results { 
            id 
            author 
            content 
            like 
            prokontra 
            status 
            news 
            create_date 
            pilihanredaksi 
            refer 
            liker { 
              id 
            } 
            reporter { 
              id 
              status_report 
            } 
            child { 
              id 
              child 
              parent 
              author 
              content 
              like 
              prokontra 
              status 
              create_date 
              pilihanredaksi 
              refer 
              liker { 
                id 
              } 
              reporter { 
                id 
                status_report 
              } 
              authorRefer  
            }  
          }  
        }  
      }  
    }""" % articleid
    
    r = requests.get("https://newcomment.detik.com/graphql",
        params = {
            "query": query
        })
    
    results = r.json()
    
    print(results["data"]["search"]["hits"]["results"])
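    Since the original goal was to repeat this for many different pages, the pieces above can be combined into a helper that builds the GraphQL request URL from any article URL. This is a sketch: the reduced field selection is an assumption that the API accepts a smaller selection set, so fall back to the full query above if it rejects it.

```python
import re
from urllib.parse import urlencode

GRAPHQL_ENDPOINT = "https://newcomment.detik.com/graphql"

# A reduced selection set; field names are copied from the full query above.
QUERY_TEMPLATE = (
    '{ search(type: "comment", size: 10, page: %d, sort: "newest", '
    'adsLabelKanal: "cnn_nasional", adsEnv: "desktop", '
    'query: [{name: "news.artikel", terms: "%s"}, '
    '{name: "news.site", terms: "cnn"}]) '
    '{ paging hits { results { id author content create_date } } } }'
)

def comments_url(article_url, page=1):
    """Build the GraphQL URL for one page of comments on one article."""
    articleid = re.search(r"(\d+)-(\d+)-(\d+)", article_url).group(3)
    return GRAPHQL_ENDPOINT + "?" + urlencode(
        {"query": QUERY_TEMPLATE % (page, articleid)}
    )

url = ("https://www.cnnindonesia.com/nasional/20200607164937-20-510762/"
       "risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")
print(comments_url(url))  # pass this URL to requests.get or scrapy.Request
```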