I am trying to scrape the comment section content of this link: https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya
However, it is loaded dynamically with JavaScript through an XHR request. I have pinpointed the request with Chrome DevTools:
https://newcomment.detik.com/graphql?query={ search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "510762" } , {name: "news.site", terms: "cnn"} ]) { paging sorting counter counterparent profile hits { posisi hasAds results { id author content like prokontra status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } } }
Sorry for the bloat, but I have also found that the key to getting the comment section of a specific article in every request is this query-string parameter:
terms: "510762"
Unfortunately, I have not found a way to extract the required "terms" parameter from the page so that I can simulate the request for many different pages.
That is why I am opting for scrapy-splash with Splash. I have followed the accepted solution at this link: How can Scrapy deal with Javascript
However, the response that I get from the Scrapy SplashRequest still does not contain the JavaScript-loaded content (the comment section)! I have set up settings.py, run Splash in a Docker container as instructed, and modified my Scrapy spider to yield this way:
yield scrapy.Request(url, self.parse, meta={
    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 0.5},
    }
})
Is there some step that I'm missing or should I just give up and use Selenium for this? Thank you in advance.
You can get the article id by parsing the URL directly:
import re
url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"
articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
print(f"request for article {articleid}")
Note that the last group is the article id, here 510762.
Also, you can get it from the meta tag with name articleid:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find("meta", {"name":"articleid"})["content"])
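If you want to grab the id inside an existing Scrapy callback without adding BeautifulSoup as a dependency, a rough regex over the raw HTML can also work. This is only a sketch: the `extract_article_id` helper is my own name, and the pattern assumes the `name` attribute comes before `content` in the tag, which may not hold for every page, so prefer a real parser when robustness matters:

```python
import re

def extract_article_id(html: str):
    """Pull the id from a <meta name="articleid" content="..."> tag, if present."""
    m = re.search(r'<meta[^>]+name=["\']articleid["\'][^>]+content=["\'](\d+)["\']', html)
    return m.group(1) if m else None

# hypothetical snippet of the page's <head>, for illustration only
sample = '<head><meta name="articleid" content="510762"></head>'
print(extract_article_id(sample))
```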
If you go with the first solution, you don't need scraping at all to get the data if you know the URL. Here is an example that gets the comments:
import requests
import re
url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"
articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
print(f"request for article {articleid}")
query = """
{
  search(type: "comment", size: 10, page: 1, sort: "newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "%s"}, {name: "news.site", terms: "cnn"}]) {
    paging
    sorting
    counter
    counterparent
    profile
    hits {
      posisi
      hasAds
      results {
        id
        author
        content
        like
        prokontra
        status
        news
        create_date
        pilihanredaksi
        refer
        liker {
          id
        }
        reporter {
          id
          status_report
        }
        child {
          id
          child
          parent
          author
          content
          like
          prokontra
          status
          create_date
          pilihanredaksi
          refer
          liker {
            id
          }
          reporter {
            id
            status_report
          }
          authorRefer
        }
      }
    }
  }
}""" % articleid
r = requests.get("https://newcomment.detik.com/graphql",
                 params={"query": query})
results = r.json()
print(results["data"]["search"]["hits"]["results"])
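If you need more than the first ten comments, you can increment the `page` argument of the query until a page comes back empty. Here is a minimal sketch; `build_query` and `fetch_all_comments` are helper names I made up, and the stop-when-empty condition is an assumption — inspect the `paging` field of a real response to confirm the exact shape:

```python
import requests

def build_query(article_id: str, page: int, size: int = 10) -> str:
    """Build the GraphQL query string for one page of comments (trimmed field set)."""
    return (
        '{ search(type: "comment", size: %d, page: %d, sort: "newest", '
        'adsLabelKanal: "cnn_nasional", adsEnv: "desktop", '
        'query: [{name: "news.artikel", terms: "%s"}, '
        '{name: "news.site", terms: "cnn"}]) '
        '{ paging hits { results { id author content create_date } } } }'
        % (size, page, article_id)
    )

def fetch_all_comments(article_id: str, max_pages: int = 50):
    """Walk the pages until one returns no results (assumed stop condition)."""
    comments = []
    for page in range(1, max_pages + 1):
        r = requests.get("https://newcomment.detik.com/graphql",
                         params={"query": build_query(article_id, page)})
        results = r.json()["data"]["search"]["hits"]["results"]
        if not results:  # empty page -> no more comments
            break
        comments.extend(results)
    return comments
```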