I am using Jsoup to parse a YouTube page, but the content I get is not the same as what I get from the browser.
With Jsoup, the body starts with:
<iframe src="https://accounts.google.com/ServiceLogin?uilel=3&service=youtube&passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26feature%3Dpassive%26next%3D%252Fsignin_passive%26hl%3Dar&hl=ar" style="display: none"></iframe><!-- end of chunk -->
while in the browser the body starts with:
<body dir="rtl" >
<script >
if (window.ytcsi) {window.ytcsi.tick("bs", null, '');} ytcfg.set('initialBodyClientWidth', document.body.clientWidth);
window.ytcfg.set('SERVICE_WORKER_KILLSWITCH', false);
</script>
My code:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

fun main() {
    parse("https://www.youtube.com/playlist?list=PL2-FkZlJhxqVXZO1c6gKgsAdiet0zcOAO")
}

fun parse(baseLink: String) {
    // Fetch the page with Jsoup's default request headers
    val doc: Document = Jsoup.connect(baseLink).get()
    println("content : ${doc.body()}")
    // val items = doc.select("a.ytd-playlist-video-renderer")
    val items = doc.select("ytd-playlist-video-renderer")
    items.forEach {
        // print header
        println("Item : $it")
        val img = it.select("img#img")
        val imgLink = img.attr("src")
        println("Image : $img , Link $imgLink")
    }
}
Jsoup version 1.13.1
I don't have a specific answer, but speaking from experience, this behaviour is quite common with complex websites like YouTube. The web server returns different content depending on the headers sent with the request, and Jsoup does not send the same headers as your browser.
To debug this kind of issue, I suggest making the request in Chrome or Firefox with the Network tab of the developer tools open. Select the request for the URL, right-click it and choose Copy as cURL. Paste the command into a text editor, then run it in your terminal. It should return the same HTML as you see in your web browser. Next, remove the cookie header in the text editor and run the command again to see whether the response still contains the content you want. Then try removing the other headers one by one, aiming for the simplest curl command that still gives the correct response. You will find that some headers, if missing, cause YouTube to return something different. Once you cannot simplify the curl command any further, copy the remaining headers to Jsoup and you should find it behaves the same way. For example, you may need to do something like:
val doc: Document = Jsoup.connect(baseLink).header("Accept-Language", "en-GB,en;q=0.5").get()
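The remove-one-header-at-a-time process can also be scripted. The following is only a rough sketch under some assumptions: `parseCurlHeaders` and `minimizeHeaders` are hypothetical helper names (not part of Jsoup), the curl command uses the bash-style single quoting that browsers produce for "Copy as cURL", and the header values in `main` are made up for the demo. The check passed to `minimizeHeaders` is a stand-in here; in practice it would fetch the page with the candidate headers and test whether the element you need is present.

```kotlin
// Hypothetical helper: pulls the -H 'Name: value' pairs out of a
// "Copy as cURL" command (assumes bash-style single quotes).
fun parseCurlHeaders(curl: String): Map<String, String> {
    val pattern = Regex("""-H\s+'([^':]+):\s*([^']*)'""")
    return pattern.findAll(curl).associate { m ->
        m.groupValues[1].trim() to m.groupValues[2].trim()
    }
}

// Hypothetical helper: tries dropping each header in turn and keeps it
// dropped whenever the check still passes without it.
fun minimizeHeaders(
    headers: Map<String, String>,
    givesWantedPage: (Map<String, String>) -> Boolean
): Map<String, String> {
    var kept = headers
    for (name in headers.keys) {
        val candidate = kept - name
        if (givesWantedPage(candidate)) kept = candidate
    }
    return kept
}

fun main() {
    // Made-up header values, for illustration only.
    val curl = "curl 'https://www.youtube.com/playlist?list=PL2-FkZlJhxqVXZO1c6gKgsAdiet0zcOAO' " +
        "-H 'User-Agent: Mozilla/5.0' " +
        "-H 'Accept-Language: en-GB,en;q=0.5' " +
        "-H 'Cookie: a=b'"
    val all = parseCurlHeaders(curl)

    // Stand-in check: pretend only Accept-Language matters. A real check
    // would request the page with these headers and inspect the response.
    val minimal = minimizeHeaders(all) { it.containsKey("Accept-Language") }
    println(minimal)
}
```

The resulting map could then be handed to Jsoup in one call, for example `Jsoup.connect(baseLink).headers(minimal).get()`, instead of chaining one `header(...)` call per entry.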