I'm learning JSoup in Java, and I want to scrape the comments and the names of the people commenting from a YouTube video.
I chose an arbitrary YouTube video and inspected the elements of interest. I've looked at https://jsoup.org/cookbook/extracting-data/selector-syntax and at the question "Why Exception Raised while Retrieving the Youtube Elements using iterator in Jsoup?", but I don't really understand how to use the syntax.
Currently, the output of my code is two empty lists. I want the output to be one list with the comments and another list with the names of the commenters.
Thanks for any help!
import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FirstJsoupExample {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Document page = Jsoup.connect("https://www.youtube.com/watch?v=C33Rw0AA3aU").get();

        // Comments
        Elements Comments = page.select("yt-formatted-string[class=style-scope ytd-comment-renderer]");
        ArrayList<String> CommentsList = new ArrayList<String>();
        for (Element comment : Comments) {
            CommentsList.add("Comment: " + comment.text());
        }

        // Commentators
        Elements Comentators = page.select("span[class= style-scope ytd-comment-renderer]");
        ArrayList<String> ComentatorList = new ArrayList<String>();
        for (Element comentator : Comentators) {
            ComentatorList.add("Comentator: " + comentator.text());
        }

        System.out.println(ComentatorList);
        System.out.println(CommentsList);
    }
}
The comments are not in the HTML file. YouTube uses JavaScript to load the comments after the page has loaded, but JSoup does not execute JavaScript; it can only read the HTML that the server sends.
Your web browser's developer tools show you what is currently in the webpage, which may be different from what is in the HTML file.
To view the HTML file, you can open the YouTube page in your browser, then right-click and choose 'View Page Source', or go to this URL:
view-source:https://www.youtube.com/watch?v=C33Rw0AA3aU
Then you will be able to confirm that the source does not contain yt-formatted-string or ytd-comment-renderer.
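The same check can be done from Java without a browser. The following is only a rough sketch, assuming Java 11 or later; the class name RawHtmlCheck and the helpers findMarkers and fetch are made up for this illustration. It downloads the raw response body with the standard HttpClient, which, like JSoup, runs no JavaScript, and reports whether each marker string occurs in it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of JSoup or of the original question.
class RawHtmlCheck {

    // Reports, for each marker, whether it occurs anywhere in the raw HTML text.
    static Map<String, Boolean> findMarkers(String html, List<String> markers) {
        Map<String, Boolean> found = new LinkedHashMap<>();
        for (String marker : markers) {
            found.put(marker, html.contains(marker));
        }
        return found;
    }

    // Downloads the response body as sent by the server; no JavaScript is executed.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String html = fetch("https://www.youtube.com/watch?v=C33Rw0AA3aU");
        // Prints a map from each marker to true/false.
        System.out.println(findMarkers(html, List.of("yt-formatted-string", "ytd-comment-renderer")));
    }
}
```

If a marker is reported as absent here, a JSoup selector that depends on it has nothing to match in the fetched document.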
YouTube probably does this for two reasons:
My suggestion is to choose a different website to learn JSoup with.
I confirmed that the selectors below DO work if you:
Note that if you use the class= form, the class value to select must be in quotes.
Comments:
document.querySelectorAll("yt-formatted-string[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("yt-formatted-string.style-scope.ytd-comment-renderer");
Commentators:
document.querySelectorAll("span[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("span.style-scope.ytd-comment-renderer");
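For practising the selector syntax in JSoup itself, the same two forms can be tried against a static HTML string instead of a live page. This is a minimal sketch, assuming JSoup is on the classpath; the sample markup, the names in it, and the class SelectorSyntaxDemo are invented for illustration and are not real YouTube output.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the HTML below only imitates the rendered comment elements.
class SelectorSyntaxDemo {

    static final String SAMPLE_HTML =
            "<div>"
          + "<span class=\"style-scope ytd-comment-renderer\">Alice</span>"
          + "<yt-formatted-string class=\"style-scope ytd-comment-renderer\">Nice video!</yt-formatted-string>"
          + "<span class=\"style-scope ytd-comment-renderer\">Bob</span>"
          + "<yt-formatted-string class=\"style-scope ytd-comment-renderer\">Thanks for sharing.</yt-formatted-string>"
          + "</div>";

    // Returns the text of every element matched by the CSS query.
    static List<String> texts(Document doc, String cssQuery) {
        return doc.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);

        // Attribute form: the whole class attribute value, in quotes.
        System.out.println(texts(doc, "yt-formatted-string[class=\"style-scope ytd-comment-renderer\"]"));
        System.out.println(texts(doc, "span[class=\"style-scope ytd-comment-renderer\"]"));

        // Class-chain form: each class prefixed with a dot, no spaces.
        System.out.println(texts(doc, "yt-formatted-string.style-scope.ytd-comment-renderer"));
        System.out.println(texts(doc, "span.style-scope.ytd-comment-renderer"));
    }
}
```

The class-chain form is usually the more robust choice, since it still matches if the element carries additional classes or lists them in a different order, whereas the attribute form compares the whole attribute value.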