HtmlUnit doesn't fully load page on YouTube


My program opens a YouTube video link and tries to get the comment box. I know how to locate it, but when I try to reach the div containing it, I get the loading div instead, so I'm assuming that the page is not fully loaded. I tried these solutions and none of them worked:

while(pagina.getFirstByXPath("//div[@id='comment-section-renderer']/div")
                           .toString().contains("loading")) {
    synchronized(pagina) {
        pagina.wait(2000);
    }
}

and the other way:

 cliente.waitForBackgroundJavaScript(100000);

The page is loaded after a Gmail sign-in, and I checked that the user is successfully logged in when the video page loads.

Here is the code of the method

public HtmlPage comentaVideo(String correo, String pass, String video, 
                             String comentario) throws ... {

    String url= "https://www.youtube.com"+video;
    HtmlPage pagina;
    HtmlDivision division;
    HtmlButton boton;
    HtmlTextInput input;

    pagina = cliente.getPage("https://www.youtube.com/watch?v=E2b9PiqobWg");

    boton = pagina.getFirstByXPath("//div[@id='yt-masthead-signin']/div/button"); 
    //press sign in button
    pagina = boton.click();

    pagina=iniciaSesion(correo,pass,pagina); //Login gmail (working)        

    System.out.println(pagina.getUrl().toString()); //just for debug

    //Trying to get the comment box div
    division = pagina.getFirstByXPath("//div[@id='comment-section-renderer']/div"); 

    //verifying that the div is correct
    System.out.println(division.toString()); 

    //some tests...
    pagina=division.click();

    boton= pagina.getFirstByXPath("//div[@id='comment-simplebox']/div/button[2]");
    pagina=boton.click();

    return pagina;

}

Now that I have identified the problem, this is the updated method, which is still not working:

public HtmlPage comentaVideo(String correo, String pass, String video, String comentario) throws FailingHttpStatusCodeException, MalformedURLException, IOException, ErrorSesionNoIniciada, InterruptedException{

        String url= "https://www.youtube.com"+video;
        HtmlPage pagina;
        HtmlDivision division;
        HtmlButton boton;
        HtmlTextInput input;

        pagina = cliente.getPage("https://www.youtube.com/watch?v=E2b9PiqobWg");

        boton = pagina.getFirstByXPath("//div[@id='yt-masthead-signin']/div/button");
        pagina = boton.click();

        pagina=iniciaSesion(correo,pass,pagina);        

        System.out.println(pagina.getUrl().toString());


        //Non-working part

        division = pagina.getFirstByXPath("//div[@id='comment-section-renderer']/div"); 


        boton = division.getFirstByXPath("//div[@id='comment-section-renderer']/div[2]/button"); //best comments button

        while(boton == null){ //while this button is not loaded
            ScriptResult sr=pagina.executeJavaScript("window.scrollBy(0,60000)");
            cliente.waitForBackgroundJavaScript(1000);
            pagina=(HtmlPage)sr.getNewPage();
            boton = division.getFirstByXPath("//div[@id='comment-section-renderer']/div[2]/button"); 
        }
        System.out.println(boton.toString());



        //just for testing
        division = pagina.getFirstByXPath("//div[@id='comment-section-renderer']/div"); 

        System.out.println(division.toString());
        pagina=division.click();


        boton= pagina.getFirstByXPath("//div[@id='comment-simplebox']/div/button[2]");
        pagina=boton.click();

        return pagina;

}

I also tried setting the inner page height to the maximum size. (The code has unused variables and throws clauses because it is just for testing; I will update it with the final version when I get the solution.)

EDIT 1: CHANGED THE WHILE LOOP CONDITION, STILL NOT WORKING


Solution

  • Looking at the YouTube page structure, it seems the AJAX request that loads the comments section is only triggered when you scroll down the page to the point where the section becomes visible. You may want to simulate the scrolling first, then rely on your loop, which waits for the "loading" string to disappear from the inner HTML of the container div.

    Also consider that this behaviour may change at any time when they roll out an update.

    EDIT:

    After checking with the Chrome inspector, it seems there are a lot more div elements containing the "loading" (sub)string even after the comment section is populated via AJAX. I'd suggest changing your condition to wait for a new expected string to appear, instead of waiting for "loading" to go away. For instance, you could search for "Top comments" (button text) or "Add a public comment..." (placeholder of the comment-posting textarea).
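
To illustrate the "wait for an expected string to appear" idea, here is a minimal sketch of a polling helper with a timeout, so the loop cannot spin forever if the comments never load. The class name MarkerWait, the method waitForAnyMarker and the simulated page in main are hypothetical and exist only for this example. The sketch deliberately does not call HtmlUnit, so it runs on its own.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper: polls a content source until an expected marker string shows up. */
final class MarkerWait {

    private MarkerWait() {}

    /**
     * Returns true as soon as the content contains any of the markers,
     * or false once timeoutMs has elapsed (or the thread is interrupted).
     */
    static boolean waitForAnyMarker(Supplier<String> content, List<String> markers,
                                    long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            String current = content.get();
            if (current != null) {
                for (String marker : markers) {
                    if (current.contains(marker)) {
                        return true;
                    }
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping forever
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated page: a spinner for the first two reads, then the comment section.
        int[] reads = {0};
        Supplier<String> fakePage = () -> ++reads[0] < 3
                ? "<div>Loading...</div>"
                : "<button>Top comments</button>";

        boolean loaded = waitForAnyMarker(fakePage,
                List.of("Top comments", "Add a public comment..."), 2_000, 50);
        System.out.println("comments loaded: " + loaded);
    }
}
```

In the question's code, the supplier could be something like () -> pagina.asXml(), called after the window.scrollBy script has run, and the Thread.sleep call could be swapped for cliente.waitForBackgroundJavaScript(pollMs) so that pending AJAX jobs get a chance to finish between checks. That wiring is an assumption and has not been tested against YouTube. The marker strings are the ones mentioned above and may change when the page is updated.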