Tags: kotlin, jsoup, html-parsing

Use getElementsByClass to find <div> elements by class name nested inside a <p> element


I am writing a parser in Kotlin using Jsoup.

I need to get the inner text of a tag with class "ptrack-content" inside the tag with class "titleCard-synopsis".

When I call getElementsByClass on an element that was itself returned by an earlier getElementsByClass call, I get 0 elements.

Code:

import org.jsoup.Jsoup

class NetlifxHtmlParser {

    val html = """
         <div class="titleCardList--metadataWrapper">
            <div class="titleCardList-title"><span class="titleCard-title_text">Map Her</span><span><span class="duration ellipsized">50m</span></span></div>

            <p class="titleCard-synopsis previewModal--small-text">
            <div class="ptrack-content">A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.</div>
            </p>

         </div>
         
          <div class="titleCardList--metadataWrapper">
             <div class="titleCardList-title"><span class="titleCard-title_text">Renaissance Titties</span><span><span class="duration ellipsized">50m</span></span></div>
             <p class="titleCard-synopsis previewModal--small-text">
             <div class="ptrack-content">Amerie, the new outcast, receives a party invitation that gives her butterflies. But when she manages to show up, a bitter surprise awaits.</div>
             </p>
          </div>
    """.trimIndent()

    fun parseEpisode() {
        val doc = Jsoup.parseBodyFragment(html)
        val titleCards = doc.getElementsByClass("titleCard-synopsis")
        println("Episode: count titleCard = > ${titleCards.count()}") // 2

        titleCards.forEachIndexed { index, element ->
            val ptrack = element.getElementsByClass("ptrack-content")
            println("Episode: count ptrack = > ${ptrack.count()}") // 0 !!
            println("inner html = > ${ptrack.html()}") // empty string !!

        }

    }
}

In the code above:

First, I extract the tags with class titleCard-synopsis.

For that, I use doc.getElementsByClass("titleCard-synopsis"), which returns 2 elements.

Then, for each titleCard element, I try to extract the elements that have the class ptrack-content by calling the same getElementsByClass on that element, which returns an empty list.

Why is this happening?

My goal is to extract the description text for each title, which is stored in the tags inside the p tag with class titleCard-synopsis.

If I query "ptrack-content" directly, it works fine, but this is a general class used in many places in the full HTML source (the HTML above is only a snippet). That is why I need the tag with class "ptrack-content" that is inside the tag with class "titleCard-synopsis", and with the approach in the code I only get an empty list.

Also note that if I call the html() method on an element of titleCards (as with ptrack.html()), I get an empty string rather than the inner div tag.

Please guide me to resolve the issue!


Solution

  • TL;DR

    I need to get the inner text of a tag with class "ptrack-content" inside the tag with class "titleCard-synopsis"

    I'm not really familiar with Kotlin, but this should produce the desired output:

    val doc = Jsoup.parseBodyFragment(html)
    val result = doc.select(".titleCard-synopsis + .ptrack-content")
    
    result.forEach { element ->
        println(element.html())
    }
    

    Live example
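
If you also need each synopsis paired with its episode title, the same selector can be applied once per wrapper element. This is only a sketch under assumptions: the helper name extractSynopses and the returned list of (title, synopsis) pairs are made up for illustration, and the class names are taken from the snippet in the question.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper (not part of jsoup): one (title, synopsis) pair per wrapper.
fun extractSynopses(html: String): List<Pair<String, String>> {
    val doc = Jsoup.parseBodyFragment(html)
    return doc.select(".titleCardList--metadataWrapper").map { wrapper ->
        val title = wrapper.selectFirst(".titleCard-title_text")?.text().orEmpty()
        // "+" matches a .ptrack-content element that directly follows the <p> as a sibling.
        val synopsis = wrapper
            .selectFirst(".titleCard-synopsis + .ptrack-content")
            ?.text()
            .orEmpty()
        title to synopsis
    }
}

fun main() {
    // Shortened, made-up version of the markup from the question.
    val html = """
        <div class="titleCardList--metadataWrapper">
          <div class="titleCardList-title"><span class="titleCard-title_text">Map Her</span></div>
          <p class="titleCard-synopsis"><div class="ptrack-content">A hidden map rocks Hartley High.</div></p>
        </div>
    """.trimIndent()
    extractSynopses(html).forEach { (title, synopsis) -> println("$title: $synopsis") }
}
```

Scoping the query to each wrapper keeps the generic ptrack-content class from matching elsewhere in the page, which was the concern in the question.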


    This is an interesting problem!

    You basically have invalid HTML, and jsoup is smart enough to auto-correct it for you. As a result, your HTML structure is altered and your query no longer matches.

    This is the error:

    <p class="titleCard-synopsis previewModal--small-text">
      <div class="ptrack-content">A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.</div>
    </p>
    

    You can't nest a <div> element inside a <p> element like that.

    Paragraphs are block-level elements, and notably will automatically close if another block-level element is parsed before the closing </p> tag. [Source: <p>: The Paragraph element]

    Also, look at Nesting block level elements inside the <p> tag... right or wrong?
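
To see the auto-closing behaviour in isolation, here is a minimal sketch. The helper name isNestedInP and the tiny input strings are made up for illustration; only standard jsoup calls (parseBodyFragment, selectFirst, parents, tagName) are used.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper: after parsing, is the first element with this class
// still a descendant of some <p>?
fun isNestedInP(html: String, cssClass: String): Boolean {
    val el = Jsoup.parseBodyFragment(html).selectFirst(".$cssClass") ?: return false
    return el.parents().any { it.tagName() == "p" }
}

fun main() {
    // A <div> written inside a <p>: the parser closes the <p> before the <div>.
    println(isNestedInP("""<p><div class="inner">text</div></p>""", "inner"))
    // A <span> is allowed inside a <p>, so it stays nested.
    println(isNestedInP("""<p><span class="inner">text</span></p>""", "inner"))
    // Dump the tree jsoup actually built for the first case.
    println(Jsoup.parseBodyFragment("""<p><div class="inner">text</div></p>""").body().html())
}
```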

    This is how jsoup parses your tree:

    <html>
     <head></head>
     <body>
      <div class="titleCardList--metadataWrapper">
       <div class="titleCardList-title">
        <span class="titleCard-title_text">Map Her</span><span><span class="duration ellipsized">50m</span></span>
       </div>
       <p class="titleCard-synopsis previewModal--small-text"></p>
       <div class="ptrack-content">
        A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.
       </div>
       <p></p>
      </div>
      <div class="titleCardList--metadataWrapper">
       <div class="titleCardList-title">
        <span class="titleCard-title_text">Renaissance Titties</span><span><span class="duration ellipsized">50m</span></span>
       </div>
       <p class="titleCard-synopsis previewModal--small-text"></p>
       <div class="ptrack-content">
        Amerie, the new outcast, receives a party invitation that gives her butterflies. But when she manages to show up, a bitter surprise awaits.
       </div>
       <p></p>
      </div>
     </body>
    </html>
    

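    As you can see, the elements with class titleCard-synopsis have no children with class ptrack-content. Each ptrack-content <div> has instead become the next sibling of its <p>, which is why the + (adjacent sibling) selector in the TL;DR matches it.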
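
If you prefer to keep the loop from the question instead of a CSS selector, one option is to read the element that follows each titleCard-synopsis paragraph. This is a sketch, not a definitive fix: the helper name synopsesViaSibling is made up, and it assumes the ptrack-content <div> always directly follows the <p> in the parsed tree, as it does in the output above.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper: for each .titleCard-synopsis <p>, take the text of the
// element that follows it, but only if that element has class ptrack-content.
fun synopsesViaSibling(html: String): List<String> {
    val doc = Jsoup.parseBodyFragment(html)
    return doc.getElementsByClass("titleCard-synopsis").mapNotNull { p ->
        val next = p.nextElementSibling()
        if (next != null && next.hasClass("ptrack-content")) next.text() else null
    }
}

fun main() {
    // Shortened, made-up version of the markup from the question.
    val html = """
        <p class="titleCard-synopsis previewModal--small-text">
        <div class="ptrack-content">A hidden map rocks Hartley High.</div>
        </p>
    """.trimIndent()
    synopsesViaSibling(html).forEach { println(it) }
}
```

The hasClass check guards against picking up an unrelated sibling when a paragraph has no synopsis <div> after it.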