What does `:first-of-type` really mean?


Given the following HTML:

<p>paragraph text 1</p>
<p>paragraph text 1</p>
<div class="heading-h3">Category Title 1</div>
<p>1. <a href="#item1">
        <strong>Item One</strong>
    </a>
    <br>2. <a href="#item2">
            <strong>Item Two</strong>
        </a>
        <br>3. <a href="#item3">
                <strong>Item Three</strong>
            </a>
            <br>4. <a href="#item4">
                    <strong>Item Four</strong>
                </a>
<div class="heading-h3">Category Title 2</div>
<p>1. <a href="#item11">
        <strong>Item Eleven</strong>
    </a>
    <br>2. <a href="#item12">
            <strong>Item Twelve</strong>
        </a>
        <br>3. <a href="#item13">
                <strong>Item Thirteen</strong>
            </a>
            <br>4. <a href="#item14">
                    <strong>Item Fourteen</strong>
                </a>

I would like to use a single jsoup selector expression that returns only the first element out of the two <div class="heading-h3">.

That is, select("div.heading-h3") returns two elements and select("div.heading-h3").first() returns only the first of the two. I would like a single jsoup selector expression that limits the result set to that first element without resorting to Elements.first().

At first, I thought that "div.heading-h3:first-of-type" would accomplish that, but when tested, it returns no elements at all.

What am I missing in my interpretation of the :first-of-type structural pseudo-selector? Is it possible to accomplish what I want in a single jsoup selector, i.e. without resorting to Elements.first()?


Solution

  • Trying div.heading-h3:first-of-type against the exact HTML from the question at https://try.jsoup.org/ actually works as I originally expected.

    But in my Java program it doesn't work, because the actual HTML being parsed is much larger.

    Assuming that https://try.jsoup.org/ runs the latest jsoup release, I can only conclude that jsoup has some practical limitations that make it behave inconsistently when dealing with huge or "difficult" (to jsoup) HTML.

    This comment in a different SO thread suggests that "jsoup can alter (fix) the DOM...", which to me means "consistency or correctness is not guaranteed".
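
Another explanation worth ruling out before blaming the parser: as CSS defines it, `:first-of-type` matches an element only when it is the first sibling with its tag name under its parent, and the class is not part of that test. So `div.heading-h3:first-of-type` means "a `div` with class `heading-h3` that is also the first `div` among its siblings". If the larger page has any other `div` before the first heading under the same parent, nothing matches. The sketch below is a plain-Java model of that rule, not jsoup's implementation; `FirstOfTypeModel`, `Node`, and the `banner` div are hypothetical names used only for illustration, and the sibling list assumes the parser closes each open `<p>` when it meets a `<div>`.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstOfTypeModel {

    /** A child element reduced to the two facts the selector inspects. */
    static final class Node {
        final String tag;
        final String cssClass;

        Node(String tag, String cssClass) {
            this.tag = tag;
            this.cssClass = cssClass;
        }
    }

    /** True when no earlier sibling has the same tag name; classes are ignored. */
    static boolean isFirstOfType(List<Node> siblings, int index) {
        String tag = siblings.get(index).tag;
        for (int i = 0; i < index; i++) {
            if (siblings.get(i).tag.equals(tag)) {
                return false;
            }
        }
        return true;
    }

    /** Models "tag.cssClass:first-of-type" over the children of one parent. */
    static List<Node> select(List<Node> siblings, String tag, String cssClass) {
        List<Node> matches = new ArrayList<>();
        for (int i = 0; i < siblings.size(); i++) {
            Node n = siblings.get(i);
            if (n.tag.equals(tag) && n.cssClass.equals(cssClass) && isFirstOfType(siblings, i)) {
                matches.add(n);
            }
        }
        return matches;
    }

    /** Top-level children for the HTML in the question (assumed): p, p, div, p, div, p. */
    static List<Node> questionSiblings() {
        List<Node> s = new ArrayList<>();
        s.add(new Node("p", ""));
        s.add(new Node("p", ""));
        s.add(new Node("div", "heading-h3"));
        s.add(new Node("p", ""));
        s.add(new Node("div", "heading-h3"));
        s.add(new Node("p", ""));
        return s;
    }

    /** The same children preceded by a hypothetical <div class="banner">. */
    static List<Node> withLeadingDiv() {
        List<Node> s = new ArrayList<>();
        s.add(new Node("div", "banner"));
        s.addAll(questionSiblings());
        return s;
    }

    public static void main(String[] args) {
        // The first heading is the first div among its siblings, so it matches.
        System.out.println(select(questionSiblings(), "div", "heading-h3").size()); // prints 1
        // The banner div is now the first div, so neither heading matches.
        System.out.println(select(withLeadingDiv(), "div", "heading-h3").size());   // prints 0
    }
}
```

If this is what happens in the larger document, the empty result is the selector working as specified rather than an inconsistency. For "first match in document order", recent jsoup versions also offer `selectFirst("div.heading-h3")`, though that is a method call rather than a pure selector.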