java html parsing jsoup

JSoup | Fetching part of the HTML


I have a problem scraping a site with car ads. I would like to get the advertiser's name from each ad. The main problem is that the name is marked up differently depending on the ad.

1) Name is Kajetan

(https://www.otomoto.pl/oferta/mercedes-benz-klasa-e-w211-bardzo-dobry-stan-bez-wkladu-finansowego-warszawa-ryki-ID6BEBy9.html#2bd424144f)

   <div class="seller-box__seller-info">
    <small class="seller-box__seller-registration">Sprzedający na OTOMOTO od 2015</small>
    <small class="seller-box__seller-type">Osoba prywatna</small>
    <h2 class="seller-box__seller-name"> Kajetan </h2>
   </div>

2) Name is AS MOTORS Centrum Pojazdów Używanych KIA

(https://www.otomoto.pl/oferta/kia-ceed-1-6-crdi-136-km-m-bws-fvat-salon-serwis-polska-ID6BHFu3.html#2bd424144f)

   <div class="seller-box__seller-info">
    <small class="seller-box__seller-registration">Sprzedający na OTOMOTO od 2019</small>
    <small class="seller-box__seller-type">Dealer</small>
    <h2 class="seller-box__seller-name">
     <div class="seller-badge"> <img src="xx.jpg" data-toggle="tooltip" data-placement="bottom" title="" data-original-title="Ten dealer korzysta z pakietu usług Premium Plus" class="">
     </div>
     <a href="https://asmotorsuzywane.otomoto.pl" title="AS MOTORS Centrum Pojazdów Używanych KIA">AS MOTORS Centrum Pojazdów Używanych KIA</a>
    </h2>
   </div>

The first case is easy; I handle it like this:

public static String fetchOwnerName(String html) {
    Elements ownerElement = Jsoup.parse(html).getElementsByClass("seller-box__seller-info").select("h2");
    String owner = StringUtils.substringBetween(String.valueOf(ownerElement), "\">", "</h2>");
    return owner;
}

But in the second case the <h2> contains an additional <div>, and what is more, the advertiser's name sits inside an <a> element.

How should I change the fetchOwnerName method so that it works in both cases? I'm using the JSoup library to parse the HTML page. Thanks for all your suggestions.


Solution

  • You can get the text inside the h2 tag without worrying about the nested tags (div, img, a).

    Just call .text() on the selected elements:

    Elements ownerElement = Jsoup.parse(html).getElementsByClass("seller-box__seller-info").select("h2");
    String owner = ownerElement.text();
    

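    This works as long as the h2 element contains no text other than the advertiser's name. text() also trims the surrounding whitespace, so the first case returns "Kajetan" without the leading and trailing spaces.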
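
Put together, the whole method could look roughly like the sketch below. The class name SellerNameExtractor, the narrower selector h2.seller-box__seller-name and the choice to return null when the heading is missing are assumptions made for illustration, not part of the original post. selectFirst is only available in reasonably recent jsoup releases.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SellerNameExtractor {

    // Returns the advertiser's name, or null if the page has no seller-name heading
    // (the null fallback is an assumption; throw or return "" if that suits you better).
    public static String fetchOwnerName(String html) {
        // Select the heading by its own class instead of any h2 inside the seller box.
        Element nameHeading = Jsoup.parse(html).selectFirst("h2.seller-box__seller-name");
        if (nameHeading == null) {
            return null;
        }
        // text() combines the text of the element and all of its descendants
        // (here the <a>), and normalizes and trims whitespace. The badge <div>
        // and <img> carry no text, so they contribute nothing.
        return nameHeading.text();
    }
}
```

The same call handles both layouts from the question: the plain "Kajetan" heading and the dealer heading where the name is wrapped in a link.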