JSoup - Checking certain elements to see if they have text and then choosing only one


I'm trying to extract a price from Amazon using JSoup, but there are two different elements I can extract it from. I can get it from the aria-label attribute of the outer span (class "a-color-base sx-zero-spacing"), or from the text within the nested span (class "sx-price sx-price-large"). I would prefer to always read it from the aria-label attribute, but sometimes that attribute doesn't exist, so I then need to extract the price from the second span. My question is: how can I write an if-statement that checks whether the attribute has any text and, if it doesn't, extracts the text from the second span instead?

Also, I'm trying to get several prices from classes that are named identically, but when I write doc.select("span.sx-price.sx-price-large").get(0).text() for example, nothing pops up.

Here is the HTML code for one of the items that I want to extract a price from:

<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/B01MZYYWUH">1</a></div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary a-text-bold">Product Description</span><br><span class="a-size-small a-color-secondary">... Cards Radeon&trade; <em>RX</em> 460 Graphics Cards Radeon&trade; R9 <em>390</em> Graphics Cards ...</span></div>
</div></div></div></div></div></div></li>
<li id="result_2" data-asin="B00IAAU6SS" class="s-result-item celwidget ">
   <div class="s-item-container">
   <div class="a-fixed-left-grid">
   <div class="a-fixed-left-grid-inner" style="padding-left:218px">
   <div class="a-fixed-left-grid-col a-col-left" style="width:218px;margin-left:-218px;_margin-left:-109px;float:left;">
      <div class="a-row">
         <div aria-hidden="true" class="a-column a-span12 a-text-center">
            <a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><img src="https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US218_.jpg" srcset="https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US218_.jpg 1x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US327_FMwebp_QL65_.jpg 1.5x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US436_FMwebp_QL65_.jpg 2x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US500_FMwebp_QL65_.jpg 2.2935x" width="218" height="218" alt="Product Details" class="s-access-image cfMarker" data-search-image-load></a>
            <div class="a-section a-spacing-none a-text-center"></div>
         </div>
      </div>
   </div>
   <div class="a-fixed-left-grid-col a-col-right" style="padding-left:2%;*width:97.6%;float:left;">
   <div class="a-row a-spacing-small">
      <div class="a-row a-spacing-none scx-truncate-medium sx-line-clamp-3 s-list-title-long">
         <a class="a-link-normal s-access-detail-page  s-color-twister-title-link a-text-normal" title="Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01" href="https://rads.stackoverflow.com/amzn/click/B00IAAU6SS">
            <h2 data-attribute="Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01" data-max-rows="3" class="a-size-medium s-inline  s-access-title  a-text-normal">Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01</h2>
         </a>
      </div>
      <div class="a-row a-spacing-none"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">ARCTIC</span></div>
   </div>
   <div class="a-row">
   <div class="a-column a-span7">
   <div class="a-row a-spacing-none"><a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><span aria-label="$85.99" class="a-color-base sx-zero-spacing"><span class="sx-price sx-price-large">
      <sup class="sx-price-currency">$</sup>
      <span class="sx-price-whole">85</span>
      <sup class="sx-price-fractional">99</sup>
      </span>
      </span></a><span class="a-letter-space"></span><i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime"><span class="a-icon-alt">Prime</span></i>
   </div>
   <div class="a-row a-spacing-mini">
      <div class="a-row a-spacing-none"><span class="a-size-small a-color-secondary">FREE Shipping on eligible orders</span></div>
      <div class="a-row a-spacing-none"><span class="a-size-small a-color-price">Only 8 left in stock - order soon.</span></div>
   </div>
   <div class="a-row a-spacing-mini">
   <div class="a-row a-spacing-none">
      <div class="a-row a-spacing-mini"></div>
      <span class="a-size-small a-color-secondary">More Buying Choices</span>
   </div>
   <div class="a-row a-spacing-none">
   <a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><span class="a-color-secondary a-text-strike"></span><span class="a-size-base a-color-base">$85.99</span>


Solution

  • I would suggest selecting the elements with the class .sx-price, since the name suggests they contain a price. For each one, take its parent element, where the aria-label attribute is expected, and use a simple regular expression to check whether that attribute contains a price. If it does, take the price directly from the attribute; otherwise, assemble it from the nested child elements.

    Below is some code I played around with; it works pretty well.

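    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // `html` is a String holding the page source.
    final Document doc = Jsoup.parse(html);

    // All elements carrying the sx-price class.
    final Elements prices = doc.select(".sx-price");

    // Matches values such as "$85.99" or "85.99".
    final Pattern pattern = Pattern.compile("^\\$?([0-9]+)\\.([0-9]{2})$");

    for (Element el : prices) {
        final Element parent = el.parent();
        String price = "";
        if (parent.hasAttr("aria-label") && pattern.matcher(parent.attr("aria-label")).find()) {
            // Preferred source: the aria-label on the enclosing span.
            System.out.println("Extracting price from aria-label...");
            price = parent.attr("aria-label");

        } else {
            // Fallback: rebuild the price from the nested currency, whole and fractional parts.
            System.out.println("Extracting price from span body...");
            String currency = el.select(".sx-price-currency").text();
            String whole = el.select(".sx-price-whole").text();
            String fractional = el.select(".sx-price-fractional").text();

            price = String.format("%s%s.%s", currency, !whole.isEmpty() ? whole : "00", !fractional.isEmpty() ? fractional : "00");
        }

        System.out.println(price);
    }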
    

    I hope it helps.
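
The if/else decision described above can also be pulled out of the jsoup loop into a small helper that works on plain strings, which makes the fallback rule easy to check on its own. This is only a minimal sketch: the class name `PriceResolver` and the method `resolve` are hypothetical names introduced for illustration, and it assumes the caller has already read the aria-label value and the three nested text parts from the document.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): isolates the "aria-label first,
// nested parts second" rule so it can be exercised without parsing HTML.
final class PriceResolver {
    // Same shape as the pattern in the answer: optional "$", digits, ".", two digits.
    private static final Pattern PRICE = Pattern.compile("^\\$?[0-9]+\\.[0-9]{2}$");

    private PriceResolver() {}

    static String resolve(String ariaLabel, String currency, String whole, String fractional) {
        // Use the label only if it is present and actually looks like a price.
        if (ariaLabel != null && PRICE.matcher(ariaLabel).matches()) {
            return ariaLabel;
        }
        // Otherwise assemble the price from its parts, defaulting missing parts to "00".
        String c = (currency == null) ? "" : currency;
        String w = (whole == null || whole.isEmpty()) ? "00" : whole;
        String f = (fractional == null || fractional.isEmpty()) ? "00" : fractional;
        return String.format("%s%s.%s", c, w, f);
    }

    public static void main(String[] args) {
        System.out.println(resolve("$85.99", "$", "85", "99")); // $85.99 (taken from the label)
        System.out.println(resolve("", "$", "85", "99"));       // $85.99 (assembled from parts)
    }
}
```

Inside the loop, a call might look like `PriceResolver.resolve(el.parent().attr("aria-label"), currency, whole, fractional)`. jsoup's `attr` returns an empty string when the attribute is missing, so the fallback branch is taken in that case as well.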