Tags: java, web-scraping, jsoup

How to correctly parse HTML in Java


I'm trying to extract information from websites using Jsoup but I don't get the same HTML code as in my browser.

I tried to use .userAgent() but it didn't work. I currently use the following function which works for Amazon.com:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static String getHTML(String urlToRead) throws Exception {
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
    conn.setRequestMethod("GET");
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = rd.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            result.append(line).append('\n');
        }
    }
    return result.toString();
}

The website I'm trying to parse is http://www.asos.com/ but the price of the product is always missing.

I found this topic, which is pretty close to mine, but I would like to do it using only Java and no external app.


Solution

  • After experimenting with the site for a while, I came up with a solution.

    The site fetches the price of each item from a separate API, which is why the price is missing from the HTML that Jsoup receives: Jsoup does not run JavaScript, so it only sees the initial page. Unfortunately this takes a little more code than I first expected, and you'll still need to work out how to determine the product ID instead of using the hardcoded value. Other than that, the following code should work in your case.

    I've included comments that hopefully explain each step. I recommend taking a look at the API response, as there may be other data in it that you need. The same may be true of the product details and description, which would need to be parsed out of the elementById element.

    Good luck and let me know if you need any further help!

    import org.json.*;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.*;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    
    public class Main
    {
        final String productID = "8513070";
        final String productURL = "http://www.asos.com/prd/";
        final Product product = new Product();
    
        public static void main( String[] args )
        {
            new Main();
        }
    
        private Main()
        {
            getProductDetails( productURL, productID );
            System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
        }
    
        private void getProductDetails( String url, String productID )
        {
            try
            {
                // Append the product url and the product id to retrieve the product HTML
                final String appendedURL = url + productID;
    
                // Using Jsoup we'll connect to the url and get the HTML
                Document document = Jsoup.connect( appendedURL ).get();
                // We parse the HTML only looking for the product section
                Element elementById = document.getElementById( "asos-product" );
                // getElementById returns null if the page has no such element
                if ( elementById == null )
                {
                    return;
                }
                // To simply get the title we look for the H1 tag
                Elements h1 = elementById.getElementsByTag( "h1" );
    
                // Elements.text() joins the text of every matched H1, so only continue if a title was found
                if ( !h1.text().isEmpty() )
                {
                    // Add all data to Product object
                    product.productID = productID;
                    product.productName = h1.text().trim();
                    product.productPrice = getProductPrice(productID);
                }
            }
            catch ( IOException e )
            {
                e.printStackTrace();
            }
        }
    
        private String getProductPrice( String productID )
        {
            try
            {
                // Append the api url and the product id to retrieve the product price JSON document
                final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
                // Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
                String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();
    
                // As its JSON we want to parse the JSONArray until we get to the current price and return it.
                JSONArray jsonArray = new JSONArray( jsonDoc );
                JSONObject currentProductPriceObj = jsonArray
                        .getJSONObject( 0 )
                        .getJSONObject( "productPrice" )
                        .getJSONObject( "current" );
                return currentProductPriceObj.getString( "text" );
            }
            catch ( IOException e )
            {
                e.printStackTrace();
            }
    
            return "";
        }
    
        // Simple Product object to store the data
        class Product
        {
            String productID;
            String productName;
            String productPrice;
        }
    }
    

    Oh, and you'll also need the org.json library to parse the JSON response from the API.
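
One way to avoid the hardcoded productID might be to read it out of the product page URL. The following is only a minimal sketch using the plain JDK. It assumes that the ID is the run of digits following "/prd/" in the URL, as in the productURL used above; real product URLs on the site may look different. The class name ProductUrls and both method names are hypothetical, and the API endpoint is copied unchanged from the code above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductUrls
{
    // Assumption: the product ID is the run of digits that follows "/prd/" in the URL
    private static final Pattern PRODUCT_ID = Pattern.compile( "/prd/(\\d+)" );

    // Endpoint taken from the answer's getProductPrice method; it may change or stop responding
    private static final String API_URL =
            "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=%s&store=COM";

    // Returns the product ID found in the URL, or an empty Optional if the URL has no "/prd/<digits>" part
    static Optional<String> extractProductId( String productUrl )
    {
        Matcher matcher = PRODUCT_ID.matcher( productUrl );
        return matcher.find() ? Optional.of( matcher.group( 1 ) ) : Optional.empty();
    }

    // Builds the price API URL for a single product ID
    static String buildStockPriceUrl( String productId )
    {
        return String.format( API_URL, productId );
    }

    public static void main( String[] args )
    {
        String url = "http://www.asos.com/prd/8513070";
        // Prints the API URL only if an ID could be extracted
        extractProductId( url )
                .map( ProductUrls::buildStockPriceUrl )
                .ifPresent( System.out::println );
    }
}
```

The extracted ID could then be passed to getProductDetails in place of the constant. Returning an Optional makes the "no ID in this URL" case explicit, so the caller can skip the API request instead of sending one with an empty ID.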