Tags: java, html, web-scraping, string-operations

Extract links from a web page in core Java using indexOf, substring vs pattern matching


I am trying to get the links in a web page using core Java. I am following the code below, taken from "Extract links from a web page", with some modifications.

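        // url, is, br and line are declared earlier (not shown)
        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            // print every line that appears to contain a link
            while ((line = br.readLine()) != null) {
                if (line.contains("href="))
                    System.out.println(line.trim());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }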

With respect to extracting each link, most of the answers in the above post suggest using pattern matching. However, as I understand it, pattern matching is an expensive operation. So I want to use indexOf and substring operations to get the link text from each line, as below:

    private static Set<String> getUrls(String line, int firstIndexOfHref) {
        int startIndex = firstIndexOfHref;
        int endIndex;
        Set<String> urls = new HashSet<>();

        while(startIndex != -1) {
            try {
                endIndex = line.indexOf("\"", startIndex + 6);
                String url = line.substring(startIndex + 6, endIndex);
                urls.add(url);
                startIndex = line.indexOf("href=\"http", endIndex);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        return urls;
    }

I have tried this on a few pages and it works properly. However, I am not sure whether this approach always works. I want to know if this logic can fail in real-world scenarios.

Please help.


Solution

  • Your code relies on well-formatted HTML with each link on a single line. It will not handle the various other ways to write <a href, such as single quotes, no quotes, extra whitespace (including newlines) between "a", "href" and "=", relative paths, or other protocols such as file: or ftp:.

    Some examples you would need to consider:

    <a href 
       =/questions/63090090/extract-links-from-a-web-page-in-core-java-using-indexof-substring-vs-pattern-m 
    

    or

    <a href = 'http://host'
    

    or

    <a 
    href = 'http://host'
    

    That's why the other question has many answers, including HTML validators and regex patterns.
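
To make the cases above concrete, here is a minimal sketch of a more tolerant extractor. It assumes the whole page has been read into one String (so a tag may span several lines), and the names HrefExtractor, extractHrefs and HREF are made up for this illustration. It accepts double-quoted, single-quoted and unquoted values, with optional whitespace around "=". It is still only a regex, not an HTML parser, so it can be fooled by things like commented-out markup.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Illustrative pattern (an assumption, not taken from the linked question):
    // "<a", any other attributes, whitespace, "href", optional whitespace
    // (newlines included) around "=", then a double-quoted, single-quoted,
    // or unquoted value captured in group 1, 2 or 3 respectively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values in the order they first appear, without duplicates.
    static Set<String> extractHrefs(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                    break;
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // The three shapes from the examples above, with shortened sample values.
        String html = "<a href \n   =/questions/63090090/x>q</a>\n"
                + "<a href = 'http://host'>h</a>\n"
                + "<a \nhref = \"ftp://files/readme\">f</a>";
        for (String url : extractHrefs(html)) {
            System.out.println(url);
        }
    }
}
```

The Pattern is compiled once and reused, which avoids most of the cost that makes regex look expensive. Relative values such as /questions/... are returned as written and would still need to be resolved against the page URL.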