
How can I extract all links (href) in an HTML file?


I am trying to extract all links from an HTML file using Java.

The pattern seems to be <a href = "Name">. I would like to obtain the URL that would enable me to access the desired webpage.

Can you guys help me out with an approach (String.contains? String.indexOf?)?

Thank you.


Solution

  • A basic approach would be to use regex matching.

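        // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
        String html = "YOUR HTML";
        // Matches an opening <a ...> tag with a double-quoted href, allowing
        // spaces around '=' and other attributes before or after href.
        String regex = "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*\"([^\"]+)\"[^>]*>";
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        // find() resumes after the previous match, so no manual index is needed.
        while (matcher.find()) {
            String wholething = matcher.group(); // the whole opening tag, from "<a" to ">"
            String link = matcher.group(1); // just the link
            // do something with wholething or link.
        }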
    

    Alternatively, you could parse the HTML into something like a Document instead of using a regex. I don't know much about that approach.
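
The answer does not say which parser it has in mind, so the following is only a sketch of the parser-based idea under one assumption: that the HTML parser bundled with the JDK (javax.swing.text.html) is acceptable. The class name LinkExtractor and the method extractLinks are made up for this illustration. A dedicated library such as jsoup is another common choice, but it is an extra dependency and is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the original answer): collects href values
// of <a> tags using the HTML parser that ships with the JDK.
final class LinkExtractor {

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        // The parser calls handleStartTag for every opening tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString()); // skip anchors without href
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the
            // markup, since the input is already a decoded Java String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(html) returns the href values in document order. Relative links such as "/page" would still need to be resolved against the address of the page they came from, for example with the resolve method of java.net.URI.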