I am trying to extract all links from an HTML file using Java.
The pattern seems to be <a href = "Name">.
I would like to obtain the URL that would enable me to access the desired webpage.
Can you guys help me out with an approach (String.contains? String.indexOf?)?
Thank you.
A basic approach would be to use regex matching.
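import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "YOUR HTML";
// Allow any whitespace around '=' and capture the quoted href value.
String regex = "<a href\\s*=\\s*\"([^\"]+)\">";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(html);
// find() with no argument resumes right after the previous match,
// so there is no need to track an index by hand.
while (matcher.find()) {
    String wholething = matcher.group(); // includes "<a href" and ">"
    String link = matcher.group(1); // just the link
    // do something with wholething or link.
}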
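Note that the pattern above only matches tags where href is the sole attribute and the value is in double quotes. A slightly more tolerant variant might look like the sketch below. The class name RegexLinkExtractor, the method extractLinks, and the sample HTML are made up for illustration; the pattern assumes the href value is quoted and contains no line breaks, and it is still not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class RegexLinkExtractor {

    // <a, then any other attributes, then href = "value" or 'value'.
    // Group 1 is the opening quote, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input invented for this example.
        String html = "<p><a href = \"a.html\">A</a> "
                + "<A class='x' HREF='b.html'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [a.html, b.html]
    }
}
```

The back-reference \\1 makes the closing quote match whichever quote opened the value, so both quoting styles are handled by one pattern.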
On the other hand, you could use something like Document. I don't know much about this.
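For what a parser-based approach could look like, here is a minimal sketch using the HTML parser that ships with the JDK (javax.swing.text.html). This is an assumption on my part and not necessarily the Document API meant above; the class name ParserLinkExtractor and the method extractLinks are made up for illustration. The parser calls handleStartTag for every opening tag, so the callback only has to pick out the href attribute of a tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original answer.
public class ParserLinkExtractor {

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        // Called by the parser once per opening tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input invented for this example.
        String html = "<html><body><a href=\"http://example.com/a\">A</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Unlike the regex, this does not care about attribute order, quoting style, or whitespace inside the tag.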