I was working on a homework problem that involves removing all of the HTML tags "<...>" from the text of an HTML document and then counting all of the tokens in that text.
I wrote a solution that works, but it all comes down to a single line of code that I didn't actually write, and I'm curious to learn more about how this kind of code works.
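    // Assumes java.util.Scanner is imported.
    public static int tagStrip(Scanner in) {
        int count = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            // Remove every "<...>" tag that starts and ends on this line.
            line = line.replaceAll("<[^>\r\n]*>", "");
            // Count the whitespace-separated tokens that remain.
            Scanner scan = new Scanner(line);
            while (scan.hasNext()) {
                scan.next();
                count++;
            }
        }
        return count;
    }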
The line that calls replaceAll() is the one I'm curious about. I understand how the replaceAll() method works, but I'm not sure how the String "<[^>\r\n]*>" works. I read a little bit about patterns and experimented with it.
I replaced it with "<[^>]+>" and it still works exactly the same. So I was hoping somebody could explain how these characters work and what they do, especially in the context of this type of program.
If you wish to explore or modify your expression, you can test it on regex101.com.
<[^>]+>
may not work as intended in general, since [^>] also matches line breaks, so a single match could span several lines, which seems to be undesired.
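To make the difference concrete, here is a minimal sketch that runs both patterns on a few inputs. Reading the first pattern piece by piece: "<" matches a literal opening bracket, "[^>\r\n]" is a negated character class matching any single character except ">", carriage return, or line feed, "*" repeats that class zero or more times, and the final ">" matches the closing bracket. The second pattern drops the line-break exclusion and uses "+" (one or more), so it can cross line breaks but does not match an empty "<>". The class name TagPatternDemo, the strip helper, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

public class TagPatternDemo {
    // Pattern from the question: '<', zero or more characters that are
    // not '>', CR, or LF, then '>'.
    static final Pattern NO_LINE_BREAKS = Pattern.compile("<[^>\r\n]*>");

    // Variant from the question: '<', one or more characters that are
    // not '>' (line breaks allowed), then '>'.
    static final Pattern ANY_BUT_GT = Pattern.compile("<[^>]+>");

    // Hypothetical helper: remove every match of the pattern from the text.
    static String strip(Pattern p, String text) {
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "<p>Hello <b>world</b></p>";
        String splitTag = "a <img\nsrc=\"x.png\"> b"; // tag spans two lines
        String emptyTag = "x <> y";

        // Both patterns agree when every tag sits on one line.
        System.out.println(strip(NO_LINE_BREAKS, oneLine)); // Hello world
        System.out.println(strip(ANY_BUT_GT, oneLine));     // Hello world

        // Only the second pattern removes a tag that contains a line break.
        System.out.println(strip(NO_LINE_BREAKS, splitTag).equals(splitTag)); // true
        System.out.println(strip(ANY_BUT_GT, splitTag).equals("a  b"));       // true

        // Only the first pattern removes an empty "<>".
        System.out.println(strip(NO_LINE_BREAKS, emptyTag).equals("x  y"));   // true
        System.out.println(strip(ANY_BUT_GT, emptyTag).equals(emptyTag));     // true
    }
}
```

In the program from the question, each string comes from nextLine(), which returns the line without its line separator, so the text handed to replaceAll() never contains a line break. That is likely why both patterns appeared to behave identically there; on that input they would differ only for an empty "<>".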
You can also visualize your expressions in jex.im.