Tags: regex, string, regex-negation, regex-group, regex-greedy

RegEx for matching everything except new lines and a special char


I was working on a homework problem that involves removing all of the HTML tags ("<...>") from the text of an HTML document and then counting all of the tokens in that text.

I wrote a solution that works, but it all comes down to a single line of code that I didn't actually write, and I'm curious to learn more about how this kind of code works.

public static int tagStrip(Scanner in) {  // needs: import java.util.Scanner;
    int count = 0;

    while (in.hasNextLine()) {
        String line = in.nextLine();

        line = line.replaceAll("<[^>\r\n]*>", "");  // delete every <...> tag on this line

        Scanner scan = new Scanner(line);

        while (scan.hasNext()) {
            scan.next();  // consume one whitespace-delimited token
            count++;
        }
    }
    return count;
}

Line 7 is the one I'm curious about. I understand how the replaceAll() method works, but I'm not sure how the string "<[^>\r\n]*>" works. I read a little about patterns and experimented with it a bit.
When I replaced it with "<[^>]+>", the program still worked exactly the same. So I was hoping somebody could explain what these characters do, especially in the context of this type of program.


Solution

  • RegEx

    If you wish to explore or modify your expression, you can do so at regex101.com.

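    <[^>]+> may not work as intended: [^>] matches any character other than >, line breaks included, so a match could run across new lines, which seems to be undesired. Adding \r\n to the character class, as in <[^>\r\n]*>, keeps each match on a single line. Because your program reads the input one line at a time with nextLine(), a line normally never contains a line break, which would explain why both patterns gave you the same result.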

    RegEx Circuit

    You can also visualize your expressions in jex.im.
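
Coming back to the two patterns from the question, here is a minimal sketch of where they differ. The class name TagPatternDemo, the helper strip(), the constant names, and the sample strings are all made up for illustration; only the two pattern strings come from the question.

```java
import java.util.regex.Pattern;

class TagPatternDemo {
    // <           a literal '<'
    // [^>\r\n]*   zero or more characters that are not '>', CR, or LF
    // >           a literal '>'
    static final Pattern SINGLE_LINE = Pattern.compile("<[^>\r\n]*>");

    // [^>]+       one or more characters that are not '>' (line breaks allowed)
    static final Pattern ANY_CHAR = Pattern.compile("<[^>]+>");

    // Hypothetical helper: remove every match of p from text.
    static String strip(Pattern p, String text) {
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A tag whose attributes continue on the next line.
        String twoLines = "a <b\nclass=\"x\">bold</b> word";
        // SINGLE_LINE leaves the broken-up opening tag alone and removes only </b>.
        System.out.println(strip(SINGLE_LINE, twoLines));
        // ANY_CHAR matches across the line break and removes both tags.
        System.out.println(strip(ANY_CHAR, twoLines));

        // An empty pair of brackets: '*' accepts zero characters, '+' needs at least one.
        String empty = "x <> y";
        System.out.println(strip(SINGLE_LINE, empty)); // "<>" is removed
        System.out.println(strip(ANY_CHAR, empty));    // "<>" is kept
    }
}
```

On input that has already been split into lines, as in tagStrip(), the only remaining difference is the empty "<>" case.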