Search code examples
javaregexspacesalphanumeric

Java - Unknown characters passing as [a-zA-z0-9]*?


I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.

When I run this,

Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
       System.out.println(someData); //someData contains gottenData

certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle): screenshot

In case you're wondering, it DOES also display Text, it's not all like that.

For now, I don't mind the [?] as long as it also contains some string along with it.

Please help.

[EDIT] as far as I can tell from the (very large) input, the [?]'s are either white spaces either nothing at all; maybe there's some sort of encoding issue, also perhaps something to do with #text nodes (input is xml)


Solution

  • The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "One or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches, it will not require a full string match and you can use the regex [a-zA-Z0-9]+.