c regex posix

Unable to form the required regex in C


I am trying to write a regex which can search a string and return true if it matches with the regex and false otherwise.

The check should ensure that the string is a wildcard domain name of a website.

Example:

  • *.cool.dude is valid

  • *.cool is not valid

  • abc.cool.dude is not valid

So I wrote something like this:

\\*\\.[.*]\\.[.*]

However, this also accepts the string *.. as valid, because * means zero or more occurrences.

I am looking for something which ensures that there is at least one occurrence.

Example: *.a.b -> valid but *.. -> invalid

How can I change the regex to support this?

I have already tried doing something like this:

\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesn't work

\\*\\.([.+])\\.(.+) -> doesn't work

^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesn't work

I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.

PS. Looking for a solution which works in C.


Solution

  • [.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) *". […] defines a character class, which matches exactly one character from the specified set. Brackets are not at all the same as parentheses.

    So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
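To make the difference concrete, here is a minimal sketch, assuming a POSIX system that provides regex.h. The helper name regex_matches is made up for illustration and is not part of the library. It compiles a pattern with whatever flags you pass and reports whether the text matches.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of regex.h): returns 1 if `text` matches
 * `pattern`, 0 if it does not, and -1 if the pattern fails to compile.
 * `cflags` is passed to regcomp, e.g. REG_EXTENDED, or 0 for basic regexes. */
int regex_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    /* REG_NOSUB: we only need a yes/no answer, not the match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, "^[.*]$" compiled with REG_EXTENDED should accept "." and "*" but reject "ab", because the brackets match exactly one character. "^.+$" with REG_EXTENDED should accept "ab" and reject the empty string. Compiled with flags 0 (a basic regex), "^a+$" should reject "aaa" and accept the literal text "a+", since + is an ordinary character there.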

    But that's probably not really what you want. My guess is that you want something like:

    ^[*]([.][A-Za-z0-9_]+){2,}$
    

    although you'll have to correct the character class to specify the precise set of characters you think are legitimate.

    Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.
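Putting the pattern and the flag together, a check might look like the sketch below. It assumes a POSIX regex.h; the names is_wildcard_domain and WILDCARD_DOMAIN_PATTERN are hypothetical, and the character class is only the guess from above, so adjust it to whatever you consider a legal label.

```c
#include <regex.h>
#include <stddef.h>

/* Pattern suggested above; the character class is an assumption. */
#define WILDCARD_DOMAIN_PATTERN "^[*]([.][A-Za-z0-9_]+){2,}$"

/* Hypothetical helper: returns 1 if `name` looks like a wildcard domain,
 * 0 if it does not, and -1 if the pattern could not be compiled. */
int is_wildcard_domain(const char *name)
{
    regex_t re;

    /* REG_EXTENDED is what makes + and {2,} work as repetition operators. */
    if (regcomp(&re, WILDCARD_DOMAIN_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, name, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Here is_wildcard_domain("*.cool.dude") and is_wildcard_domain("*.a.b") should return 1, while "*.cool", "abc.cool.dude" and "*.." should return 0. If you call this often, it is probably worth compiling the regex once and reusing it rather than recompiling on every call.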

    Some notes:

    • The {2,} requires at least two components after the *, so that *.cool doesn't match.

    • The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.

    • Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
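The last two notes can be checked with a small sketch, again assuming POSIX regex.h. The helper ere_matches and the three macro names are invented for this example. It compares the anchored pattern with an unanchored copy, and with the same pattern written using backslash escapes instead of single-character classes.

```c
#include <regex.h>
#include <stddef.h>

/* The same pattern three ways (macro names are hypothetical). */
#define PAT_ANCHORED   "^[*]([.][A-Za-z0-9_]+){2,}$"
#define PAT_UNANCHORED "[*]([.][A-Za-z0-9_]+){2,}"
/* Backslash form: each regex backslash must be doubled in a C string. */
#define PAT_BACKSLASH  "^\\*(\\.[A-Za-z0-9_]+){2,}$"

/* Hypothetical helper: 1 on match, 0 on no match, -1 if regcomp fails.
 * Always compiles as an extended regex. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With these definitions, ere_matches(PAT_UNANCHORED, "abc*.cool.dude!") should return 1, because the pattern is free to match a substring, while ere_matches(PAT_ANCHORED, "abc*.cool.dude!") should return 0. PAT_BACKSLASH and PAT_ANCHORED should agree on every input; the bracket form just avoids the doubled backslashes.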

    For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.