c++ regex perl pcre ecmascript-5

Conditional group matching using regex


How can I match a group unless it starts with a certain character?

For example, I have the following sentence:

just _checking any _string.

I have the regex ([\w]+), which matches all the words {just, _checking, any, _string}. What I want is to match only the words that don't start with the character _, i.e. {just, any}.

The above example is a watered down version of what I'm actually trying to parse.

I'm parsing a code file that contains strings in the following format:

package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10

The output I need is one match for every fully qualified name (a name with at least one . in the middle), where each match stops as soon as a < is encountered.

So the result should be:

{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }

I wrote ([\w]+\.)+([\w]+) to parse it, but it also matches class6.class7 and class9.class10, which I don't want. I know it's way off the mark and I apologize for that.

That is why I asked earlier whether I can ignore a capture group that starts with a specific character.

Here's the link where I tried it: regex101

There, everything it matches is correct except that it also matches class6.class7 and class9.class10.

I'm not sure how to proceed on this. I'm using C++14, which supports the ECMAScript grammar as well as POSIX-style grammars.

EDIT: as suggested by @Corion, I've added more details.
EDIT2: added the regex101 link.


Solution

  • Just use a word boundary \b followed by a lookahead that makes sure the first character is not an underscore (but is still a word character):

    (\b(?=[^_])[\w]+)
    

    The following Perl one-liner validates that:

    perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
    
    Matched <just>
    Matched <any>
    

    regex101 playground
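Since the question targets C++14, the same pattern can be sketched with std::regex, whose default ECMAScript grammar accepts both \b and the (?=...) lookahead. This is only a minimal sketch: the function name words_without_leading_underscore is a hypothetical helper chosen for illustration, not something from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every word that does not begin with '_'.
std::vector<std::string> words_without_leading_underscore(const std::string& text) {
    // \b anchors at a word boundary; (?=[^_]) rejects a leading underscore;
    // \w+ then consumes the word itself.
    static const std::regex pattern(R"(\b(?=[^_])\w+)");

    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        words.push_back(it->str());  // the whole match is the word
    }
    return words;
}
```

Calling words_without_leading_underscore("just _checking any _string.") should yield the same two words as the Perl one-liner, just and any.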

    In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):

    (\b(?=[^_.])[\w.]+)
    
    perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
    
    just _checking any _string. and. this. inclu.ding dots
    Matched <just>
    Matched <any>
    Matched <and.>
    Matched <this.>
    Matched <inclu.ding>
    Matched <dots>
    

    regex101 playground

    After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only at the beginning of the input or after whitespace (\s) or an angle bracket (< or >). The fully qualified matches are achieved by forcing a dot (\.) to appear in the match:

    (?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
    
    perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
    
    Matched <package1.class1>
    Matched <package2.class2>
    Matched <package3.class3>
    Matched <package4.class4>
    Matched <package5.package6.class5>
    

    regex 101 playground
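For the C++14 side of the question, the final expression can be tried with std::regex in its default ECMAScript grammar. The sketch below assumes a hypothetical helper named qualified_names; it reads capture group 1 of each match, because the whole match also includes the leading delimiter (whitespace or an angle bracket).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: extract fully qualified names that directly follow
// the start of the input, whitespace, or an angle bracket.
std::vector<std::string> qualified_names(const std::string& text) {
    static const std::regex pattern(
        R"((?:^|[<>\s])(?:(?![_.]|\bextends\b)(\w+\.[\w.]+)))");

    std::vector<std::string> names;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        // Group 1 holds the name without the leading delimiter.
        names.push_back((*it)[1].str());
    }
    return names;
}
```

Names such as class6.class7 and class9.class10 are skipped because they directly follow a dot, which is neither a permitted delimiter nor an allowed first character.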