
Negative lookbehind in RegEx: Matching multiple POS-tags at once


I am still fairly new to regex, so I would appreciate any help. I am trying to use regular expressions to find specific grammatical patterns in a text corpus that was part-of-speech-tagged using the CLAWS7 tagset. Here is a sample:

Ya_UH and_CC then_RT uhm_NN1 we_PPIS2 wrote_VVD in_RP but_CCB already_RR taken_VVN up_RP that_DD1 day_NNT1 that_CST we_PPIS2 wanted_VVD actually_RR they_PPHS2 said_VVD still_RR available_JJ you_PPY know_VV0 so_RR by_II that_DD1 time_NNT1 we_PPIS2 we_PPIS2 write_VV0 in_II our_APPGE letter_NN1 two_MC weeks_NNT2 later_RRR already_RR taken_VVN up_RP Quite_RG good_RR uh_UH P ICE-SIN:S1A-001#74:1:B Ask_VV0 her_PPHO1 I_PPIS1 left_VVD my_APPGE house_NN1 at_II one_MC1 met_VVD PRO_NN1 in_II school_NN1 at_II two_MC Ya_PPY so_RR waited_VVD you_PPY know_VV0 they_PPHS2 say_VV0 half_DB hour_NNT1 later_RRR And_CC and_CC it_PPH1 was_VBDZ still_RR drizzling_JJ and_CC raining_VVG

The pattern I am looking for is every instance of \w*\_V.*? (= every verb) that is not preceded by a pronoun. Pronouns can have these tags:

_PN _PN1 _PNQO _PNQS _PNQV _PNX1 _PPGE _PPH1 _PPHO1 _PPHO2 _PPHS2 _PPIO1 _PPIO2 _PPIS1 _PPIS2 _PPX1 _PPX2 _PPY

In the sample, the desired regex should ideally match:

taken_VVN
met_VVD 
Ask_VV0
waited_VVD
raining_VVG

Using the negative lookbehind, I managed to create the following expression, which only matches verbs that are not preceded by a _PPIS2 tag:

(?<!\_PPIS2)\s\w*\_V.*?

What could I do to extend it to all the other pronoun tags? I've tried the expressions below, but they either do not match anything at all or match the wrong instances.

(?<!\_P.*)\s\w*\_V.*? (no match)
(?<![\_P.*])\s\w*\_V.*? (wrong results)

Any ideas or explanations would be greatly appreciated.


Solution

  • You can use this PCRE regex in Sublime Text:

    \b\w*_P\w*\h+\w*_V\w*(*SKIP)(*F)|\b\w*_V\w*
    


    RegEx Details:

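    • \b\w*_P\w*: Match a token containing _P, i.e. one whose tag starts with P (all the pronoun tags listed in the question do)
    • \h+: Match one or more horizontal whitespace characters (spaces or tabs)
    • \w*_V\w*: Match a token containing _V, i.e. one with a verb tag
    • (*SKIP)(*F): Discard the pronoun + verb pair matched so far and force a failure, so the search resumes after the skipped text
    • |: OR
    • \b\w*_V\w*: Match a token containing _V (these are our matches)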
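
If the corpus is processed in Python rather than in an editor, note that the standard-library re module does not support (*SKIP)(*F). A minimal sketch of the same idea, assuming tokens are separated by spaces or tabs: let the first alternative consume every pronoun + verb pair without capturing it, capture the remaining verbs in group 1, and keep only the matches where group 1 is set. The names SAMPLE, PATTERN and verbs_without_pronoun are made up for this illustration.

```python
import re

# The sample from the question, split across lines for readability.
SAMPLE = (
    "Ya_UH and_CC then_RT uhm_NN1 we_PPIS2 wrote_VVD in_RP but_CCB "
    "already_RR taken_VVN up_RP that_DD1 day_NNT1 that_CST we_PPIS2 "
    "wanted_VVD actually_RR they_PPHS2 said_VVD still_RR available_JJ "
    "you_PPY know_VV0 so_RR by_II that_DD1 time_NNT1 we_PPIS2 we_PPIS2 "
    "write_VV0 in_II our_APPGE letter_NN1 two_MC weeks_NNT2 later_RRR "
    "already_RR taken_VVN up_RP Quite_RG good_RR uh_UH P "
    "ICE-SIN:S1A-001#74:1:B Ask_VV0 her_PPHO1 I_PPIS1 left_VVD my_APPGE "
    "house_NN1 at_II one_MC1 met_VVD PRO_NN1 in_II school_NN1 at_II two_MC "
    "Ya_PPY so_RR waited_VVD you_PPY know_VV0 they_PPHS2 say_VV0 half_DB "
    "hour_NNT1 later_RRR And_CC and_CC it_PPH1 was_VBDZ still_RR "
    "drizzling_JJ and_CC raining_VVG"
)

# Alternative 1 consumes "pronoun verb" pairs and captures nothing.
# Alternative 2 captures any verb that alternative 1 did not consume.
# [ \t]+ stands in for PCRE's \h+, which Python's re does not have.
PATTERN = re.compile(r"\b\w*_P\w*[ \t]+\w*_V\w*|\b(\w*_V\w*)")


def verbs_without_pronoun(text):
    """Return verb tokens that are not directly preceded by a pronoun token."""
    return [
        m.group(1)
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # None means a pronoun + verb pair was consumed
    ]


print(verbs_without_pronoun(SAMPLE))
# ['taken_VVN', 'taken_VVN', 'Ask_VV0', 'met_VVD', 'waited_VVD', 'raining_VVG']
```

The result is in text order, and taken_VVN is returned twice because it occurs twice in the sample. Like the PCRE version, this treats every tag starting with P as a pronoun tag, which holds for the tags listed in the question.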