Tags: regex, tags, absolute, corpus, phrase

Creating more complex regexes from TAG format


So I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus).

I am starting with a string like this:

Arms_NNS folded_VVN ,_,

The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and other strings with the same tags (NNS or NN, followed by VVN or VBG, followed by a comma).

The following regex is what I am trying to use, but it is not finding anything:

[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,

Solution

  • Given the input string

    Arms_NNS folded_VVN ,_,
    

    the following regex

    (\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)
    

    matches the whole string (and captures it - if you don't know what that means, that probably means it doesn't matter to you).

    Given a longer string (which I made up)

    Dog_NN Arms_NNS folded_VVN ,_, burp_VV
    

    it still matches the part you want.
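As a quick check of the two cases above, here is a minimal Python sketch. The choice of Python's `re` module and the helper name `find_phrases` are assumptions for illustration; the question does not say which regex engine or corpus tool is in use.

```python
import re

# Pattern from the answer: a noun (NN or NNS), then a participle
# (VBG or VVN), then the tagged comma, each separated by one space.
PATTERN = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")


def find_phrases(text):
    """Return every substring of `text` that matches PATTERN."""
    return PATTERN.findall(text)


short = "Arms_NNS folded_VVN ,_,"
longer = "Dog_NN Arms_NNS folded_VVN ,_, burp_VV"

print(find_phrases(short))
print(find_phrases(longer))
```

Both calls return a one-element list containing `Arms_NNS folded_VVN ,_,`. In the longer string, `Dog_NN` is skipped because the token after it is not tagged VBG or VVN.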

    If the _VVN part is optional, you can use

    (\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)
    

    which matches with or without the word_VVN / word_VBG part, but with at most one of them.
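A small Python sketch of the optional-group variant, again assuming Python's `re` module. The helper `match_optional` and the third sample string (with `crossed_VVN`) are made up for illustration.

```python
import re

# Same pattern as above, but the participle token is wrapped in an
# optional non-capturing group: (?: ... )?
OPTIONAL = re.compile(r"(\w+_(?:NN|NNS) (?:\w+_(?:VBG|VVN) )?,_,)")


def match_optional(text):
    """Return the matched phrase, or None if there is no match."""
    m = OPTIONAL.search(text)
    return m.group(1) if m else None


samples = [
    "Arms_NNS folded_VVN ,_,",              # participle present
    "Arms_NNS ,_,",                         # participle absent
    "Arms_NNS folded_VVN crossed_VVN ,_,",  # two participles (made-up sample)
]
for s in samples:
    print(repr(s), "->", match_optional(s))
```

The first two samples match in full. The third does not match, because the group allows at most one participle between the noun and the comma.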


    Your more general questions:

    I find it hard to explain how these things work. I'll try to explain the constituent parts:

    • \w matches word characters - letters, digits and the underscore
    • \w+ matches one-or-more of them (\w* would match zero-or-more)
    • (NN|NNS) means "match NN or NNS"
    • ?: at the start of a group, as in (?:NN|NNS), means "match but don't capture" - suggest googling what capturing means in relation to regexes.
    • ? alone means "match 0 or 1 of the thing before me" - so x? would match "" or "x" but not "xx".
    • None of the characters in ,_, are special, so we can match them just by putting them in the regex.
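To make the capturing and optional pieces concrete, here is a short sketch using Python's `re` module (an assumption; other engines behave the same way for these constructs). It uses `re.fullmatch` so that each pattern has to cover the whole test string.

```python
import re

token = "Arms_NNS"

# Capturing group: the tag is stored and can be read back with .groups().
with_capture = re.fullmatch(r"\w+_(NN|NNS)", token)
# Non-capturing group: the same text matches, but nothing is stored.
without_capture = re.fullmatch(r"\w+_(?:NN|NNS)", token)

captured_tags = with_capture.groups()
uncaptured_tags = without_capture.groups()

# "?" makes the preceding item optional: zero or one occurrence.
optional_results = {s: bool(re.fullmatch(r"x?", s)) for s in ["", "x", "xx"]}

print(captured_tags)
print(uncaptured_tags)
print(optional_results)
```

The capturing version yields `('NNS',)`, the non-capturing version yields an empty tuple, and `x?` accepts `""` and `"x"` but rejects `"xx"`.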

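    One problem with your regex is that \w will not match a comma (only "word characters"), so the final [\w-]+ _, can never match ,_,. The spaces before _(?:VBG|VVN) and _, are a second problem: in your data there is no space between a word and its tag.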

    [\w-] is a character class that matches a word character or a literal hyphen, so [\w-]+ also accepts hyphenated words such as well-known. It is valid, and it is not what breaks your regex.
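The pattern from the question can be tested directly. The sketch below uses Python's `re` module (an assumption about the engine) and takes the pattern exactly as quoted in the question, including the spaces before `_`. The `repaired` pattern is only one possible fix, offered as an illustration rather than as the asker's or answerer's intended solution.

```python
import re

sample = "Arms_NNS folded_VVN ,_,"

# The pattern exactly as quoted in the question, stray spaces included.
original = r"[\w-]+_(?:NN|NNS)\W+[\w-]+ _(?:VBG|VVN)\W+[\w-]+ _,"

# One possible repair (an assumption, not taken from the thread):
# drop the spaces before "_" and match the comma token literally.
repaired = r"[\w-]+_(?:NN|NNS)\W+[\w-]+_(?:VBG|VVN)\W+,_,"

original_match = re.search(original, sample)
repaired_match = re.search(repaired, sample)

print(original_match)
print(repaired_match.group(0) if repaired_match else None)
```

The original pattern finds nothing, while the repaired one matches the whole sample. Note that `\W+` accepts any run of non-word characters between tokens, which is looser than the single space used in the solution above.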

    My solution assumes there is exactly one space, and nothing else, between your tagged words.
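If the corpus might contain tabs or repeated spaces between tokens, the single-space assumption can be relaxed by using `\s+` instead of a literal space. This is a sketch under that assumption, not part of the original answer; `FLEXIBLE`, `STRICT`, and the irregular sample string are made up for illustration.

```python
import re

# Strict version from the answer: exactly one space between tokens.
STRICT = re.compile(r"(\w+_(?:NN|NNS) \w+_(?:VBG|VVN) ,_,)")
# Relaxed version: any run of whitespace (spaces, tabs, newlines).
FLEXIBLE = re.compile(r"(\w+_(?:NN|NNS)\s+\w+_(?:VBG|VVN)\s+,_,)")

single_spaced = "Arms_NNS folded_VVN ,_,"
irregular = "Arms_NNS  folded_VVN\t,_,"  # two spaces, then a tab

strict_hits = STRICT.findall(irregular)
flexible_hits = FLEXIBLE.findall(irregular)

print(strict_hits)
print(flexible_hits)
print(FLEXIBLE.findall(single_spaced))
```

The strict pattern returns an empty list for the irregular string, while the relaxed one matches both inputs.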