
How to match names separated by "and" excluding "and" itself using regex?


I am trying to solve http://play.inginf.units.it/#/level/10

I have some strings as follows:

title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},

title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},

title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},

I need to match each author name on its own (for example, "Diaz, Navarro David" and "Gines, Rodriguez Noe" in the first entry). I tried the following regex:

(?<=author={).+(?=})

But it matches the entire string inside {}. I understand why that is, but how can I break the match at each "and"?


Solution

  • It took me a little while to get the samples to show up in your link. What about:

    (?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
    



    • (?:^\s*author={|\G(?!^) and ) - Either match the start of a line, 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match the literal ' and ';
    • \K - Reset the starting point of the reported match;
    • (?:(?! and |},).)+ - Match 1+ characters, each one only if neither ' and ' nor '},' starts at that position.
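
The chaining logic described in the bullets above can be sketched in JavaScript, which has neither \G nor \K. This is only an approximation under that assumption: the sticky flag (y) stands in for \G, and the helper name chainedNames and the three regex variables are made up for illustration, not part of the original pattern.

```javascript
// Hypothetical emulation of the \G / \K pattern; names are illustrative only.
const authorStart = /^\s*author={/gm;     // where a chain of names begins
const nameToken = /(?:(?! and |},).)+/y;  // tempered token: stop before " and " or "},"
const separator = / and /y;               // continuation between two names

function chainedNames(text) {
  const names = [];
  for (const start of text.matchAll(authorStart)) {
    // Begin right after "author={", like \K would report.
    let pos = start.index + start[0].length;
    while (true) {
      nameToken.lastIndex = pos;
      const name = nameToken.exec(text);   // sticky: must match exactly at pos
      if (name === null) break;
      names.push(name[0]);
      separator.lastIndex = pos + name[0].length;
      if (separator.exec(text) === null) break; // no " and " follows: chain ends
      pos = separator.lastIndex;           // continue where the separator ended
    }
  }
  return names;
}

const sample = [
  "title={Fuel cells and their applications},",
  "author={Kordesch, Karl and Simader, G{\"u}nter and Wiley, John},",
  "volume={117},",
].join("\n");

console.log(chainedNames(sample));
```

Because a chain can only start at 'author={', the " and " inside the title line is never treated as a separator.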

    The first pattern will also match 'others', as in the last sample of the linked test. If you wish to exclude 'others', you could add it to the negative lookahead:

    (?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
    



    In the comments we established that the patterns above do not work on the linked website. Apparently it is JS-based, so \K and \G are unavailable, but variable-width lookbehind is supported. Therefore try:

    (?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
    


    • (?<= - Open lookbehind;
      • \bauthor={ - Match word-boundary and literally 'author={';
      • (?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
    • \b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
    • (?:,? [A-Z]\S*\b)* - A 2nd non-capture group, repeated 0+ times, to keep matching the comma/space separated parts of a name.
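
As a quick check, the lookbehind pattern can be run against the samples from the question in any JS engine with lookbehind support (ES2018 or later). This is a minimal sketch: the helper name extractAuthors and the sample string are assumptions for illustration.

```javascript
// Sample lines taken from the question.
const sample = [
  "title={AUTOMATIC ROCKING DEVICE},",
  "author={Diaz, Navarro David and Gines, Rodriguez Noe},",
  "year={2006},",
  "title={Fuel cells and their applications},",
  "author={Kordesch, Karl and Simader, G{\"u}nter and Wiley, John},",
].join("\n");

// The pattern from the answer, with the global flag so every name is found.
const namePattern = /(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*/g;

// Hypothetical helper: returns every matched name as a plain string.
function extractAuthors(text) {
  return Array.from(text.matchAll(namePattern), (m) => m[0]);
}

console.log(extractAuthors(sample));
// → [ 'Diaz, Navarro David', 'Gines, Rodriguez Noe',
//     'Kordesch, Karl', 'Simader, G{"u}nter', 'Wiley, John' ]
```

Since '.' does not match a newline without the s flag, the lookbehind can only succeed on a line that contains 'author={', which is why the capitalised words in the title lines are not reported.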