regex elixir pcre

Regex to match 1 or 2 occurrences


I have text with the following structure:

book_name:SoftwareEngineering;author:John;author:Smith; book_name:DesignPatterns;author:Foo;author:Bar;

The element separator is ;

Each book_name element is followed by one or two author elements

There can be 2 to 10 books

Each book has at least one author and at most two authors

I would like to extract the book_name and the individual authors for every book.

I tried a regex with Regex.scan (which collects all matches):

iex> regex = ~r/book_name:(.+?;)(author:.+?;){1,2}/
iex> text = "book_name:SoftwareEngineering;author:John;author:Smith;book_name:DesignPatterns;author:Foo;author:Bar;"

iex> Regex.scan(regex, text, capture: :all_but_first)
[["SoftwareEngineering;", "author:Smith;"], ["DesignPatterns;", "author:Bar;"]]

But it doesn't collect the authors correctly: it returns only the second author of each book. Can anybody help with this problem?


Solution

  • The part (author:.+?;){1,2} of the pattern matches an author element, up to and including its semicolon, once or twice. However, a repeated capturing group keeps only the value of its last iteration, which is why only the second author is returned.

    Instead of the non-greedy quantifier .+? you could use a negated character class, [^;]+, which matches one or more characters other than a semicolon.

    You could also capture the literal author in its own group and reuse it through a backreference. The book name is then in capturing group 1, the literal author in group 2, the first author's name in group 3, and the optional second author's name in group 4.

    book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?
    

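    The pattern breaks down as follows:

    • book_name: Match literally
    • ([^;]+); Group 1: one or more characters other than ;, followed by a literal ;
    • (author): Group 2: the literal author, followed by :
    • ([^;]+); Group 3: one or more characters other than ;, followed by a literal ;
    • (?: Open a non-capturing group
      • \2: Backreference to what was captured in group 2 (author), followed by :
      • ([^;]+); Group 4: one or more characters other than ;, followed by a literal ;
    • )? Close the non-capturing group and make it optional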

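As a minimal sketch, here is how the pattern above could be used with Regex.scan on the sample text from the question. The variable names (books, name, authors) are illustrative and not part of the original answer. The Enum.reject step is a defensive assumption: it discards an empty string in case the optional group 4 is reported as empty for a book with a single author.

```elixir
# Pattern from the answer: group 1 = book name, group 2 = the literal "author",
# group 3 = first author, group 4 = optional second author.
regex = ~r/book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?/

text =
  "book_name:SoftwareEngineering;author:John;author:Smith;" <>
    "book_name:DesignPatterns;author:Foo;author:Bar;"

books =
  for [name, _label | authors] <- Regex.scan(regex, text, capture: :all_but_first) do
    # Drop an empty entry in case the optional group 4 did not take part in the match.
    {name, Enum.reject(authors, &(&1 == ""))}
  end

IO.inspect(books)
# [{"SoftwareEngineering", ["John", "Smith"]}, {"DesignPatterns", ["Foo", "Bar"]}]
```

Because each author now has its own capturing group rather than sharing one repeated group, both authors of a book are kept.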