
Split string in Python using two conditions (one delimiter and one "contain")


Consider the following string:

my_text = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

I want to extract the book titles and authors, so the expected output is:

output = [
    ['Harry potter', 'JK Rowling'],
    ['Dune (first book)', 'Frank Herbert'],
    ['and Le Petit Prince', 'Antoine de Saint Exupery']
]

The basic 2-step approach would be:

  • Use re.split to split on a list of punctuation and newline characters ((),;\n etc.) to extract sentences, or at least pieces of sentences.
  • Keep only strings containing 'by' and use split again on 'by' to separate title and author.

While this method would cover most cases, the main issue is the handling of parentheses (): I want to keep them in book titles (like Dune), but treat them as delimiters after author names (like Saint Exupery).
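
To make the parenthesis problem concrete, here is a minimal sketch of the two-step approach. The function name naive_extract and the exact delimiter set are assumptions chosen for illustration, not part of the original question.

```python
import re

sample = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

def naive_extract(text):
    # Step 1: split on punctuation and newlines, parentheses included.
    pieces = re.split(r"[(),;\n]", text)
    # Step 2: keep only pieces containing " by " and split them once on it.
    pairs = []
    for piece in pieces:
        if " by " in piece:
            title, author = piece.split(" by ", 1)
            pairs.append([title.strip(), author.strip()])
    return pairs

result = naive_extract(sample)
print(result)
# [['Harry potter', 'JK Rowling'], ['', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

The trailing remark after Saint Exupery is dropped as intended, but the Dune title is lost: splitting on "(" and ")" leaves " by Frank Herbert" as its own piece, so the title comes back empty.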

I suspect a powerful regex would cover both cases, but I'm not sure how to write it.


Solution

  • I'm not sure if that is "a powerful regex", but it does the job:

    import re
    
    text = """
    My favorites books of all time are:
        Harry potter by JK Rowling,
        Dune (first book) by Frank Herbert;
        and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
    """
    
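    # " *"            skips leading indentation
    # "(.+)"          captures the title; it is greedy, so it extends to the last " by " on the line
    # "((?: ?\w+)+)"  captures the author as space-separated words and stops before
    #                 punctuation such as ",", ";" or " ("
    pattern = r" *(.+) by ((?: ?\w+)+)"
    
    # re.findall returns one (title, author) tuple per match, so no extra loop is needed
    res = re.findall(pattern, text)
    
    print(res) # [('Harry potter', 'JK Rowling'), ('Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery')]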
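
If the result should be a list of lists, as in the question's expected output, the same idea can be sketched with named groups and re.VERBOSE. The names BOOK_PATTERN and extract_books are hypothetical, and the author group is written as \w+(?: \w+)* rather than (?: ?\w+)+, which is a small variation on the pattern above, not the original one.

```python
import re

# In verbose mode unescaped whitespace is ignored, so literal spaces are written as "[ ]".
BOOK_PATTERN = re.compile(
    r"""
    [ ]*                          # leading indentation
    (?P<title>.+)                 # title (greedy: extends to the last " by " on the line)
    [ ]by[ ]                      # separator between title and author
    (?P<author>\w+(?:[ ]\w+)*)    # author: words joined by single spaces
    """,
    re.VERBOSE,
)

def extract_books(text):
    """Return [title, author] pairs found in text (illustrative helper)."""
    return [
        [m.group("title"), m.group("author")]
        for m in BOOK_PATTERN.finditer(text)
    ]

text = """
My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

books = extract_books(text)
print(books)
# [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

One limitation to keep in mind: \w does not match periods or hyphens, so an author written as "J.K. Rowling" or "Saint-Exupery" would be cut at the first such character. The author group would need to be widened for those inputs.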