Consider the following string:
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
I want to extract the book titles and their authors, so the expected output is:
output = [
['Harry Potter', 'JK Rowling'],
['Dune (first book)', 'Frank Herbert'],
['and Le Petit Prince', 'Antoine de Saint Exupery']
]
The basic 2-step approach would be:
While this method would cover 90% of cases, the main issue is the handling of parentheses (): I want to keep them in book titles (as in Dune), but treat them as delimiters after author names (as with Saint Exupery).
I suspect a powerful regex would cover both cases, but I'm not sure how to write it.
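To make the parentheses problem concrete, here is a minimal sketch of a two-step approach. The steps themselves are not spelled out above, so this is only one assumed reading of them: split the text at line breaks and delimiter characters (parentheses included), then split each piece on " by ". The function name `naive_extract` is hypothetical.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

def naive_extract(text):
    pairs = []
    # Step 1 (assumed): cut the text at line breaks and delimiters,
    # treating parentheses as delimiters too.
    for chunk in re.split(r"[\n,;()]", text):
        # Step 2 (assumed): split each chunk into title and author around " by ".
        if " by " in chunk:
            title, author = chunk.split(" by ", 1)
            pairs.append([title.strip(), author.strip()])
    return pairs

naive_pairs = naive_extract(my_text)
print(naive_pairs)
# [['Harry potter', 'JK Rowling'], ['', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

Under these assumptions the trailing remark after Saint Exupery is dropped as intended, but the Dune title is lost because "(first book)" is cut away before the " by " split ever sees it.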
I'm not sure whether this counts as "a powerful regex", but it does the job:
import re
text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
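# Optional leading spaces, then the title (greedy, so it runs up to the last " by " on the line),
# then the author as a run of space-separated words; \w+ stops at "(", "," or ";".
pattern = r" *(.+) by ((?: ?\w+)+)"
matches = re.findall(pattern, text)
# findall returns one (title, author) tuple per match because the pattern has two groups.
res = [(title, author) for title, author in matches]
print(res) # [('Harry potter', 'JK Rowling'), ('Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery')]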
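For readability, the same idea can be written with `re.VERBOSE` and named groups. This is only a sketch of an equivalent pattern, not a different technique; the group names `title` and `author` are my own choice, and the result is built as a list of lists to match the shape asked for in the question.

```python
import re

text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

pattern = re.compile(
    r"""
    [ ]*                          # skip any leading spaces
    (?P<title>.+)                 # title: greedy, so parentheses in it are kept
    [ ]by[ ]                      # literal separator between title and author
    (?P<author>\w+(?:[ ]\w+)*)    # author: words separated by single spaces
    """,
    re.VERBOSE,
)

# One [title, author] list per match, in order of appearance.
pairs = [[m["title"], m["author"]] for m in pattern.finditer(text)]
print(pairs)
# [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

The author group ends at the first character that is neither a word character nor a single space followed by one, which is why " (I read it many times)" is left out. The same rule means names containing a hyphen or a period (for example "Saint-Exupery" or "J.K. Rowling") would be cut short, so the character set would need widening for such input.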