Tags: python, string, split, substring

Split strings that contain more than one substring


I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows the substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

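import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    # A compiled pattern cannot be tested with `in`; use .search() instead.
    if substrings.search(i):
        splitted.append([])
    splitted[-1].append(i)  # appends the whole string, so nothing is split yet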

Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.


Update: names is more complex than I thought. A string follows

  1. the title-like pattern already answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose')
  2. until a second pattern of strings follows ('Mister Kelly, AWS')
  3. until a third pattern of strings follows until the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary')

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary before the next name occurs; they can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.

I created a (hopefully exhaustive) list, specifications, of the text that can follow Secretary:

specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I account for the third pattern using the specifications list?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']
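
One possible way to approach the updated question is sketched below. It is not a definitive implementation. It assumes that every new entry begins with either one of the titles or one of the honorifics 'Mister', 'Miss' and 'Dr.', and that an honorific directly after a title belongs to that title. The honorifics list is an assumption inferred from the sample data, and the names splitter and drop_specification are hypothetical. Because the split happens before the next name, the specification stays attached to 'Secretary'; the optional helper uses the specifications list to drop it. Note that names as given contains 'Dr. Birker, Secretary' without ' of State', so that is what the sketch returns for that entry.

```python
import re

names = [
    'Acquaintance Muller',
    'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
    'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary',
]
titles = ['Vice president', 'Affiliate', 'Acquaintance']
honorifics = ['Mister', 'Miss', 'Dr.']  # ASSUMPTION: inferred from the sample data
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

# A new entry starts at a title or an honorific ...
starters = '|'.join(map(re.escape, titles + honorifics))
# ... unless the honorific directly follows a title ('Acquaintance Dr. Rose').
# One fixed-width negative lookbehind per title, because the re module does
# not accept variable-width lookbehinds.
not_after_title = ''.join(f'(?<!{re.escape(t)})' for t in titles)
splitter = re.compile(rf'{not_after_title}\s+(?=(?:{starters})\s)')

updated = [piece for n in names for piece in splitter.split(n)]

# Optional: reduce 'Secretary <specification>' to plain 'Secretary'.
spec_alternatives = '|'.join(re.escape(s) for s in specifications if s)
spec_pattern = re.compile(rf'(Secretary)(?:{spec_alternatives})$')


def drop_specification(entry):
    """Hypothetical helper: strip a trailing specification after 'Secretary'."""
    return spec_pattern.sub(r'\1', entry)


print(updated)
print(drop_specification('Miss Berg, Secretary for Relations'))
```

Applying drop_specification to every element of updated is only needed if the specifications should be removed rather than kept as in updated_output.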


Solution

  • Try:

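    import re

    names = [
        "acquaintance Muller",
        "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
    ]
    substrings = ["Vice president", "affiliate", "acquaintance"]

    # One alternation that matches any of the substrings literally.
    r = re.compile("|".join(map(re.escape, substrings)))

    out = []
    for n in names:
        # Index at which each substring occurrence begins.
        starts = [m.start() for m in r.finditer(n)]

        # No substring found: keep the string unchanged.
        if not starts:
            out.append(n)
            continue

        # Keep any text that precedes the first substring.
        if starts[0] != 0:
            starts = [0, *starts]

        # Slice between consecutive start positions, up to the end of the string.
        starts.append(len(n))
        for a, b in zip(starts, starts[1:]):
            out.append(n[a:b])

    print(out)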
    

    Prints:

    ['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
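
The slices printed above keep the space that precedes the next substring. A minimal alternative sketch, assuming the capitalised substrings from the question, splits on that whitespace with re.split and a lookahead, so no trailing spaces remain. The names alternatives and splitter are illustrative only.

```python
import re

names = [
    "Acquaintance Muller",
    "Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose",
]
substrings = ["Vice president", "Affiliate", "Acquaintance"]

# Split on the whitespace that sits directly before any of the substrings.
# The lookahead (?=...) consumes nothing, so the substring itself stays at
# the start of the following piece.
alternatives = "|".join(map(re.escape, substrings))
splitter = re.compile(rf"\s+(?=(?:{alternatives})\b)")

out = [piece for n in names for piece in splitter.split(n)]
print(out)
# ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']
```

Because the split points are defined by where the next substring starts, names such as 'Dr. Rose' that contain a period need no special handling here.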