re.findall: extract strings inside quotes except when the quote is preceded by a number


I have a long dataframe with a column from which I need to extract the strings that occur between quotes. The code below works well for about 90% of the dataframe column; however, there are instances where a quote is used immediately after a number to represent inches. My dilemma is that when this occurs, it throws off the resulting list.

Similar questions have been asked on Stack Overflow, which helped me get this far and introduced me to re; however, the double quote following the number has posed a new challenge for me.

Reading through the re documentation, I believe there is a way to exclude patterns. How might I match the pattern below while ignoring a quote that directly follows a digit, i.e. a [0-9]" pattern?

import re

string = 'Res: Ext="" Tl=" PLIERS, 7"" Ma="337" Ml="4" Ms="A" N="" Mfpn="" M="PRCH" loc=""'

# This line works about 90% of the time, but not in cases like this one, because of the "" following the 7 above

print(re.findall('"([^"]*)"', string))

# prints --> ['', ' PLIERS, 7', ' Ma=', ' Ml=', ' Ms=', ' N=', ' Mfpn=', ' M=', ' loc=']
# goal --> ['', ' PLIERS, 7"', '337', '4', 'A', '', '', 'PRCH', '']

Solution

  • Add an optional extra quote at the end of the capture group, so a value can close with either one or two quotes:

    re.findall('"([^"]*"?)"', string)
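To check the pattern above against the sample string from the question, here is a minimal runnable sketch. The second pattern, `inch_mark`, is not part of the original answer: it is a hypothetical lookaround variant, closer to the "ignore [0-9]" idea in the question, that treats a quote as content only when it follows a digit and is itself followed by another quote. The names `optional_quote`, `inch_mark`, and `goal` are made up for illustration.

```python
import re

# Sample string from the question; the doubled quote after the 7 marks inches.
string = 'Res: Ext="" Tl=" PLIERS, 7"" Ma="337" Ml="4" Ms="A" N="" Mfpn="" M="PRCH" loc=""'

# Pattern from the answer: the "? lets the captured value end with one extra quote.
optional_quote = re.compile(r'"([^"]*"?)"')

# Hypothetical alternative (an assumption, not from the answer): inside a value,
# accept a quote only if it follows a digit and the next character is a quote.
inch_mark = re.compile(r'"((?:[^"]|(?<=\d)"(?="))*)"')

# Desired result as stated in the question.
goal = ['', ' PLIERS, 7"', '337', '4', 'A', '', '', 'PRCH', '']

result_optional = optional_quote.findall(string)
result_inch = inch_mark.findall(string)

print(result_optional == goal)
print(result_inch == goal)
```

Both comparisons should print True for this sample. The two patterns can disagree on other inputs: the answer's pattern accepts a trailing extra quote after any character, while the lookaround variant only accepts one after a digit, so it is worth testing either against the rows that currently fail.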