Python Regex for Securities


I have a text file that contains security names, dollar amounts, and percentages of the portfolio. I'm trying to figure out how to separate the companies using regex. My original solution was to .split('%') and then create the 3 variables I needed, but I discovered that some of the securities contain % in their names, so that approach was inadequate.

String example:

Pinterest, Inc. Series F, 8.00%$24,808,9320.022%ResMed,Inc.$23,495,3260.021%Eaton Corp. PLC$53,087,8430.047%

Current regex

[a-zA-Z0-9,$.\s]+[.0-9%]$

My current regex only finds the last company, for example: Eaton Corp. PLC$53,087,8430.047%

Any help on how I can find every single instance of a company?

Solution desired

["Pinterest, Inc. Series F, 8.00%$24,808,9320.022%","ResMed,Inc.$23,495,3260.021%","Eaton Corp. PLC$53,087,8430.047%"]

Solution

  • In Python 3:

    import re
    p = re.compile(r'[^$]+\$[^%]+%')
    p.findall('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%ResMed,Inc.$23,495,3260.021%Eaton Corp. PLC$53,087,8430.047%')
    

    Result:

    ['Pinterest, Inc. Series F, 8.00%$24,808,9320.022%', 'ResMed,Inc.$23,495,3260.021%', 'Eaton Corp. PLC$53,087,8430.047%']
    

    Your initial issue was that the $ anchor made the regex only match at the end of the line. However, removing the $ still split Pinterest into two entries at the % after 8.00.

    To fix that, the regex looks for a $, then a % after that, and takes everything up through the % as an entry. That pattern works for the examples you gave, but, of course, I can't know if it holds true for all your data.

    Edit: The regex works like this:

    r'               Use a raw string so you don't have to double the backslashes
      [^$]+          Look for anything up to the next $
           \$        Match the $ itself (\$ because a bare $ is the end-of-string anchor)
             [^%]+   Now anything up to the next %
                  %  And the % itself
                   ' End of the string literal
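
Since the question mentions needing three variables per security, the same idea can be extended with capture groups. The following is only a sketch under assumptions that the sample string suggests but does not guarantee: the dollar amount is always a whole number with comma-grouped thousands, and the percentage always contains a decimal point. The names `ENTRY`, `parse_holdings`, and `sample` are made up for illustration.

```python
import re

# Hypothetical pattern: name, dollar amount, and percentage as separate groups.
# Assumes comma-grouped whole-dollar amounts and a decimal percentage.
ENTRY = re.compile(
    r'([^$]+)'                  # security name: everything before the next $
    r'\$(\d{1,3}(?:,\d{3})*)'   # amount: 1-3 digits, then groups of ",ddd"
    r'(\d+\.\d+)%'              # percentage: decimal number before the %
)

def parse_holdings(text):
    """Return a list of (name, amount, percent) tuples found in text."""
    rows = []
    for name, amount, pct in ENTRY.findall(text):
        # Strip thousands separators before converting the amount to int.
        rows.append((name.strip(), int(amount.replace(',', '')), float(pct)))
    return rows

sample = ('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%'
          'ResMed,Inc.$23,495,3260.021%Eaton Corp. PLC$53,087,8430.047%')

for row in parse_holdings(sample):
    print(row)
```

The comma grouping is what lets the pattern tell where the amount ends and the percentage begins, because the two numbers run together in the text. If an amount were written without commas, or a percentage without a decimal point, the boundary would be ambiguous and this sketch could split the digits in the wrong place.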