Search code examples
pythonpandasdataframestrsplit

can you capture multiple substrings in a Panda dataframe column using .str.split() in one step


I have a database with a column in a Pandas dataframe that contains a string that has two substrings (a string & a number) I extract. I do it with str.split(), but not in a very pythonic manner because split, for me at least, returns the first item of the split to the database. And I of course can assign the result to a separate dataframe complete the splits and bring it back.

I am sure there is a better way to do this, but as of yet I have not found it.

Here is an illustrative dataframe:

df = pd.DataFrame ({'d': {71: '2022-01-03', 72: '2022-01-04', 73: '2022-01-06', 74: '2022-01-07', 75: '2022-01-07'},
                    'e': {71: 'MDU(US5526901096) Cash Dividend USD 0.2175 per Share (Ordinary Dividend)',
                          72: 'KMB(US4943681035) Cash Dividend USD 1.14 per Share (Ordinary Dividend)',
                          73: 'IRM(US46284V1017) Cash Dividend USD 0.6185 per Share (Ordinary Dividend)',
                          74: 'MRK(US58933Y1055) Cash Dividend USD 0.69 per Share (Ordinary Dividend)',
                          75: 'PEP(US7134481081) Cash Dividend USD 1.075 per Share (Ordinary Dividend)'}, 
                    'f': {71: '108.75', 72: '1368', 73: '556.65', 74: '345', 75: '537.5'}})
df

Basically from this dataframe I want to extract the stock symbol and the dividend amount without losing the other columns of 'd' and 'f'.

From the string in column 'e' I extract the symbol of the stock at the beginningof the string and then the current dividend from the middle.

df_to_process['e'] = df_to_process['e'].str.split('\(', expand=True) provides me with the stock symbol.

To get the dividend amount I copy column 'e' data in the dataframe to a column I create, for purposes here 'gg'. I then split it. Since the dividend amount is in the middle of my split, the column I created gets the number I want.

df_to_process[['aa','gg','cc']]  = df_to_process['gg'].str.split('USD|per', expand=True)
or
y = df['gg'].str.split('USD|per', expand=True)

            0                   1                    2
71  MDU(US5526901096) Cash Dividend 0.2175  Share (Ordinary Dividend)
72  KMB(US4943681035) Cash Dividend 1.14    Share (Ordinary Dividend)
73  IRM(US46284V1017) Cash Dividend 0.6185  Share (Ordinary Dividend)
74  MRK(US58933Y1055) Cash Dividend 0.69    Share (Ordinary Dividend)
75  PEP(US7134481081) Cash Dividend 1.075   Share (Ordinary Dividend)

I am not very familiar with Regex and am turning to that next. But is there a way to make this work with split?


Solution

  • This is not optimized in any way, and it makes a few assumptions, but I tried to highlight them below.

    import pandas as pd
    import re
    
    def parse_row(row) :
      match = re.search('^([a-zA-Z]+).*USD (\d+\.?\d*) per', row, re.I)
      return pd.Series([match.group(1), match.group(2)])
    
    df[['symbol', 'qrtrly_div_total']] = df['e'].apply( parse_row )
    
    >>> df
                 d                                                  e       f symbol qrtrly_div_total
    71  2022-01-03  MDU(US5526901096) Cash Dividend USD 0.2175 per...  108.75    MDU           0.2175
    72  2022-01-04  KMB(US4943681035) Cash Dividend USD 1.14 per S...    1368    KMB             1.14
    73  2022-01-06  IRM(US46284V1017) Cash Dividend USD 0.6185 per...  556.65    IRM           0.6185
    74  2022-01-07  MRK(US58933Y1055) Cash Dividend USD 0.69 per S...     345    MRK             0.69
    75  2022-01-07  PEP(US7134481081) Cash Dividend USD 1.075 per ...   537.5    PEP            1.075
    

    As you mentioned, regex would allow you to capture both elements and assign them to new columns. python3 doc for regex

    If column E is not consistent, you'll need to modify the regex, but here the elements in the one above.

    '^([a-zA-Z]+).*USD (\d+\.?\d*) per'

    ^ indicates to start at the beginning

    [a-zA-Z]+ will check for all letters at the beginning until a letter is not met (the assumption is the stock will be at least 1 letter in length, and the match will stop when meeting the (

    .*USD matches all content until we come to USD

    \d+.?\d* Should match a number, with or without the decimal, and if decimal is present, then all numbers after it

    per at the end is to help isolate the number match between "USD xx.xx per"

    If optimization is required, such as for large datasets, pd.apply is not the ideal function to use, and I would recommend looking into vectorization techniques.

    Edit: If you wanted to name the matches, you could do this.

    match = re.search('^(?P<symbol>[a-zA-Z]+).*USD (?P<value>\d+\.?\d*) per', row, re.I) 
    match.group('symbol') 
    match.group('value')
    

    The (?P<name>...) allows you to name the group.