python-3.x pandas dataframe string-matching string-parsing

Parse dataframe column for substring and return value

I have dataframe with a column called 'INFO' containing a string delimited by a semicolon like below. I would like to parse the string to return values corresponding to specific substrings and assign it to a new column.

EX. In the table below, I would want to assign the value (after the equals sign) for 'CLNDISDB' to a column of the same name.

I have tried:

df['INFO'].str.split(';',expand=True)

which assigned each string part to a new column but that would still require me to parse each column for the necessary string. Any help would be greatly appreciated.

I can only use python standard libraries


|    | INFO                                              |
|----|---------------------------------------------------|
| 0  | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1  | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2  | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3  | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4  | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |
| 5  | AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;... |
| 6  | ALLELEID=556509;CLNDISDB=MedGen:C4015293,OMIM:... |
| 7  | ALLELEID=556512;CLNDISDB=MedGen:C4015293,OMIM:... |
| 8  | ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:... |
| 9  | AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe... |
| 10 | AF_ESP=0.00038;AF_EXAC=0.00036;AF_TGP=0.00060;... |
| 11 | AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;... |
| 12 | AF_ESP=0.00070;AF_TGP=0.00080;ALLELEID=446981;... |
| 13 | AF_EXAC=0.00038;ALLELEID=446941;CLNDISDB=MedGe... |
| 14 | AF_EXAC=0.00114;AF_TGP=0.00958;ALLELEID=364282... |
| 15 | ALLELEID=556516;CLNDISDB=MedGen:C3808739,OMIM:... |
| 16 | AF_EXAC=0.00024;ALLELEID=364148;CLNDISDB=MedGe... |
| 17 | ALLELEID=514900;CLNDISDB=MedGen:C3808739,OMIM:... |
| 18 | AF_EXAC=0.42418;AF_TGP=0.28255;ALLELEID=133759... |
| 19 | ALLELEID=364285;CLNDISDB=MedGen:CN169374;CLNDN... |

Solution

This was fun, so setting up the dataframe

import re
import pandas as pd
import numpy as np

test={'INFO':['AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759', 'foo','AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759']}
df=pd.DataFrame(test)

we get

    INFO
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759
1   foo
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759

Now using a lambda function and a conditional we just extract the values for "AF_TGP". Setting up a new column, we extract the number from the regex pattern that surrounds the value. But we only run the regex when it will pass the text, using the in function in python.

df['a'] = df['INFO'].apply(lambda x: re.search(r'AF_TGP=(?P<number>[+-]?\d+\.\d+);', str(x)).group('number') if 'AF_TGP' in str(x) else np.nan)

this gives the result

    INFO                                            a
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759 -0.28255
1   foo                                             NaN
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759 0.28255