Search code examples
python-3.xpandasdataframestring-matchingstring-parsing

Parse dataframe column for substring and return value


I have dataframe with a column called 'INFO' containing a string delimited by a semicolon like below. I would like to parse the string to return values corresponding to specific substrings and assign it to a new column.

EX. In the table below, I would want to assign the value (after the equals sign) for 'CLNDISDB' to a column of the same name.

I have tried:

df['INFO'].str.split(';',expand=True)

which assigned each string part to a new column but that would still require me to parse each column for the necessary string. Any help would be greatly appreciated.

I can only use python standard libraries


|    | INFO                                              |
|----|---------------------------------------------------|
| 0  | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1  | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2  | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3  | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4  | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |
| 5  | AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;... |
| 6  | ALLELEID=556509;CLNDISDB=MedGen:C4015293,OMIM:... |
| 7  | ALLELEID=556512;CLNDISDB=MedGen:C4015293,OMIM:... |
| 8  | ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:... |
| 9  | AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe... |
| 10 | AF_ESP=0.00038;AF_EXAC=0.00036;AF_TGP=0.00060;... |
| 11 | AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;... |
| 12 | AF_ESP=0.00070;AF_TGP=0.00080;ALLELEID=446981;... |
| 13 | AF_EXAC=0.00038;ALLELEID=446941;CLNDISDB=MedGe... |
| 14 | AF_EXAC=0.00114;AF_TGP=0.00958;ALLELEID=364282... |
| 15 | ALLELEID=556516;CLNDISDB=MedGen:C3808739,OMIM:... |
| 16 | AF_EXAC=0.00024;ALLELEID=364148;CLNDISDB=MedGe... |
| 17 | ALLELEID=514900;CLNDISDB=MedGen:C3808739,OMIM:... |
| 18 | AF_EXAC=0.42418;AF_TGP=0.28255;ALLELEID=133759... |
| 19 | ALLELEID=364285;CLNDISDB=MedGen:CN169374;CLNDN... |


Solution

  • This was fun, so setting up the dataframe

    import re
    import pandas as pd
    import numpy as np
    
    test={'INFO':['AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759', 'foo','AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759']}
    df=pd.DataFrame(test)
    

    we get

        INFO
    0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759
    1   foo
    2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759
    

    Now using a lambda function and a conditional we just extract the values for "AF_TGP". Setting up a new column, we extract the number from the regex pattern that surrounds the value. But we only run the regex when it will pass the text, using the in function in python.

    df['a'] = df['INFO'].apply(lambda x: re.search(r'AF_TGP=(?P<number>[+-]?\d+\.\d+);', str(x)).group('number') if 'AF_TGP' in str(x) else np.nan)
    

    this gives the result

        INFO                                            a
    0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759 -0.28255
    1   foo                                             NaN
    2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759 0.28255