Search code examples
pythonfindsubstr

python substrn cells in a column dataframe


I have this data frame with this kind of column:

hmtl
<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>

I need to clean this up and leave from "DCG_" up to where "</div>" begins:

DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG

Most of the cells in this column vary where the "DCG_" is located as well as the "</div>", I'm trying to use the following code line for this:

df['html'] = df['html'].str[df['html'.str.find('DCG_':]

but it just returns all null


Solution

  • Use pd.Series.str.extract, where you specify a regular expression and extract anything in any capture groups in the first match:

    >>> df['extracted'] = df['html'].str.extract("(DCG_.*?)</div>")
    >>> df.to_dict()
    

    which gives:

    
    {'html': {0: '<div data-wrapper="true" style="font-size:9pt;font-family:\'Segoe UI\',\'Helvetica Neue\',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>'},
     'extracted': {0: 'DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG'}}
    

    Regex explanation Try it online:

    (DCG_.*?)</div>
    (       )        : Capturing group
     DCG_            : Literally DCG_
         .*?         : Zero or more of any character, lazy match
             </div>  : Literally </div>