
str.split by regex (complex pattern)


How do I split the ID from the annotation using a regex in the data frame below?

df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})

The columns are supposed to look like this:

df_new=pd.DataFrame({"id":["SS50377_28860"],"header":["All-trans-retinol 13,14-reductase"]})

The following code doesn't work properly.

df.join(df["header"].str.split(r'\d+', 0, expand=True))

Thanks in advance!!


Solution

  • You can split on one or more whitespace characters that sit between a digit and an uppercase letter:

    df[['id','header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)
    

    Or, you may capture the ID pattern into one group and the rest into another:

    df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)
    

    Or, you may simply use Series.str.split on the first run of whitespace:

    df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)
    

    Output:

    >>> df
                                  header             id
    0  All-trans-retinol 13,14-reductase  SS50377_28860
    

    Details:

    • (?<=\d)\s+(?=[A-Z]) - matches one or more whitespace characters (\s+) that are immediately preceded by a digit ((?<=\d)) and immediately followed by an uppercase ASCII letter ((?=[A-Z]))
    • ^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string (^), then captures one or more uppercase ASCII letters or digits, an underscore, and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespace characters (\s+), and finally captures the rest of the line into Group 2 (with (.*)).
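
To see what the two patterns described above actually match, it can help to try them on a single string with Python's built-in re module before applying them to a whole column. This is only a minimal sketch: the variable names (header, parts, seq_id, annotation) are made up for illustration, and the sample string is the one from the question.

```python
import re

header = "SS50377_28860 All-trans-retinol 13,14-reductase"

# Lookaround split: only the whitespace itself is consumed; the digit before it
# and the uppercase letter after it are checked but stay in the output.
parts = re.split(r"(?<=\d)\s+(?=[A-Z])", header, maxsplit=1)
print(parts)

# Capture groups: Group 1 is the ID, Group 2 is the rest of the line.
m = re.match(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", header)
seq_id, annotation = m.groups()
print(seq_id, "|", annotation)
```

Both approaches should yield the same two pieces for this input; they only start to differ once the headers deviate from the expected shape.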

    Which solution you choose depends on how varied your input is and how much validation you want to apply here.
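
As a rough illustration of that trade-off, the sketch below runs the three approaches on a few headers. The second and third rows are hypothetical inputs invented for this example (an annotation starting with a lowercase letter, and a header with no annotation at all); they are not from the question, and the column names assigned here are likewise just for illustration.

```python
import pandas as pd

# First row is from the question; the other two are made-up edge cases.
df = pd.DataFrame({"header": [
    "SS50377_28860 All-trans-retinol 13,14-reductase",
    "SS50377_20001 hypothetical protein",   # annotation starts lowercase
    "SS50377_20002",                        # no annotation
]})

# Lookaround split: needs digit + whitespace + uppercase letter,
# so the lowercase row is left unsplit.
lookaround = df["header"].str.split(r"(?<=\d)\s+(?=[A-Z])", n=1, expand=True)
lookaround.columns = ["id", "annotation"]

# Strict extract: rows that do not match the whole pattern become NaN in both columns.
strict = df["header"].str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True)
strict.columns = ["id", "annotation"]

# Loose split: always keeps the first token as the ID, with no validation.
loose = df["header"].str.split(r"\s+", n=1, expand=True)
loose.columns = ["id", "annotation"]

print(lookaround, strict, loose, sep="\n\n")
```

With these inputs, the extract version drops the ID of the annotation-less row (it becomes NaN), while the plain whitespace split keeps it and only leaves the annotation empty. Which of those behaviours is preferable depends on whether a malformed header should be flagged or tolerated.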