Cut within a pattern using Python regex


Objective: I am trying to cut a string with a Python regex, but re.split doesn't quite do what I want. I need to cut inside a matched pattern, between two of its characters.

What I am looking for:

I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.

Pattern: CDE|FG

String: ABCDEFGHIJKLMNOCDEFGZYPE

Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

What I have tried:

It seems like using split with parentheses is close, but it doesn't keep the search pattern attached to the results like I need it to.

re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')

Gives,

['AB', '', 'HIJKLMNO', '', 'ZYPE']

When I actually need,

['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

Motivation:

I am practicing with RegEx and wanted to see if I could write a script that predicts the fragments of a protein digestion using specific proteases.


Solution

  • A non-regex way is to replace each occurrence of the pattern with its piped form and then split on the pipe. This assumes the string itself never contains '|'.

    >>> pattern = 'CDE|FG'
    >>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
    >>> s.replace('CDEFG',pattern).split('|')
    ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
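
  • If a regex is still wanted, one possible sketch is to split on a zero-width position using a lookbehind and a lookahead, so no characters are consumed. The helper name `cut_at` is hypothetical and not from the original answer. It assumes the pattern contains exactly one '|' marker, that both sides are literal text, and that Python 3.7 or later is used (earlier versions do not split on empty matches).

```python
import re

def cut_at(pattern, s):
    """Split s at the '|' position inside each occurrence of pattern.

    Hypothetical helper: assumes pattern has exactly one '|' and that
    both sides are literal strings, not regex syntax.
    """
    left, right = pattern.split('|')
    # Zero-width match: the position preceded by `left` and followed by `right`.
    cut = re.compile('(?<=%s)(?=%s)' % (re.escape(left), re.escape(right)))
    # Splitting on an empty match requires Python 3.7+.
    return cut.split(s)

print(cut_at('CDE|FG', 'ABCDEFGHIJKLMNOCDEFGZYPE'))
# ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
```

    Because the lookarounds consume nothing, the pattern text stays attached to the fragments on either side of the cut. Unlike the replace approach, this also cuts at overlapping occurrences of the pattern.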