
How do I extract the letter a fixed number of positions before a specific pattern in a DataFrame column in Python?


I have a column in a DataFrame that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:

    import pandas as pd

    # One row per gene: [gene name, sequence with a single capital letter]
    d = [
        ['ampC', 'tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'],
        ['yifL', 'acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'],
        ['glyW', 'tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg'],
    ]
    df = pd.DataFrame(d, columns=['gene', 'Sequence'])

    gene  Sequence
    ampC  tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc
    yifL  acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat
    glyW  tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg

  1. Extract the capital letter and everything before it. With str.extract(r"(.*?)[A-Z]+", expand=True) I can get everything before the capital letter, but I can't figure out how to include the capital letter itself.

Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA

  2. Extract the 16th letter before the capital letter.

Example of what I'm trying to get for the following 3 genes:

gene letter
ampC c
yifL g
glyW t

[c, g, t]


Solution

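  • Your regular expression is almost what you need. Move [A-Z] inside the capture group so the match includes the capital letter. In that substring the capital is at index -1, so the 16th letter before it is at index -17. Try with: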

    # Everything up to and including the first capital letter
    df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
    # Index -1 is the capital, so -17 is the 16th letter before it
    df["letter"] = df["substring"].str[-17]

    >>> df[["gene", "letter"]]
       gene letter
    0  ampC      c
    1  yifL      g
    2  glyW      t
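
If you only need the letter and not the substring, a single pattern can capture it directly. The following is a minimal sketch, not part of the original answer: it uses short made-up sequences (not the data from the question) and assumes the letters before the capital are all lowercase. It also shows that both approaches give NaN when there are fewer than 16 letters before the capital, or no capital at all.

```python
import pandas as pd

# Made-up test sequences for illustration only (not the question's data).
df = pd.DataFrame(
    {
        "gene": ["g1", "g2", "g3"],
        "Sequence": [
            "ttc" + "a" * 15 + "Gtt",  # 16th letter before 'G' is 'c'
            "acgtAcc",                 # fewer than 16 letters before the capital
            "acgtacgt",                # no capital letter at all
        ],
    }
)

# Capture one lowercase letter that is followed by exactly 15 more lowercase
# letters and then a capital: the captured letter is the 16th before the capital.
df["letter"] = df["Sequence"].str.extract(r"([a-z])[a-z]{15}[A-Z]")[0]

# The substring-then-index approach from the answer, for comparison.
df["letter_by_index"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0].str[-17]

print(df[["gene", "letter", "letter_by_index"]])
```

Rows without a match come back as NaN in both columns, so a later df.dropna(subset=["letter"]) or fillna call can handle them explicitly if needed.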