python string parsing bioinformatics genbank

How to get a sequence after a word with whitespace


For school I have to parse the text that follows a keyword in a GenBank file. The keyword is followed by a lot of whitespace, and I just can't get it to work.

So for example:

BLA                                                                                                             
      1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
      2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
      3 kahsfkjshakjfhksjhfkskjfkaskfksj

//

This is what I have tried:

if line.startswith("BLA"):

       start = line.find("BLA")
       end = line.find("//")
       line = line[:end]
       s_string = ""
       string = list()
       if s_string:
           string.append(line)


        else:
            line = line.strip()
            my_seq += line

But what I get is:

**output**
BLA

That is the only thing it returns. I want the output to look like this:

**output**
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
    2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
    3 kahsfkjshakjfhksjhfkskjfkaskfksj

I have tried to produce that last output, without success, and I don't know what to do. My teacher told me to do it like this: once BLA has been found (is True), keep iterating over the lines, and stop when you see "//". But when I tried it with that True statement, I got nothing.

I searched online, and the suggestion was to use Biopython's Bio.SeqIO, but the teacher said we can't use that.


Solution

  • Here is my solution:

    lines = """BLA
      1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
      2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
      3 kahsfkjshakjfhksjhfkskjfkaskfksj
    
    //"""
    
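    # Cut the text at the "//" terminator; lines[0] is everything before it
    lines = lines.strip().split("//")
    # Split on the keyword; the text after "BLA" is the second element
    lines = lines[0].split("BLA")
    # Trim surrounding whitespace (the indentation of inner lines is kept)
    lines = [i.strip() for i in lines]
    print("BLA", " ", lines[1])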
    

    Output:

    BLA   1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
          2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
          3 kahsfkjshakjfhksjhfkskjfkaskfksj
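
The teacher's hint from the question (start collecting once BLA has been seen, stop at "//") can also be written as a line-by-line loop with a flag. This is only a sketch under a few assumptions: the function name `extract_block` and its parameters are made up for illustration, the keyword is assumed to sit at the start of its line, and blank lines inside the block are skipped.

```python
def extract_block(lines, start_word="BLA", end_marker="//"):
    """Collect the lines after start_word, up to (not including) end_marker."""
    collected = []
    inside = False  # becomes True once the start word has been seen
    for line in lines:
        stripped = line.strip()
        if not inside:
            if stripped.startswith(start_word):
                inside = True
                # keep anything that follows the start word on the same line
                rest = stripped[len(start_word):].strip()
                if rest:
                    collected.append(rest)
            continue
        if stripped.startswith(end_marker):
            break  # end of the record
        if stripped:  # skip blank lines
            collected.append(stripped)
    return collected


text = """BLA
      1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
      2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
      3 kahsfkjshakjfhksjhfkskjfkaskfksj

//"""

block = extract_block(text.splitlines())

# Print the first entry next to the keyword and align the rest under it
for i, entry in enumerate(block):
    prefix = "BLA " if i == 0 else "    "
    print(prefix + entry)
```

Because the function only iterates over lines, an open file handle can be passed in place of `text.splitlines()`, so the file does not have to be read into one string first.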