Tags: python, regex, text-files

Selecting all lines/strings that fall between patterns in a text file


Given a text file that looks like this when loaded:

>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp

How can I extract the lines that fall between lines containing '>', including the last lines, which are not followed by another '>' line?

For example, the result should look like this:

result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']

I realize my attempt won't work because it's looking for text between each newline and '>'. Running it just gives me empty strings.

import re

def findtext(inputtextfile, start, end):
    try:
        pattern = rf'{start}(.*?){end}'
        return re.findall(pattern, inputtextfile)
    except ValueError:
        return -1

result = findtext(inputtextfile, "\n", ">")
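A minimal sketch of why this attempt returns empty strings, assuming the file content is held in a string called `sample` (a name made up for this illustration). By default `.` does not match a newline, so the only places `\n(.*?)>` can match are where a newline is immediately followed by '>', and the captured group there is empty.

```python
import re

# Hypothetical stand-in for the loaded file content.
sample = (
    ">rice1 1ALBRGHAER\n"
    "NNNNNNNNNNNNNNNNNNNNN\n"
    "NNNNNNNNNNNNNNNNNNNNN\n"
    ">peanuts2 2LAEKaq\n"
    "SSSSSSSSSSS\n"
    ">OIL3 3hkasUGSV\n"
    "ppppppppppppppppppppp\n"
    "ppppppppppppppppppppp"
)

# Without re.DOTALL, '.' cannot cross a newline, so the lazy group can only
# match the empty string between a '\n' and a '>' that directly follows it.
matches = re.findall(r"\n(.*?)>", sample)
print(matches)  # ['', '']
```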

Solution

  • Try splitting on the lines that start with >. That gives you a list of the data between the headers, and you can then remove the \n from each piece to join its lines.

    s = """>rice1 1ALBRGHAER
    NNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNN
    >peanuts2 2LAEKaq
    SSSSSSSSSSS
    >OIL3 3hkasUGSV
    ppppppppppppppppppppp
    ppppppppppppppppppppp"""
    
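    import re

    def findtext(inputtextfile, start, end):
        # Split on every header line (from `start` up to and including `end`),
        # drop the empty chunks, then remove the newlines inside each chunk.
        try:
            chunks = re.split(f'{start}.*{end}', inputtextfile)
            return [chunk.replace('\n', '') for chunk in chunks if chunk]
        except re.error:
            return -1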
    

    Trying it with your provided case:

    findtext(s, '>','\n')
    

    Output:

    ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
     'SSSSSSSSSSS',
     'pppppppppppppppppppppppppppppppppppppppppp']
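If you would rather avoid a regex, the same idea can be sketched as a line-by-line loop. This is only an illustration under some assumptions: `extract_records` is a name made up here, header lines are assumed to start with '>', and the sample is shortened to keep the output readable.

```python
def extract_records(lines):
    """Join the lines that follow each '>' header into one string per record."""
    records = []
    current = []  # data lines collected since the last header
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if it had any data.
            if current:
                records.append("".join(current))
            current = []
        elif line:
            # Note: data lines before the first header are also collected.
            current.append(line)
    # The last record has no following header, so flush it here.
    if current:
        records.append("".join(current))
    return records


# Shortened version of the sample from the question.
sample = """>rice1 1ALBRGHAER
NNN
NNN
>peanuts2 2LAEKaq
SSS
>OIL3 3hkasUGSV
ppp
ppp"""

print(extract_records(sample.splitlines()))  # ['NNNNNN', 'SSS', 'pppppp']
```

Because the function only iterates over its argument, it should also accept an open file object directly, which avoids loading the whole file into one string first.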