
Read a text file into Python, splitting it into list items at a set of characters


I have a plain text file with the following contents:

@M00964: XXXXX
YYY
+
ZZZZ 
@M00964: XXXXX
YYY
+
ZZZZ
@M00964: XXXXX
YYY
+
ZZZZ

and I would like to read this into a list, split into items at the ID code @M00964, i.e.:

['@M00964: XXXXX
YYY
+
ZZZZ',
'@M00964: XXXXX
YYY
+
ZZZZ',
'@M00964: XXXXX
YYY
+
ZZZZ']

I have tried using

in_file = open(fileName,"r")
sequences = in_file.read().split('@M00964')[1:]
in_file.close()

but this removes the ID sequence @M00964. Is there any way to keep this ID sequence in?

As an additional question, is there any way of keeping the line breaks as real whitespace in the list (rather than having \n symbols)?

My overall aim is to read in this set of items, take the first 2, for example, and write them back to a text file maintaining all of the original formatting.


Solution

  • Specific to your example, where every record is exactly four lines long, you could group the lines in fours:

    with open(fileName, 'r') as in_file:
        lines = in_file.readlines()  # each line keeps its trailing '\n'
    
    # Join every group of four lines back into one record (ID line included).
    new_list = [''.join(lines[i*4:(i+1)*4]) for i in range(len(lines) // 4)]
    # The same records with the newline characters stripped out.
    list_no_n = [item.replace('\n', '') for item in new_list]
    
    print(new_list)   # the list display shows line breaks as '\n'
    print(list_no_n)
    

    [EXPANDED FORM]

    new_list = []
    for i in range(len(lines) // 4):  # One iteration per group of four lines
        # Join four lines together into one string and add it to new_list
        new_list.append(''.join(lines[i*4:(i+1)*4]))
    

    [Writing to new file]

    write_list = new_list[:2]  # e.g. the first two records, line breaks intact
    with open(filename, 'w') as output_file:  # filename: the output path, not the input file
        output_file.writelines(write_list)
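
[Splitting on the ID instead]

The grouping above relies on every record being exactly four lines. If that is not guaranteed, one alternative is to split on the ID itself while keeping it in each item. The following is only a sketch: the helper name split_records and the sample text are made up for illustration, and it assumes Python 3.7 or later (where re.split can split on an empty match) and that no other line in the file begins with the ID.

```python
import re


def split_records(text, record_id="@M00964"):
    """Split text into records, keeping the ID at the start of each one.

    split_records is a hypothetical helper, not part of any library.
    """
    # (?m)^ anchors the match to the start of a line. The lookahead (?=...)
    # checks for the ID without consuming it, so the ID stays in the record.
    pattern = r"(?m)^(?=" + re.escape(record_id) + r")"
    # The split yields an empty string before the first record; drop it.
    return [record for record in re.split(pattern, text) if record]


# Illustrative stand-in for in_file.read()
sample = (
    "@M00964: XXXXX\nYYY\n+\nZZZZ\n"
    "@M00964: XXXXX\nYYY\n+\nZZZZ\n"
    "@M00964: XXXXX\nYYY\n+\nZZZZ\n"
)

records = split_records(sample)
first_two = "".join(records[:2])  # line breaks are still inside each record
```

Writing first_two to a file opened with 'w' should reproduce the first two records with their original line breaks. The \n characters are only visible when the list itself is printed; printing or writing an individual item shows real line breaks.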