
Splitting a string into a list of integers with Python


This method takes a file name and the directory that contains the file. The file holds a matrix of data, and the method needs to copy the first 20 columns of each row that come after the row number and the corresponding row letter. The first 3 lines of each file are skipped because they contain information that is not needed, and the data at the bottom of the file is not needed either.

For example, a file would look like:

unimportant information--------
 unimportant information--------
 -blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------

The method needs to output a "matrix" in a given form.

So far the output is a list with each row as a string, and I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files, how to retrieve only the first 20 columns after the letter in each row, or how to ignore the row number and the row letter.

import os

def pssmMatrix(self, ipFileName, directory):
    my_lst = []

    # go through every file in the fasta folder
    for f in os.listdir(directory):
        # split the file name into its base name and its extension
        file, file_ext = os.path.splitext(f)

        if file == ipFileName:
            with open(os.path.join(directory, f)) as file_object:
                # skip the 3 header lines
                for _ in range(3):
                    next(file_object)
                # collapse runs of whitespace in every remaining line
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst

Expected results:

['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']

Actual results:

['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'],  [' '], [' unimportant info'], ['unimportant info']  

Solution

  • Try this solution.

        import re

        # Match the run of integers that follows "<row number> <row letter> "
        reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

        text = """
        unimportant information--------

        unimportant information--------
        -blank line

        1 F -1 2 -3 4 5 6 7 (more columns of ints)

        2 L 3 -1 3 4 0 -2 1 (more columns of ints)

        3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""

        ignore_start = 5  # skip the header lines (indexes 0-4 in this sample)
        expected_array = []
        for index, line in enumerate(text.splitlines()):
            if index >= ignore_start:
                match = reg.search(line)
                if match:
                    # Append the matched string as is; ' '.join() on a string
                    # would put a space between every character.
                    expected_array.append(match.group(0).strip())

        print(expected_array)
        # Result: ['-1 2 -3 4 5 6 7', '3 -1 3 4 0 -2 1', '3 -1 3 6 0 -2 5']
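
The regex approach returns each row as a string and does not cap the number of columns. As an alternative, here is a minimal sketch based on `str.split()` that returns lists of integers instead. It assumes every data row has the form "row number, row letter, integers", and that the first blank line after the data marks the start of the footer. The names `parse_rows`, `skip`, and `n_cols` are hypothetical and not part of the original code.

```python
from itertools import islice


def parse_rows(lines, skip=3, n_cols=20):
    """Return the first n_cols integers of each data row.

    Assumption: each data row looks like "<row number> <letter> <ints...>"
    and the data block ends at the first blank line that follows it.
    """
    rows = []
    # islice works for both a list of strings and an open file object.
    for line in islice(lines, skip, None):
        fields = line.split()
        if not fields:
            if rows:
                break      # blank line after the data: the footer starts here
            continue       # blank line before the data: keep looking
        # Drop the row number and the row letter, keep the next n_cols fields.
        values = fields[2:2 + n_cols]
        try:
            rows.append([int(v) for v in values])
        except ValueError:
            break          # a non-numeric field means this is not a data row
    return rows


sample = [
    "unimportant information--------",
    "unimportant information--------",
    "",
    "1 F -1 2 -3 4 5 6 7",
    "2 L 3 -1 3 4 0 -2 1",
    "3 A 3 -1 3 6 0 -2 5",
    "",
    "unimportant information--------",
]
matrix = parse_rows(sample)
print(matrix)
```

With a real file, the same function could be called as `parse_rows(file_object)` inside a `with open(...)` block, since it only iterates over the lines once.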