extract position of char from each row & provide an aggregate of position across a list

I need some Python help with this problem and would appreciate any assistance. Thanks.

I need an extracted matrix of values enclosed between square brackets. A toy example is below:

The input will be a .txt file like this:

AB_1 Q[A]IHY[P]GVA

AB_2 Q[G][C]HY[R]GVA

AB_3 Q[G][C]HY[R]GV[D]

Desired output (out.txt): The script finds the index of every character enclosed in square brackets "[]" in each row of the input and aggregates those index positions across the entire list. The aggregated index is then used to extract those positions from every row of the input file and produce a matrix as below.

Index 2,3,6,9

AB_1 [A],I,[P],A

AB_2 [G],[C],[R],A

AB_3 [G],[C],[R],[D]

Any help would be greatly appreciated. Thanks.

Solution

If you want to reduce your table to only those columns in which an entry in square brackets appears, you can go with this:

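import re

def transpose(matrix):
    # Swap rows and columns; assumes every row has the same length.
    return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

with open("test_table.txt", "r") as f:
    content = f.read()

# Split each sequence into cells: "[X]" is one cell, any other character is one cell.
# Blank lines (e.g. from a trailing newline) are skipped so line.split()[1] cannot fail.
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n") if line.strip()]

columns = transpose(rows)
# Keep only columns that contain a bracketed entry, prefixed with their 1-based index.
matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]

# Back to rows: the first row now holds the indices, the remaining rows hold the data.
matching_rows = transpose(matching_columns)

headline = ["Index {}".format(",".join(matching_rows[0]))]
# Row labels are regenerated as AB_1, AB_2, ... rather than read from the input.
target_table = headline + ["AB_{0} {1}".format((i + 1), ",".join(line)) for i, line in enumerate(matching_rows[1:])]

with open("out.txt", "w") as f:
    f.write("\n".join(target_table))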

First of all, you want the content of your .txt file to be represented in arrays. Unfortunately, your input data has no separators (as .csv files do), so you need to take care of that yourself. To split a string like "Q[A]IHY[P]GVA" into cells, I would recommend working with regular expressions.

import re
cells = re.findall(r'(\[.\]|.)', "Q[A]IHY[P]GVA")

The pattern within the r'' string matches a single character within square brackets or just a single character. The re.findall() method returns a list of all matching substrings, in this case: ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A']

The list comprehension that builds rows applies this method to every line in your file. The line.split()[1] part leaves out the row label "AB_X", since it is not needed for finding the columns.
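If you would rather keep the original row labels than drop them, something like the following might work. It is only a sketch: parse_line and CELL are hypothetical names that are not part of the answer's code, and it assumes every non-blank line has exactly one label and one sequence separated by whitespace.

```python
import re

# Hypothetical helper names (not from the answer above).
# "[X]" is matched as one cell, any other single character as one cell.
CELL = re.compile(r'\[.\]|.')

def parse_line(line):
    # Assumes exactly two whitespace-separated fields: label and sequence.
    label, sequence = line.split()
    return label, CELL.findall(sequence)

sample = "AB_1 Q[A]IHY[P]GVA\n\nAB_2 Q[G][C]HY[R]GVA\n"

# Skip blank lines so that line.split() always yields two fields.
parsed = [parse_line(line) for line in sample.splitlines() if line.strip()]
labels = [label for label, _ in parsed]
rows = [cells for _, cells in parsed]
```

Here labels holds the names from the file and rows holds the cell lists, so the labels can be reattached when writing the output.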

Having your data arranged by columns is more fitting, because you want to keep all columns that match a certain condition (they contain an entry in brackets). For this you can just transpose rows, which is what the transpose() function does. If you have NumPy imported, numpy.transpose(rows) does the same job, but it returns an array rather than a list of lists, so you would need to convert the columns back to lists before concatenating them with the index.
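As an alternative to a hand-written transpose, the built-in zip can be used. This is a small sketch on shortened example rows, not the code from the answer. Note that zip stops at the shortest row, so it assumes all rows have the same number of cells.

```python
# Shortened example rows (three cells each) to keep the result readable.
rows = [
    ['Q', '[A]', 'I'],
    ['Q', '[G]', '[C]'],
]

# zip(*rows) yields one tuple per column; convert to lists so that
# list concatenation such as [str(i + 1)] + column still works.
columns = [list(column) for column in zip(*rows)]

# Transposing a second time gives the original rows back.
restored = [list(row) for row in zip(*columns)]
```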

Next you want all columns that satisfy the condition "[" in "".join(column), i.e. columns in which at least one row has a bracketed entry. This is done in one line by: matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]. Here [str(i + 1)] prepends the 1-based column index, which later becomes the "Index" headline.
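The same selection can also be done without transposing, by first collecting the positions and then picking them from each row, which is closer to how the question describes it. This is a sketch under the assumption that all rows have the same number of cells. The names positions and selected are made up for illustration.

```python
# Cell lists for the three example rows from the question.
rows = [
    ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A'],
    ['Q', '[G]', '[C]', 'H', 'Y', '[R]', 'G', 'V', 'A'],
    ['Q', '[G]', '[C]', 'H', 'Y', '[R]', 'G', 'V', '[D]'],
]

# Aggregate the 1-based positions of bracketed cells across all rows.
positions = sorted({
    i + 1
    for row in rows
    for i, cell in enumerate(row)
    if cell.startswith("[")
})

# Pick those positions from every row (assumes equal row lengths).
selected = [[row[p - 1] for p in positions] for row in rows]
```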

The rest now is easy: Transpose the columns back to rows, relabel the rows, format the row data into strings that fit your desired output format and then write those strings to the out.txt file.
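Putting the steps together, an end-to-end variant that works on a string instead of a file and keeps the labels from the input could look roughly like this. The function name build_table is an assumption for illustration, and the sketch assumes every non-blank line has one label, one sequence, and the same number of cells.

```python
import re

# "[X]" is one cell, any other single character is one cell.
CELL = re.compile(r'\[.\]|.')

def build_table(text):
    # Hypothetical helper: returns the output table as one string.
    labels, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, sequence = line.split()
        labels.append(label)
        rows.append(CELL.findall(sequence))

    # 1-based positions where any row has a bracketed cell.
    positions = sorted({
        i + 1
        for row in rows
        for i, cell in enumerate(row)
        if cell.startswith("[")
    })

    out = ["Index " + ",".join(str(p) for p in positions)]
    for label, row in zip(labels, rows):
        out.append(label + " " + ",".join(row[p - 1] for p in positions))
    return "\n".join(out)

sample = "AB_1 Q[A]IHY[P]GVA\nAB_2 Q[G][C]HY[R]GVA\nAB_3 Q[G][C]HY[R]GV[D]\n"
result = build_table(sample)
```

To write the file, pass the contents of the input file to build_table and write the returned string to out.txt.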