I have two files:
I want to loop through both files, if a sequence name in the fasta file is also in either the 'Hcolumn' or 'Kcolumn', then add the sequence to the corresponding sequence column in the excel file, which is either the 'Hsequence' or the 'Ksequence' column. My idea here is to get the location of the vh value, and add its seq value to the cell on its right (the next row).
Update with examples
fasta file (ignoring the > sign):
headerA
AAAGGCCT
headerB
ATCCTTTG
headerC
GGGGTCCCAAT
excel sheet
Hcolumn Hsequence Kcolumn Ksequence
headerA headerB
headerC headerE
headerD headerF
desired outcome:
Hcolumn Hsequence Kcolumn Ksequence
headerA AAAGGCCT headerB ATCCTTTG
headerC GGGGTCCCAAT headerE
headerD headerF
my new code: (my code below only deals with one column, i don't yet know how to say if it's in columnA OR in columnB)
for line in f:
seq = f.readline()
for vh in data["Hcolumn"]:
vh = vh.rstrip()
if (title in vh):
#print(">"+title, seq)
data["HSequence"]=data["Hcolumn"].apply(lambda vh: seq)
else:
data["HSequence"]=data["Hcolumn"].apply(lambda vh: '')
break
return data.to_csv('Fullseqfile.txt', sep='\t')
the sequence column returned null, can someone help thanks
In [1]: import pandas as pd
In [2]: !cat data/question1/file.fasta
headerA
AAAGGCCT
headerB
ATCCTTTG
headerC
GGGGTCCCAAT
In [3]: xls=pd.read_excel('file.xls')
In [4]: xls
Out[4]:
Hcolumn Hsequence Kcolumn Ksequence
0 headerA NaN headerB NaN
1 headerC NaN headerE NaN
2 headerD NaN headerF NaN
In [5]: fh = open('file.fasta')
...: fasta_dic={}
...: for line in fh:
...: if line.startswith('h'):
...: seq_header = line.strip('\n')
...: fasta_dic[seq_header] = ''
...: else:
...: fasta_dic[seq_header] = line.strip('\n')
...:
In [6]: def fill_seq(x):
...: if x in fasta_dic.keys():
...: return fasta_dic[x]
...: else:
...: return ''
...:
In [7]: xls['Hsequence'] = xls['Hcolumn'].apply(fill_seq)
...: xls['Ksequence'] = xls['Kcolumn'].apply(fill_seq)
...:
In [8]: xls
Out[8]:
Hcolumn Hsequence Kcolumn Ksequence
0 headerA AAAGGCCT headerB ATCCTTTG
1 headerC GGGGTCCCAAT headerE
2 headerD headerF
Builds a dictionary fasta_dic
with the sequences names as key and the sequences as value.
The function fill_seq
checks if the input x
it's in the dictionary you defined before, if it finds the value it will return the sequence.
You apply the function fill_seq
to the H/K sequence columns using the values from H/K sequence as input.
After this you can keep working with your dataframe or export it to a xls file.