I wrote a script that checks several result files to find out at which step of my pipeline each sequence was eliminated. Here is the script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
q = open('eg-not-sec.bait').readlines()
tm = open('eg_tm0_res').readlines()
ph = open('eg_ph01_res').readlines()
secp = open('eg_secp_res').readlines()
tp = open('eg_tp_res').readlines()
ps = open('eg_ps_res').readlines()
gpi = open('eg_es_final_ids').readlines()
nf = open('eg_elim-test', 'a')
for line in q:
    if line not in tm:
        nf.writelines('%sTMHMM\t'%line)
    elif line not in ph:
        nf.writelines('%sPH\t'%line)
    elif line not in secp:
        nf.writelines('%sSECP\t'%line)
    elif line not in tp:
        nf.writelines('%sTP\t'%line)
    elif line not in ps:
        nf.writelines('%sPS\t'%line)
    elif line not in gpi:
        nf.writelines('%sGPI\t'%line)
nf.close()
It would work perfectly if it weren't for one detail: the sequence ID that belongs with the last line ends up on the first line, and the label saying where it was eliminated is alone on the last line, like this:
EgrG_000049700.1
PH EgrG_000055800.1
PH EgrG_000133800.1
PH EgrG_000221600.1
PH EgrG_000324200.1
PH EgrG_000342900.1
PH EgrG_000391800.1
PH EgrG_000406000.1
PH EgrG_000428150.1
TMHMM EgrG_000430700.1
PH EgrG_000477400.1
PH EgrG_000498000.1
PH EgrG_000502700.1
TMHMM EgrG_000521200.1
PH EgrG_000566700.1
PH EgrG_000633500.1
PH EgrG_000690700.1
PH EgrG_000709300.1
PH EgrG_000823900.1
PH EgrG_000907100.1
PH EgrG_000925400.1
PH EgrG_000974700.1
PH EgrG_001061400.1
PH EgrG_001081300.1
PH EgrG_001136900.1
PH EgrG_001148800.1
PH EgrG_002005100.1
PH EgrG_002026400.1
PH EgrG_002058200.1
PH
It's simple to fix manually by copying the 'PH' from the last line and pasting it on the first line before the sequence ID, but I'd like to know how to fix this in my code, and I can't figure out how.
The readlines() method leaves the newlines at the end of each line in the returned list. So let's take this line of code for example...
nf.writelines('%sPH\t'%line)
This outputs one of your lines, complete with newline at the end. It then puts "PH" and a tab on the next line. And since it outputs no newline of its own, whatever you write next will appear on the same line as the PH.
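To make the effect visible, here is a minimal sketch that feeds two IDs from the output above through readlines() and prints the formatted strings with repr(). The in-memory io.StringIO object is only a stand-in for one of the real files, and the variable names are made up for illustration.

```python
import io

# Hypothetical in-memory stand-in for one of the ID files.
sample = io.StringIO("EgrG_000049700.1\nEgrG_000055800.1\n")
lines = sample.readlines()

# Each element returned by readlines() still ends with its newline.
first = lines[0]

# Order used in the question: ID, newline, then the label on the next line.
original = '%sPH\t' % first

# Swapped order: label, tab, ID, newline.
swapped = 'PH\t%s' % first

print(repr(first))     # 'EgrG_000049700.1\n'
print(repr(original))  # 'EgrG_000049700.1\nPH\t'
print(repr(swapped))   # 'PH\tEgrG_000049700.1\n'
```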
I think you want something like this:
nf.write("PH\t%s" % line)
to put things in the right order. Also note the use of write (which outputs a single string) instead of writelines (which outputs a sequence of strings): you were basically telling Python to output each character of your string individually rather than all at once.
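Putting this together, one possible way to restructure the whole loop is sketched below. It is not the only fix: the helper names (read_ids, first_missing_stage, report) are hypothetical, and stripping the newlines and using sets instead of lists are my own choices, made so that a missing newline at the end of a file cannot cause a false mismatch.

```python
def read_ids(path):
    """Read one ID per line into a set, dropping newlines and blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def first_missing_stage(seq_id, stages):
    """Return the label of the first stage whose ID set lacks seq_id, else None."""
    for label, ids in stages:
        if seq_id not in ids:
            return label
    return None


def report(query_ids, stages):
    """Build 'LABEL<TAB>ID' lines for every query ID that was eliminated."""
    out = []
    for seq_id in query_ids:
        label = first_missing_stage(seq_id, stages)
        if label is not None:
            out.append('%s\t%s\n' % (label, seq_id))
    return out


# Small in-memory example; the IDs 'a', 'b', 'c' are made up.
example_stages = [('TMHMM', {'a', 'b'}), ('PH', {'a'})]
example_lines = report(['a', 'b', 'c'], example_stages)
print(''.join(example_lines))
```

With the real files, stages would be built in pipeline order, for example [('TMHMM', read_ids('eg_tm0_res')), ('PH', read_ids('eg_ph01_res')), ...], and the result written with nf.writelines(report(query_ids, stages)). That is a case where writelines fits, because report returns a list of complete lines.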