I have 36-nt reads like this: atcttgttcaatggccgatcXXXXgtcgacaatcaa
in the fastq file
with XXXX being the different barcodes. I want to search for a barcode in the file at exact position(21 to 24) and print the sequences with up to 3 mismatches in sequence not barcode.
For example:
I have barcode: aacg
search that barcode between position 21 to 24 in fastq file with allowing 3 mismatches in the sequence like:
atcttgttcaatggccgatcaacggtcgacaatcac # it has 1 mismatch
ttcttgttcaatggccgatcaacggtcgacaatcac # it has 2 mismatch
tccttgttcaatggccgatcaacggtcgacaatcac # it has 3 mismatch
I was trying to find unique lines first using awk and look for mismatches but it is very tedious for me to look and find them.
awk 'NR%4==2' 1.fq |sort|uniq -c|awk '{print $1"\t"$2}' > out1.txt
Is there any quick way i can find?
Thank you.
Using Python:
strs = "atcttgttcaatggccgatcaacggtcgacaatcaa"
with open("1.fq") as f:
for line in f:
if line[20:24] == "aacg":
line = line.strip()
mismatches = sum(x!=y for x, y in zip(strs, line))
if mismatches <= 3:
print line, mismatches
atcttgttcaatggccgatcaacggtcgacaatcac 1
ttcttgttcaatggccgatcaacggtcgacaatcac 2
tccttgttcaatggccgatcaacggtcgacaatcac 3