Search code examples
pythonawkseddna-sequencefastq

find a DNA barcode with mismatches in sequence


I have 36-nt reads like this: atcttgttcaatggccgatcXXXXgtcgacaatcaa in the fastq file with XXXX being the different barcodes. I want to search for a barcode in the file at exact position(21 to 24) and print the sequences with up to 3 mismatches in sequence not barcode.

For example: I have barcode: aacg search that barcode between position 21 to 24 in fastq file with allowing 3 mismatches in the sequence like:

atcttgttcaatggccgatcaacggtcgacaatcac # it has 1 mismatch
ttcttgttcaatggccgatcaacggtcgacaatcac # it has 2 mismatch
tccttgttcaatggccgatcaacggtcgacaatcac # it has 3 mismatch

I was trying to find unique lines first using awk and look for mismatches but it is very tedious for me to look and find them.

awk 'NR%4==2' 1.fq |sort|uniq -c|awk '{print $1"\t"$2}' > out1.txt

Is there any quick way i can find?

Thank you.


Solution

  • Using Python:

    strs = "atcttgttcaatggccgatcaacggtcgacaatcaa"
    
    with open("1.fq") as f:
        for line in f:
            if line[20:24] == "aacg":
                line = line.strip()
                mismatches = sum(x!=y for x, y  in zip(strs, line))
                if mismatches <= 3:
                    print line, mismatches
    
    atcttgttcaatggccgatcaacggtcgacaatcac 1
    ttcttgttcaatggccgatcaacggtcgacaatcac 2
    tccttgttcaatggccgatcaacggtcgacaatcac 3