Search code examples
python-3.xfastq

How can I write matching lines between two files as well as the line which follows a match


I have two fastq files that I want to compare. The structure of the fastq files are as follows (per read):

  • Line 1 = The header, containing the read ID. This line always begins with '@'.
  • Line 2 = The sequence of bases for that particular read (A, G, C, or T)
  • Lines 3 & 4 = Contain quality score information for the read, which I don't need.

The two files I am working with are:

  • File1 = Contains specific sequences of bases which I am interested in, along with their corresponding header lines (2 lines per read)

  • File2 = A normal fastq file for comparison (4 lines per read)

I need to compare file 1 and 2 to find all matching header lines, then save the corresponding sequence lines from file2.

Where file 1:

@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA

And file 2:

@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGN
+
#1=BDDFFHHHFHIJJJJJJJJJJJJJJJJJJJJJIJJIJJJJJHJIIJHGIJJJJJJIHJJBGHJHIIJJJHHHHFFFFEEEDD;?BACDDDA?@CDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACNNN
+
#1=DDDFDFHHHGHIIGJJJJHIJIHHDIHHIJGGEI@GFGHIHIJHEFHIIIIGIJGHHGECFGIDHGIHIIEGIIJHHEEFFF7?ACEECCBBDEDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1200:1972 1:N:0:GGCTAC
NTACGTTTAGTAGAGACAGTGTCTTGCTATGTTGCCCAGGCTGGTCTCAAACTCCTGAGCTCTAGCAAGCCTTCCACCTCTGCCTCCCAGTGTTCTGGGAT
+
#1=DDDDFHHHBHGIGIIJHCDHHIJJJHEGFIIHFHGEGHJEIFHHHEFHHGIGIJEHIIJJJJIJIJIJGIIH.?CEFFFFDCEDD3>>@CDDDDDD<@

In other words, for every matching header line, I want to save its partnered sequence line.

Finally, I want to write all these headers and sequences to a new file, for downstream analysis.

My current output is:

@HWI-D00461:137:C9H2FACXX:3:1101:1357:1984 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1755:2000 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1260:1977 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1917:1984 1:N:0:GGCTAC

My desired output would be:

@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA

Here is what I have so far.

ids = ''
with open(no_adapter_file, 'r') as file1:
    with open(comparison_file, 'r') as file2:
        common = set(file1).intersection(file2)
        for line in common:
            if line[0] == '@'
                ids += line

with open(comparison_file, 'r') as file2:
    ids_seq = ''
    for line in file2:
        if line == ids:
            line += ids_seq

with open(new_file, 'w') as file_out:
    for line in ids_seq:
        file_out.write(line)
print(new_file + " was created.")

The code successfully pulls out all matching header lines, but I don't know how to pull out the sequence lines which follow.

EDIT: Added some examples of input and expected output.


Solution

  • There are several ways to tackle your problem. This is one that assumes nothing about the sorting of the files (which could be quite useful to avoid the potentially high memory consumption). It also assumes nothing about the possibility of having duplicate IDs (which could also reduce the memory footprint in case the reads are not guaranteed to be in the same order).

    # First, get the IDs we are interested in
    ids = set()
    with open(no_adapter_file, 'r') as file1:
        line_number = 0
        for line in file1:
            line_number += 1
            if line_number % 2 == 1:
                ids.add(line.rstrip())
    
    # Now, read the second file and print desired info if the ID is
    # one of the interesting ones
    with open(comparison_file, 'r') as file2, open(new_file, 'w') as file_out:
        line_number = 0
        for line in file2:
            line_number += 1
            if line_number % 4 == 1:
                candidate_id = line.rstrip()
            elif line_number % 4 == 2:
                candidate_seq = line.rstrip()
                if candidate_id in ids:
                    file_out.write("{}\n{}\n".format(candidate_id, candidate_seq))
    
    print(new_file + " was created.")
    

    In this code, we first read the first file and store the IDs in a set. Then, this set is used to check if the reads coming from the second file must be written or not.