Search code examples
pandasawktext-extraction

Extracting lines based on comma-separated string in another file and write extracted lines to file


I have a tab-delimited file that lists the genomeID in the first column and its respective contigIDs. The contigIDs are comma-separated within the second column (example below)

424182.1        H|S1|C933685,H|S1|C449562,H|S1|C172291,H|S1|C1169825
1217675.1       H|S1|C1168525,H|S1|C573086,H|S1|C357867,H|S1|C85072,H|S1|C965427,H|S1|C1724718
585503.1        H|S1|C874141,H|S1|C529585

I have another file called SAMPLE.fasta that contains contigIDs and the respective sequences in the next line for each contigID (example below)

>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
etc...
etc...
etc..

Based on this information, I would like to create a separate file for each genomeID (example(s) below)

Output_file: 424182.1.fasta

>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

Output_file: 1217675.1.fasta

>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG

Would appreciate any help with awk and/or pandas. Thank you in advance for your help with this!


Solution

  • Assuming the sequence data ends in a single line (without extending over multiple lines), how about an awk solution:

    awk -F'\t' '
        NR==FNR {                                   # process SAMPLE.fasta file
            if (FNR % 2) {                          # odd line with contigID
                len = split($0, a, "|")             # extract the contigID
                id = a[len]
                seq[id] = $0                        # assign seq[id] to the line
            } else {                                # even line with sequence
                seq[id] = seq[id] RS $0             # append sequence to seq[id]
            }
            next
        }
        {                                           # process contigIDs file
            fname = $1 ".fasta"                     # filename to write
            len = split($2, a, ",")                 # split the contigIDs
            for (i = 1; i <= len; i++) {
                split(a[i], b, "|")                 # extract the contigID
                if (b[3] in seq) {                  # if the sequence is found
                    print seq[b[3]] > fname         # then print it to the file
                }
            }
            close(fname)
        }
    ' SAMPLE.fasta contigIDs
    

    Output:

    424182.1.fasta file:
    >H|S1|C933685
    GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
    
    1217675.1.fasta file:
    >H|S1|C85072
    GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
    >H|S1|C965427
    CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG