Search code examples
bashawkgrep

Grep a row and retain only matching rows based on a specific column in bash


I'm working on a project where I need to extract specific gene neighborhood information from a dataset based on protein accession IDs. I have a series of .acc files containing protein accession IDs, and I need to perform the following tasks:

  1. Read each accession ID from the .acc file.
  2. Search for the accession in a corresponding gene neighborhood file, which contains a large amount of data in a specific format. The relevant lines of this file look like this:
7897    Latimeria_chalumnae 102356403   FLII    FLII_actin_remodeling_protein   275088  356473  -       NW_005819398.1  XP_014344771.1
7897    Latimeria_chalumnae 102352677   MIEF2   mitochondrial_elongation_factor_2   356900  369656  +       NW_005819398.1  XP_005997516.1|XP_005997517.1|XP_014344772.1
7897    Latimeria_chalumnae 102352038   TOP3A   DNA_topoisomerase_III_alpha 381055  421312  -       NW_005819398.1  XP_005997514.1|XP_014344773.1
7897    Latimeria_chalumnae 102351764   SMCR8A  Smith-Magenis_syndrome_chromosome_region,_candidate_8a  421448  432658  +       NW_005819398.1  XP_005997512.1
7897    Latimeria_chalumnae 102353118   SHMT1   serine_hydroxymethyltransferase_1_(soluble) 453487  501768  -       NW_005819398.1  XP_005997518.2
7897    Latimeria_chalumnae 102353382   PRPSAP2 phosphoribosyl_pyrophosphate_synthetase-associated_protein_2    511591  561160  +       NW_005819398.1  XP_005997519.1|XP_005997520.1|XP_005997521.1|XP_005997522.1|XP_014344774.1|XP_014344775.1
7897    Latimeria_chalumnae 106703785   SLC5A10 solute_carrier_family_5_member_10   577897  737478  +       NW_005819398.1  XP_014344776.1
7897    Latimeria_chalumnae 102354157   FAM83GA family_with_sequence_similarity_83_member_Ga    653729  709964  -       NW_005819398.1  XP_005997523.1
7897    Latimeria_chalumnae 102356925   LOC102356925    uncharacterized_protein_KIAA0195-like   818471  943407  +       NW_005819398.1  XP_014344777.1
7897    Latimeria_chalumnae 102357171   LOC102357171    uncharacterized_protein_KIAA0195-like   971087  1041031 +       NW_005819398.1  XP_014344778.1
7897    Latimeria_chalumnae 106703788   LOC106703788    zinc_finger_MYM-type_protein_1-like 1095688 1096413 -       NW_005819398.1  XP_014344779.1
7897    Latimeria_chalumnae 102354700   EPN2    epsin_2 1150804 1258003 +       NW_005819398.1  XP_005997525.1|XP_005997526.1
7897    Latimeria_chalumnae 102355121   LOC102355121    melanin-concentrating_hormone_receptor_1-like   1369192 1370937 +       NW_005819398.1  XP_005997527.1
7897    Latimeria_chalumnae 102355380   B9D1    B9_protein_domain_1 1397029 1461568 -       NW_005819398.1  XP_005997528.1|XP_005997529.1
7897    Latimeria_chalumnae 102358598   LOC102358598    peroxiredoxin-2-like    2491    40092   -       NW_005819399.1  XP_014344786.1
7897    Latimeria_chalumnae 102357443   S1PR5A  sphingosine-1-phosphate_receptor_5a 44525   60866   -       NW_005819399.1  XP_005997536.1
7897    Latimeria_chalumnae 102357702   ZBTB40  zinc_finger_and_BTB_domain_containing_40    331820  386431  -       NW_005819399.1  XP_014344780.1|XP_014344781.1|XP_014344782.1
7897    Latimeria_chalumnae 102358331   WNT4    Wnt_family_member_4 1146510 1165850 +       NW_005819399.1  XP_014344784.1
7897    Latimeria_chalumnae 102358869   CDC42   cell_division_cycle_42  1367899 1453373 -       NW_005819399.1  XP_014344785.1
7897    Latimeria_chalumnae 102359147   LOC102359147    phosphoserine_aminotransferase-like 1152    12401   -       NW_005819400.1  XP_005997543.2
7897    Latimeria_chalumnae 102360660   LOC102360660    phosphoserine_aminotransferase-like 15454   41930   -       NW_005819400.1  XP_014344787.1
7897    Latimeria_chalumnae 102360916   CEP78   centrosomal_protein_78  69138   156223  -       NW_005819400.1  XP_005997550.1
7897    Latimeria_chalumnae 102359416   LOC102359416    guanine_nucleotide-binding_protein_G(q)_subunit_alpha   239201  464669  +       NW_005819400.1  XP_005997544.1|XP_005997545.1
7897    Latimeria_chalumnae 102359865   LOC102359865    guanine_nucleotide-binding_protein_subunit_alpha-14-like    532847  621010  +       NW_005819400.1  XP_005997546.2
7897    Latimeria_chalumnae 102360136   LOC102360136    guanine_nucleotide-binding_protein_subunit_alpha-14 728622  742912  +       NW_005819400.1  XP_005997547.1
7897    Latimeria_chalumnae 102361174   VPS13A  vacuolar_protein_sorting_13_homolog_A   760611  959552  -       NW_005819400.1  XP_014344789.1
7897    Latimeria_chalumnae 102361432   LOC102361432    C2_calcium-dependent_domain-containing_protein_4C-like  971747  987261  -       NW_005819400.1  XP_014344790.1
7897    Latimeria_chalumnae 102360402   FOXB2   forkhead_box_B2 1123429 1124373 -       NW_005819400.1  XP_005997548.1
7897    Latimeria_chalumnae 106703789   LOC106703789    protein_prune_homolog_2-like    1225384 1483856 +       NW_005819400.1  XP_014344791.1
7897    Latimeria_chalumnae 102361698   TBC1D4  TBC1_domain_family,_member_4    313607  553482  -       NW_005819401.1  XP_005997553.1|XP_005997555.1|XP_005997556.1|XP_005997561.1|XP_014344792.1|XP_014344793.1|XP_014344794.1
7897    Latimeria_chalumnae 102367238   COMMD6  COMM_domain_containing_6    567595  585926  -       NW_005819401.1  XP_014344795.1
  1. For each protein accession, I want to: a. Retrieve the matching line and the 5 lines above and below it. b. From this subset, filter to keep only those lines where the 9th column (the genomic accession) matches the genomic accession of the main line (the one containing the protein accession).

Example

Suppose I have a protein accession 'XP_005997536.1' in the .acc file and I use grep to extract that row. The relevant lines surrounding this accession might look like this:

7897    Latimeria_chalumnae 106703788   LOC106703788    zinc_finger_MYM-type_protein_1-like 1095688 1096413 -       NW_005819398.1  XP_014344779.1
7897    Latimeria_chalumnae 102354700   EPN2    epsin_2 1150804 1258003 +       NW_005819398.1  XP_005997525.1|XP_005997526.1
7897    Latimeria_chalumnae 102355121   LOC102355121    melanin-concentrating_hormone_receptor_1-like   1369192 1370937 +       NW_005819398.1  XP_005997527.1
7897    Latimeria_chalumnae 102355380   B9D1    B9_protein_domain_1 1397029 1461568 -       NW_005819398.1  XP_005997528.1|XP_005997529.1
7897    Latimeria_chalumnae 102358598   LOC102358598    peroxiredoxin-2-like    2491    40092   -       NW_005819399.1  XP_014344786.1
**7897  Latimeria_chalumnae 102357443   S1PR5A  sphingosine-1-phosphate_receptor_5a 44525   60866   -       NW_005819399.1  XP_005997536.1**
7897    Latimeria_chalumnae 102357702   ZBTB40  zinc_finger_and_BTB_domain_containing_40    331820  386431  -       NW_005819399.1  XP_014344780.1|XP_014344781.1|XP_014344782.1
7897    Latimeria_chalumnae 102358331   WNT4    Wnt_family_member_4 1146510 1165850 +       NW_005819399.1  XP_014344784.1
7897    Latimeria_chalumnae 102358869   CDC42   cell_division_cycle_42  1367899 1453373 -       NW_005819399.1  XP_014344785.1
7897    Latimeria_chalumnae 102359147   LOC102359147    phosphoserine_aminotransferase-like 1152    12401   -       NW_005819400.1  XP_005997543.2
7897    Latimeria_chalumnae 102360660   LOC102360660    phosphoserine_aminotransferase-like 15454   41930   -       NW_005819400.1  XP_014344787.1

From this, I want to filter out any lines that do not contain NW_005819399.1 in the 9th column.

What I've Tried

I attempted to write a bash script to achieve this, but I'm running into issues with filtering the results correctly and ensuring that the output format remains clean. Here's a snippet of my current approach:

#!/usr/bin/bash

# Function to process each line of the file
process_line() {
    local line="$1"
    local file="$2"
    local index="$3"

    # Extract the protein accession (10th column) and genomic accession (9th column)
    local protein_accession=$(echo "$line" | awk '{print $10}')
    local genomic_accession=$(echo "$line" | awk '{print $9}')

    # Get the surrounding lines based on the protein accession
    local temp_lines=$(grep -w -m 1 -A5 -B5 "$protein_accession" gene_neighborhood.final)

    # Save the surrounding lines to a temporary file
    echo "$temp_lines" > "/tmp/temp_neighborhood_$index.txt"

    # Filter the lines based on genomic accession presence
    awk -v acc="$genomic_accession" '$9 == acc' /tmp/temp_neighborhood_$index.txt >> "${file}.${index}.gene.neighborhood"
}

# Function to process each file
process_file() {
    local file="$1"
    local index=1

    while IFS= read -r line; do
        process_line "$line" "$file" "$index"
        ((index++))
    done < "$file"

    # Clean up temporary files
    rm /tmp/temp_neighborhood_*.txt
}

# Main script
for file in *.acc; do
    process_file "$file"
done

What I was expecting:

7897    Latimeria_chalumnae     102358598       LOC102358598    peroxiredoxin-2-like    2491    40092   -       NW_005819399.1  XP_014344786.1
7897    Latimeria_chalumnae     102357443       S1PR5A  sphingosine-1-phosphate_receptor_5a     44525   60866   -       NW_005819399.1  XP_005997536.1
7897    Latimeria_chalumnae     102357702       ZBTB40  zinc_finger_and_BTB_domain_containing_40        331820  386431  -       NW_005819399.1  XP_014344780.1|XP_014344781.1|XP_014344782.1
7897    Latimeria_chalumnae     102358331       WNT4    Wnt_family_member_4     1146510 1165850 +       NW_005819399.1  XP_014344784.1
7897    Latimeria_chalumnae     102358869       CDC42   cell_division_cycle_42  1367899 1453373 -       NW_005819399.1  XP_014344785.1

What I got as an output:

XP_005997536.1

Can anyone suggest a more reliable approach or provide a full example of how to implement this in bash? I’m looking for clear steps, any potential pitfalls, and best practices for processing text files in this manner.


Solution

  • From your description of your requirements it sounds like this is what you're trying to do, using a 2-pass approach with any awk:

    $ awk '
        NR==FNR { if (/XP_005997536\.1/) {beg=NR-5; end=NR+5; bad=$9} next }
        (beg <= FNR) && (FNR <= end) && ($9 != bad)
    ' file file
    7897    Latimeria_chalumnae 106703788   LOC106703788    zinc_finger_MYM-type_protein_1-like 1095688 1096413 -       NW_005819398.1  XP_014344779.1
    7897    Latimeria_chalumnae 102354700   EPN2    epsin_2 1150804 1258003 +       NW_005819398.1  XP_005997525.1|XP_005997526.1
    7897    Latimeria_chalumnae 102355121   LOC102355121    melanin-concentrating_hormone_receptor_1-like   1369192 1370937 +       NW_005819398.1  XP_005997527.1
    7897    Latimeria_chalumnae 102355380   B9D1    B9_protein_domain_1 1397029 1461568 -       NW_005819398.1  XP_005997528.1|XP_005997529.1
    7897    Latimeria_chalumnae 102359147   LOC102359147    phosphoserine_aminotransferase-like 1152    12401   -       NW_005819400.1  XP_005997543.2
    7897    Latimeria_chalumnae 102360660   LOC102360660    phosphoserine_aminotransferase-like 15454   41930   -       NW_005819400.1  XP_014344787.1
    

    but that doesn't produce the expected output you posted so something isn't clear about your requirements for this part of your problem. Your example is also missing 1 of the 2 types of input file you mention so there's no point me trying to investigate the differences yet BUT I think this is what you're really trying to do overall (foo.acc contains the target list of ids, one per line, and bar.gen is the sample input file provided in the question):

    $ head foo.acc bar.gen
    ==> foo.acc <==
    XP_005997536.1
    XP_005997543.2
    
    ==> bar.gen <==
    7897    Latimeria_chalumnae 102356403   FLII    FLII_actin_remodeling_protein   275088  356473  -       NW_005819398.1  XP_014344771.1
    7897    Latimeria_chalumnae 102352677   MIEF2   mitochondrial_elongation_factor_2   356900  369656  +       NW_005819398.1  XP_005997516.1|XP_005997517.1|XP_014344772.1
    7897    Latimeria_chalumnae 102352038   TOP3A   DNA_topoisomerase_III_alpha 381055  421312  -       NW_005819398.1  XP_005997514.1|XP_014344773.1
    7897    Latimeria_chalumnae 102351764   SMCR8A  Smith-Magenis_syndrome_chromosome_region,_candidate_8a  421448  432658  +       NW_005819398.1  XP_005997512.1
    7897    Latimeria_chalumnae 102353118   SHMT1   serine_hydroxymethyltransferase_1_(soluble) 453487  501768  -       NW_005819398.1  XP_005997518.2
    7897    Latimeria_chalumnae 102353382   PRPSAP2 phosphoribosyl_pyrophosphate_synthetase-associated_protein_2    511591  561160  +       NW_005819398.1  XP_005997519.1|XP_005997520.1|XP_005997521.1|XP_005997522.1|XP_014344774.1|XP_014344775.1
    7897    Latimeria_chalumnae 106703785   SLC5A10 solute_carrier_family_5_member_10   577897  737478  +       NW_005819398.1  XP_014344776.1
    7897    Latimeria_chalumnae 102354157   FAM83GA family_with_sequence_similarity_83_member_Ga    653729  709964  -       NW_005819398.1  XP_005997523.1
    7897    Latimeria_chalumnae 102356925   LOC102356925    uncharacterized_protein_KIAA0195-like   818471  943407  +       NW_005819398.1  XP_014344777.1
    7897    Latimeria_chalumnae 102357171   LOC102357171    uncharacterized_protein_KIAA0195-like   971087  1041031 +       NW_005819398.1  XP_014344778.1
    

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk -v OFS='\t' '
        FNR == 1 {
            argind++
        }
        argind == 1 {
            tgt_accs[$1]
        }
        argind == 2 {
            split($10,cur_accs,"|")
            for (i in cur_accs) {
                acc = cur_accs[i]
                if ( acc in tgt_accs ) {
                    begs[acc] = FNR - 5
                    ends[acc] = FNR + 5
                    bads[acc] = $9
                }
            }
        }
        argind == 3 {
            for ( acc in begs ) {
                beg = begs[acc]
                end = ends[acc]
                bad = bads[acc]
                if ( (beg <= FNR) && (FNR <= end) && ($9 != bad) ) {
                    print FNR, acc, $0
                }
            }
        }
    ' "$1" "$2" "$2" |
    sort -k2,2 -k1,1n |
    cut -f2-
    

    $ ./tst.sh foo.acc bar.gen
    XP_005997536.1  7897    Latimeria_chalumnae 106703788   LOC106703788    zinc_finger_MYM-type_protein_1-like 1095688 1096413 -       NW_005819398.1  XP_014344779.1
    XP_005997536.1  7897    Latimeria_chalumnae 102354700   EPN2    epsin_2 1150804 1258003 +       NW_005819398.1  XP_005997525.1|XP_005997526.1
    XP_005997536.1  7897    Latimeria_chalumnae 102355121   LOC102355121    melanin-concentrating_hormone_receptor_1-like   1369192 1370937 +       NW_005819398.1  XP_005997527.1
    XP_005997536.1  7897    Latimeria_chalumnae 102355380   B9D1    B9_protein_domain_1 1397029 1461568 -       NW_005819398.1  XP_005997528.1|XP_005997529.1
    XP_005997536.1  7897    Latimeria_chalumnae 102359147   LOC102359147    phosphoserine_aminotransferase-like 1152    12401   -       NW_005819400.1  XP_005997543.2
    XP_005997536.1  7897    Latimeria_chalumnae 102360660   LOC102360660    phosphoserine_aminotransferase-like 15454   41930   -       NW_005819400.1  XP_014344787.1
    XP_005997543.2  7897    Latimeria_chalumnae 102358598   LOC102358598    peroxiredoxin-2-like    2491    40092   -       NW_005819399.1  XP_014344786.1
    XP_005997543.2  7897    Latimeria_chalumnae 102357443   S1PR5A  sphingosine-1-phosphate_receptor_5a 44525   60866   -       NW_005819399.1  XP_005997536.1
    XP_005997543.2  7897    Latimeria_chalumnae 102357702   ZBTB40  zinc_finger_and_BTB_domain_containing_40    331820  386431  -       NW_005819399.1  XP_014344780.1|XP_014344781.1|XP_014344782.1
    XP_005997543.2  7897    Latimeria_chalumnae 102358331   WNT4    Wnt_family_member_4 1146510 1165850 +       NW_005819399.1  XP_014344784.1
    XP_005997543.2  7897    Latimeria_chalumnae 102358869   CDC42   cell_division_cycle_42  1367899 1453373 -       NW_005819399.1  XP_014344785.1
    

    I'm assuming above that each target id from the first file only appears once in the second file.

    I'm printing the value from the first file before each matching line from the second file since you didn't show how else to produce output for multiple matching first-file strings - do whatever you like to produce different output.

    I'm also employing a Decorate-Sort-Undecorate idiom to get the output grouped by target id while also sorted in the original input order per id.

    If you have GNU awk you can use it's builtin ARGIND variable instead of manually populating/using argind as I do above.