I'm working on a project where I need to extract specific gene neighborhood information from a dataset based on protein accession IDs. I have a series of .acc files containing protein accession IDs, and I need to perform the following tasks:
7897 Latimeria_chalumnae 102356403 FLII FLII_actin_remodeling_protein 275088 356473 - NW_005819398.1 XP_014344771.1
7897 Latimeria_chalumnae 102352677 MIEF2 mitochondrial_elongation_factor_2 356900 369656 + NW_005819398.1 XP_005997516.1|XP_005997517.1|XP_014344772.1
7897 Latimeria_chalumnae 102352038 TOP3A DNA_topoisomerase_III_alpha 381055 421312 - NW_005819398.1 XP_005997514.1|XP_014344773.1
7897 Latimeria_chalumnae 102351764 SMCR8A Smith-Magenis_syndrome_chromosome_region,_candidate_8a 421448 432658 + NW_005819398.1 XP_005997512.1
7897 Latimeria_chalumnae 102353118 SHMT1 serine_hydroxymethyltransferase_1_(soluble) 453487 501768 - NW_005819398.1 XP_005997518.2
7897 Latimeria_chalumnae 102353382 PRPSAP2 phosphoribosyl_pyrophosphate_synthetase-associated_protein_2 511591 561160 + NW_005819398.1 XP_005997519.1|XP_005997520.1|XP_005997521.1|XP_005997522.1|XP_014344774.1|XP_014344775.1
7897 Latimeria_chalumnae 106703785 SLC5A10 solute_carrier_family_5_member_10 577897 737478 + NW_005819398.1 XP_014344776.1
7897 Latimeria_chalumnae 102354157 FAM83GA family_with_sequence_similarity_83_member_Ga 653729 709964 - NW_005819398.1 XP_005997523.1
7897 Latimeria_chalumnae 102356925 LOC102356925 uncharacterized_protein_KIAA0195-like 818471 943407 + NW_005819398.1 XP_014344777.1
7897 Latimeria_chalumnae 102357171 LOC102357171 uncharacterized_protein_KIAA0195-like 971087 1041031 + NW_005819398.1 XP_014344778.1
7897 Latimeria_chalumnae 106703788 LOC106703788 zinc_finger_MYM-type_protein_1-like 1095688 1096413 - NW_005819398.1 XP_014344779.1
7897 Latimeria_chalumnae 102354700 EPN2 epsin_2 1150804 1258003 + NW_005819398.1 XP_005997525.1|XP_005997526.1
7897 Latimeria_chalumnae 102355121 LOC102355121 melanin-concentrating_hormone_receptor_1-like 1369192 1370937 + NW_005819398.1 XP_005997527.1
7897 Latimeria_chalumnae 102355380 B9D1 B9_protein_domain_1 1397029 1461568 - NW_005819398.1 XP_005997528.1|XP_005997529.1
7897 Latimeria_chalumnae 102358598 LOC102358598 peroxiredoxin-2-like 2491 40092 - NW_005819399.1 XP_014344786.1
7897 Latimeria_chalumnae 102357443 S1PR5A sphingosine-1-phosphate_receptor_5a 44525 60866 - NW_005819399.1 XP_005997536.1
7897 Latimeria_chalumnae 102357702 ZBTB40 zinc_finger_and_BTB_domain_containing_40 331820 386431 - NW_005819399.1 XP_014344780.1|XP_014344781.1|XP_014344782.1
7897 Latimeria_chalumnae 102358331 WNT4 Wnt_family_member_4 1146510 1165850 + NW_005819399.1 XP_014344784.1
7897 Latimeria_chalumnae 102358869 CDC42 cell_division_cycle_42 1367899 1453373 - NW_005819399.1 XP_014344785.1
7897 Latimeria_chalumnae 102359147 LOC102359147 phosphoserine_aminotransferase-like 1152 12401 - NW_005819400.1 XP_005997543.2
7897 Latimeria_chalumnae 102360660 LOC102360660 phosphoserine_aminotransferase-like 15454 41930 - NW_005819400.1 XP_014344787.1
7897 Latimeria_chalumnae 102360916 CEP78 centrosomal_protein_78 69138 156223 - NW_005819400.1 XP_005997550.1
7897 Latimeria_chalumnae 102359416 LOC102359416 guanine_nucleotide-binding_protein_G(q)_subunit_alpha 239201 464669 + NW_005819400.1 XP_005997544.1|XP_005997545.1
7897 Latimeria_chalumnae 102359865 LOC102359865 guanine_nucleotide-binding_protein_subunit_alpha-14-like 532847 621010 + NW_005819400.1 XP_005997546.2
7897 Latimeria_chalumnae 102360136 LOC102360136 guanine_nucleotide-binding_protein_subunit_alpha-14 728622 742912 + NW_005819400.1 XP_005997547.1
7897 Latimeria_chalumnae 102361174 VPS13A vacuolar_protein_sorting_13_homolog_A 760611 959552 - NW_005819400.1 XP_014344789.1
7897 Latimeria_chalumnae 102361432 LOC102361432 C2_calcium-dependent_domain-containing_protein_4C-like 971747 987261 - NW_005819400.1 XP_014344790.1
7897 Latimeria_chalumnae 102360402 FOXB2 forkhead_box_B2 1123429 1124373 - NW_005819400.1 XP_005997548.1
7897 Latimeria_chalumnae 106703789 LOC106703789 protein_prune_homolog_2-like 1225384 1483856 + NW_005819400.1 XP_014344791.1
7897 Latimeria_chalumnae 102361698 TBC1D4 TBC1_domain_family,_member_4 313607 553482 - NW_005819401.1 XP_005997553.1|XP_005997555.1|XP_005997556.1|XP_005997561.1|XP_014344792.1|XP_014344793.1|XP_014344794.1
7897 Latimeria_chalumnae 102367238 COMMD6 COMM_domain_containing_6 567595 585926 - NW_005819401.1 XP_014344795.1
Example
Suppose I have a protein accession 'XP_005997536.1' in the .acc file and I use grep to extract that row. The relevant lines surrounding this accession might look like this:
7897 Latimeria_chalumnae 106703788 LOC106703788 zinc_finger_MYM-type_protein_1-like 1095688 1096413 - NW_005819398.1 XP_014344779.1
7897 Latimeria_chalumnae 102354700 EPN2 epsin_2 1150804 1258003 + NW_005819398.1 XP_005997525.1|XP_005997526.1
7897 Latimeria_chalumnae 102355121 LOC102355121 melanin-concentrating_hormone_receptor_1-like 1369192 1370937 + NW_005819398.1 XP_005997527.1
7897 Latimeria_chalumnae 102355380 B9D1 B9_protein_domain_1 1397029 1461568 - NW_005819398.1 XP_005997528.1|XP_005997529.1
7897 Latimeria_chalumnae 102358598 LOC102358598 peroxiredoxin-2-like 2491 40092 - NW_005819399.1 XP_014344786.1
**7897 Latimeria_chalumnae 102357443 S1PR5A sphingosine-1-phosphate_receptor_5a 44525 60866 - NW_005819399.1 XP_005997536.1**
7897 Latimeria_chalumnae 102357702 ZBTB40 zinc_finger_and_BTB_domain_containing_40 331820 386431 - NW_005819399.1 XP_014344780.1|XP_014344781.1|XP_014344782.1
7897 Latimeria_chalumnae 102358331 WNT4 Wnt_family_member_4 1146510 1165850 + NW_005819399.1 XP_014344784.1
7897 Latimeria_chalumnae 102358869 CDC42 cell_division_cycle_42 1367899 1453373 - NW_005819399.1 XP_014344785.1
7897 Latimeria_chalumnae 102359147 LOC102359147 phosphoserine_aminotransferase-like 1152 12401 - NW_005819400.1 XP_005997543.2
7897 Latimeria_chalumnae 102360660 LOC102360660 phosphoserine_aminotransferase-like 15454 41930 - NW_005819400.1 XP_014344787.1
From this, I want to filter out any lines that do not contain NW_005819399.1 in the 9th column.
What I've Tried
I attempted to write a bash script to achieve this, but I'm running into issues with filtering the results correctly and ensuring that the output format remains clean. Here's a snippet of my current approach:
#!/usr/bin/bash
# Function to process each line of the file
process_line() {
local line="$1"
local file="$2"
local index="$3"
# Extract the protein accession (10th column) and genomic accession (9th column)
local protein_accession=$(echo "$line" | awk '{print $10}')
local genomic_accession=$(echo "$line" | awk '{print $9}')
# Get the surrounding lines based on the protein accession
local temp_lines=$(grep -w -m 1 -A5 -B5 "$protein_accession" gene_neighborhood.final)
# Save the surrounding lines to a temporary file
echo "$temp_lines" > "/tmp/temp_neighborhood_$index.txt"
# Filter the lines based on genomic accession presence
awk -v acc="$genomic_accession" '$9 == acc' /tmp/temp_neighborhood_$index.txt >> "${file}.${index}.gene.neighborhood"
}
# Function to process each file
process_file() {
local file="$1"
local index=1
while IFS= read -r line; do
process_line "$line" "$file" "$index"
((index++))
done < "$file"
# Clean up temporary files
rm /tmp/temp_neighborhood_*.txt
}
# Main script
for file in *.acc; do
process_file "$file"
done
What I was expecting:
7897 Latimeria_chalumnae 102358598 LOC102358598 peroxiredoxin-2-like 2491 40092 - NW_005819399.1 XP_014344786.1
7897 Latimeria_chalumnae 102357443 S1PR5A sphingosine-1-phosphate_receptor_5a 44525 60866 - NW_005819399.1 XP_005997536.1
7897 Latimeria_chalumnae 102357702 ZBTB40 zinc_finger_and_BTB_domain_containing_40 331820 386431 - NW_005819399.1 XP_014344780.1|XP_014344781.1|XP_014344782.1
7897 Latimeria_chalumnae 102358331 WNT4 Wnt_family_member_4 1146510 1165850 + NW_005819399.1 XP_014344784.1
7897 Latimeria_chalumnae 102358869 CDC42 cell_division_cycle_42 1367899 1453373 - NW_005819399.1 XP_014344785.1
What I got as an output:
XP_005997536.1
Can anyone suggest a more reliable approach or provide a full example of how to implement this in bash? I’m looking for clear steps, any potential pitfalls, and best practices for processing text files in this manner.
From your description of your requirements it sounds like this is what you're trying to do, using a 2-pass approach with any awk:
$ awk '
NR==FNR { if (/XP_005997536\.1/) {beg=NR-5; end=NR+5; bad=$9} next }
(beg <= FNR) && (FNR <= end) && ($9 != bad)
' file file
7897 Latimeria_chalumnae 106703788 LOC106703788 zinc_finger_MYM-type_protein_1-like 1095688 1096413 - NW_005819398.1 XP_014344779.1
7897 Latimeria_chalumnae 102354700 EPN2 epsin_2 1150804 1258003 + NW_005819398.1 XP_005997525.1|XP_005997526.1
7897 Latimeria_chalumnae 102355121 LOC102355121 melanin-concentrating_hormone_receptor_1-like 1369192 1370937 + NW_005819398.1 XP_005997527.1
7897 Latimeria_chalumnae 102355380 B9D1 B9_protein_domain_1 1397029 1461568 - NW_005819398.1 XP_005997528.1|XP_005997529.1
7897 Latimeria_chalumnae 102359147 LOC102359147 phosphoserine_aminotransferase-like 1152 12401 - NW_005819400.1 XP_005997543.2
7897 Latimeria_chalumnae 102360660 LOC102360660 phosphoserine_aminotransferase-like 15454 41930 - NW_005819400.1 XP_014344787.1
but that doesn't produce the expected output you posted so something isn't clear about your requirements for this part of your problem. Your example is also missing 1 of the 2 types of input file you mention so there's no point me trying to investigate the differences yet BUT I think this is what you're really trying to do overall (foo.acc
contains the target list of ids, one per line, and bar.gen
is the sample input file provided in the question):
$ head foo.acc bar.gen
==> foo.acc <==
XP_005997536.1
XP_005997543.2
==> bar.gen <==
7897 Latimeria_chalumnae 102356403 FLII FLII_actin_remodeling_protein 275088 356473 - NW_005819398.1 XP_014344771.1
7897 Latimeria_chalumnae 102352677 MIEF2 mitochondrial_elongation_factor_2 356900 369656 + NW_005819398.1 XP_005997516.1|XP_005997517.1|XP_014344772.1
7897 Latimeria_chalumnae 102352038 TOP3A DNA_topoisomerase_III_alpha 381055 421312 - NW_005819398.1 XP_005997514.1|XP_014344773.1
7897 Latimeria_chalumnae 102351764 SMCR8A Smith-Magenis_syndrome_chromosome_region,_candidate_8a 421448 432658 + NW_005819398.1 XP_005997512.1
7897 Latimeria_chalumnae 102353118 SHMT1 serine_hydroxymethyltransferase_1_(soluble) 453487 501768 - NW_005819398.1 XP_005997518.2
7897 Latimeria_chalumnae 102353382 PRPSAP2 phosphoribosyl_pyrophosphate_synthetase-associated_protein_2 511591 561160 + NW_005819398.1 XP_005997519.1|XP_005997520.1|XP_005997521.1|XP_005997522.1|XP_014344774.1|XP_014344775.1
7897 Latimeria_chalumnae 106703785 SLC5A10 solute_carrier_family_5_member_10 577897 737478 + NW_005819398.1 XP_014344776.1
7897 Latimeria_chalumnae 102354157 FAM83GA family_with_sequence_similarity_83_member_Ga 653729 709964 - NW_005819398.1 XP_005997523.1
7897 Latimeria_chalumnae 102356925 LOC102356925 uncharacterized_protein_KIAA0195-like 818471 943407 + NW_005819398.1 XP_014344777.1
7897 Latimeria_chalumnae 102357171 LOC102357171 uncharacterized_protein_KIAA0195-like 971087 1041031 + NW_005819398.1 XP_014344778.1
$ cat tst.sh
#!/usr/bin/env bash
awk -v OFS='\t' '
FNR == 1 {
argind++
}
argind == 1 {
tgt_accs[$1]
}
argind == 2 {
split($10,cur_accs,"|")
for (i in cur_accs) {
acc = cur_accs[i]
if ( acc in tgt_accs ) {
begs[acc] = FNR - 5
ends[acc] = FNR + 5
bads[acc] = $9
}
}
}
argind == 3 {
for ( acc in begs ) {
beg = begs[acc]
end = ends[acc]
bad = bads[acc]
if ( (beg <= FNR) && (FNR <= end) && ($9 != bad) ) {
print FNR, acc, $0
}
}
}
' "$1" "$2" "$2" |
sort -k2,2 -k1,1n |
cut -f2-
$ ./tst.sh foo.acc bar.gen
XP_005997536.1 7897 Latimeria_chalumnae 106703788 LOC106703788 zinc_finger_MYM-type_protein_1-like 1095688 1096413 - NW_005819398.1 XP_014344779.1
XP_005997536.1 7897 Latimeria_chalumnae 102354700 EPN2 epsin_2 1150804 1258003 + NW_005819398.1 XP_005997525.1|XP_005997526.1
XP_005997536.1 7897 Latimeria_chalumnae 102355121 LOC102355121 melanin-concentrating_hormone_receptor_1-like 1369192 1370937 + NW_005819398.1 XP_005997527.1
XP_005997536.1 7897 Latimeria_chalumnae 102355380 B9D1 B9_protein_domain_1 1397029 1461568 - NW_005819398.1 XP_005997528.1|XP_005997529.1
XP_005997536.1 7897 Latimeria_chalumnae 102359147 LOC102359147 phosphoserine_aminotransferase-like 1152 12401 - NW_005819400.1 XP_005997543.2
XP_005997536.1 7897 Latimeria_chalumnae 102360660 LOC102360660 phosphoserine_aminotransferase-like 15454 41930 - NW_005819400.1 XP_014344787.1
XP_005997543.2 7897 Latimeria_chalumnae 102358598 LOC102358598 peroxiredoxin-2-like 2491 40092 - NW_005819399.1 XP_014344786.1
XP_005997543.2 7897 Latimeria_chalumnae 102357443 S1PR5A sphingosine-1-phosphate_receptor_5a 44525 60866 - NW_005819399.1 XP_005997536.1
XP_005997543.2 7897 Latimeria_chalumnae 102357702 ZBTB40 zinc_finger_and_BTB_domain_containing_40 331820 386431 - NW_005819399.1 XP_014344780.1|XP_014344781.1|XP_014344782.1
XP_005997543.2 7897 Latimeria_chalumnae 102358331 WNT4 Wnt_family_member_4 1146510 1165850 + NW_005819399.1 XP_014344784.1
XP_005997543.2 7897 Latimeria_chalumnae 102358869 CDC42 cell_division_cycle_42 1367899 1453373 - NW_005819399.1 XP_014344785.1
I'm assuming above that each target id from the first file only appears once in the second file.
I'm printing the value from the first file before each matching line from the second file since you didn't show how else to produce output for multiple matching first-file strings - do whatever you like to produce different output.
I'm also employing a Decorate-Sort-Undecorate idiom to get the output grouped by target id while also sorted in the original input order per id.
If you have GNU awk you can use it's builtin ARGIND
variable instead of manually populating/using argind
as I do above.