I have a cluster file for sequences clustering at 100% sequence identity, each containing sequence clusters denoted by cluster numbers followed by IDs. Here's an example of the file format:
>Cluster 94
0 416aa, >crgi_XP_019918436.1... *
>Cluster 95
0 415aa, >TRINITY_DN19027_c1_g1_i3_77... *
>Cluster 96
0 415aa, >TRINITY_DN31043_c0_g1_i1_68... *
>Cluster 97
0 414aa, >TRINITY_DN4649_c0_g1_i1_24... *
1 414aa, >TRINITY_DN4649_c0_g1_i2_29... at 100.00%
2 414aa, >TRINITY_DN4649_c0_g1_i3_23... at 100.00%
3 414aa, >TRINITY_DN4649_c0_g1_i4_23... at 100.00%
4 414aa, >TRINITY_DN4649_c0_g1_i5_27... at 100.00%
>Cluster 98
0 414aa, >TRINITY_DN11838_c0_g1_i1_77... *
>Cluster 99
0 414aa, >TRINITY_DN13974_c0_g1_i1_33... *
1 414aa, >TRINITY_DN13974_c0_g1_i3_29... at 100.00%
2 414aa, >TRINITY_DN13974_c0_g1_i4_34... at 100.00%
>Cluster 100
0 414aa, >TRINITY_DN15453_c0_g1_i1_84... *
>Cluster 101
0 414aa, >TRINITY_DN23802_c0_g1_i1_50... *
>Cluster 102
0 414aa, >TRINITY_DN24921_c0_g1_i1_102... *
>Cluster 103
0 414aa, >TRINITY_DN26456_c0_g1_i1_61... *
>Cluster 104
0 414aa, >TRINITY_DN49299_c0_g1_i1_30... *
>Cluster 105
0 412aa, >TRINITY_DN13974_c0_g1_i2_27... *
>Cluster 106
0 411aa, >TRINITY_DN5517_c0_g1_i1_28... *
1 411aa, >TRINITY_DN5517_c0_g1_i2_35... at 100.00%
>Cluster 107
0 410aa, >TRINITY_DN9528_c0_g1_i1_30... *
1 410aa, >TRINITY_DN9528_c0_g1_i2_36... at 100.00%
2 404aa, >crgi_XP_011414097.1... at 100.00%
>Cluster 108
0 410aa, >TRINITY_DN11082_c0_g1_i1_69... *
1 410aa, >TRINITY_DN11082_c0_g1_i2_69... at 100.00%
>Cluster 109
0 410aa, >crgi_XP_011450995.2... *
>Cluster 110
0 407aa, >TRINITY_DN4674_c0_g1_i3_24... *
>Cluster 111
0 407aa, >TRINITY_DN11077_c0_g2_i1_64... *
>Cluster 112
0 407aa, >TRINITY_DN11838_c0_g1_i2_78... *
>Cluster 113
0 407aa, >TRINITY_DN11978_c0_g1_i1_33... *
>Cluster 114
0 406aa, >TRINITY_DN26600_c0_g1_i1_86... *
>Cluster 115
0 406aa, >TRINITY_DN50774_c0_g1_i1_65... *
>Cluster 116
0 406aa, >crgi_XP_034333283.1... *
>Cluster 117
0 405aa, >TRINITY_DN24741_c0_g1_i1_31... *
>Cluster 118
0 405aa, >crgi_XP_034330996.1... *
>Cluster 119
0 404aa, >TRINITY_DN55849_c0_g1_i1_21... *
>Cluster 120
0 403aa, >TRINITY_DN1565_c0_g1_i18_115... *
1 403aa, >TRINITY_DN1565_c0_g1_i8_114... at 100.00%
>Cluster 121
0 402aa, >TRINITY_DN31890_c0_g1_i1_66... *
>Cluster 122
0 401aa, >TRINITY_DN444_c1_g2_i1_94... *
1 401aa, >TRINITY_DN444_c1_g2_i2_93... at 100.00%
>Cluster 123
0 401aa, >TRINITY_DN9477_c0_g1_i1_50... *
1 401aa, >TRINITY_DN9477_c0_g1_i2_50... at 100.00%
>Cluster 124
0 401aa, >TRINITY_DN13005_c0_g1_i3_78... *
>Cluster 125
0 401aa, >TRINITY_DN15453_c0_g1_i3_151... *
>Cluster 126
0 401aa, >TRINITY_DN36984_c0_g1_i1_72... *
>Cluster 127
0 400aa, >TRINITY_DN4263_c0_g1_i1_28... *
>Cluster 128
0 399aa, >TRINITY_DN1565_c0_g1_i27_112... *
>Cluster 129
0 399aa, >TRINITY_DN1565_c0_g1_i39_125... *
>Cluster 130
0 399aa, >TRINITY_DN10913_c0_g1_i1_62... *
>Cluster 131
0 399aa, >TRINITY_DN15453_c0_g1_i3_127... *
>Cluster 132
0 399aa, >TRINITY_DN19187_c0_g1_i1_77... *
>Cluster 133
0 399aa, >TRINITY_DN24227_c0_g1_i1_70... *
>Cluster 134
0 398aa, >TRINITY_DN31271_c0_g1_i1_85... *
>Cluster 135
0 398aa, >crgi_XP_034303199.1... *
>Cluster 136
0 397aa, >TRINITY_DN1856_c0_g1_i1_195... *
1 397aa, >TRINITY_DN1856_c0_g1_i2_125... at 100.00%
2 397aa, >TRINITY_DN1856_c0_g1_i3_143... at 100.00%
3 397aa, >TRINITY_DN1856_c0_g1_i4_126... at 100.00%
>Cluster 137
0 397aa, >TRINITY_DN7611_c0_g1_i1_86... *
>Cluster 138
0 397aa, >TRINITY_DN16851_c0_g2_i1_78... *
>Cluster 139
0 397aa, >TRINITY_DN22688_c0_g1_i1_21... *
>Cluster 140
0 397aa, >crgi_XP_011418396.2... *
>Cluster 141
0 397aa, >crgi_XP_011412408.2... *
>Cluster 142
0 396aa, >TRINITY_DN12414_c0_g1_i2_20... *
1 396aa, >TRINITY_DN12414_c0_g1_i9_20... at 100.00%
>Cluster 143
0 396aa, >TRINITY_DN13115_c0_g2_i1_118... *
>Cluster 144
0 396aa, >TRINITY_DN27728_c0_g1_i1_30... *
>Cluster 145
0 396aa, >crgi_XP_034332840.1... *
>Cluster 146
0 395aa, >TRINITY_DN1496_c1_g2_i1_79... *
>Cluster 147
0 395aa, >TRINITY_DN1790_c0_g2_i9_100... *
>Cluster 148
0 395aa, >TRINITY_DN8122_c0_g1_i1_129... *
I want to write a bash script that can extract clusters containing a specific string in the ID, but only if the cluster has other sequences besides the one with the specified string. For example, if I input the string "crgi", the script should fetch only the clusters with this string in the ID, but not if it's the only sequence in the cluster.
Here's an example of the expected output for the input string "crgi":
Clusters in file1.ids containing 'crgi':
107
I've tried using grep, awk, and cut, but I'm having trouble extracting the desired clusters efficiently.
Could someone please provide guidance on how to write such a bash script to achieve this task efficiently? Any help would be greatly appreciated! Thank you.
I've tried with the following script but it didn't work:
#!/bin/bash
# Define the search string
search_string="crgi"
# Loop through each file
for file in *.ids; do
# Extract cluster numbers containing the search string
clusters=$(awk -v search="$search_string" '$0 ~ search {print $1}' "$file")
# Initialize a flag for presence of other sequences
other_sequences_found=false
# Check if clusters contain other sequences
while read -r cluster; do
# Check if the cluster contains other sequences besides the search string
if [ "$(grep -c ">$search_string" "$file")" -gt 1 ]; then
other_sequences_found=true
break
fi
done <<< "$clusters"
# If other sequences found, print the cluster numbers
if [ "$other_sequences_found" = true ]; then
echo "Clusters in $file containing '$search_string':"
echo "$clusters"
echo
fi
done
Here's a shell script that processes the input line by line:
#!/bin/bash
search=$1
lines=()
process_lines() {
if [[ ${#lines[@]} -le 2 ]] ; then
return
fi
found=""
local line
for line in "${lines[@]:1}" ; do
if [[ $line = *"$search"* ]] ; then
found=1
fi
done
if [[ $found ]] ; then
echo "${lines[0]#>Cluster }"
fi
}
printf "Clusters containing '%s':\n" "$search"
while read line ; do
if [[ $line = '>'Cluster* ]]; then
process_lines
lines=()
fi
lines+=("$line")
done
process_lines