Search code examples
bashawkgrep

How to extract sequence cluster numbers containing a specific string in the ID from a cluster file in bash?


I have a cluster file for sequences clustering at 100% sequence identity, each containing sequence clusters denoted by cluster numbers followed by IDs. Here's an example of the file format:

>Cluster 94
0       416aa, >crgi_XP_019918436.1... *
>Cluster 95
0       415aa, >TRINITY_DN19027_c1_g1_i3_77... *
>Cluster 96
0       415aa, >TRINITY_DN31043_c0_g1_i1_68... *
>Cluster 97
0       414aa, >TRINITY_DN4649_c0_g1_i1_24... *
1       414aa, >TRINITY_DN4649_c0_g1_i2_29... at 100.00%
2       414aa, >TRINITY_DN4649_c0_g1_i3_23... at 100.00%
3       414aa, >TRINITY_DN4649_c0_g1_i4_23... at 100.00%
4       414aa, >TRINITY_DN4649_c0_g1_i5_27... at 100.00%
>Cluster 98
0       414aa, >TRINITY_DN11838_c0_g1_i1_77... *
>Cluster 99
0       414aa, >TRINITY_DN13974_c0_g1_i1_33... *
1       414aa, >TRINITY_DN13974_c0_g1_i3_29... at 100.00%
2       414aa, >TRINITY_DN13974_c0_g1_i4_34... at 100.00%
>Cluster 100
0       414aa, >TRINITY_DN15453_c0_g1_i1_84... *
>Cluster 101
0       414aa, >TRINITY_DN23802_c0_g1_i1_50... *
>Cluster 102
0       414aa, >TRINITY_DN24921_c0_g1_i1_102... *
>Cluster 103
0       414aa, >TRINITY_DN26456_c0_g1_i1_61... *
>Cluster 104
0       414aa, >TRINITY_DN49299_c0_g1_i1_30... *
>Cluster 105
0       412aa, >TRINITY_DN13974_c0_g1_i2_27... *
>Cluster 106
0       411aa, >TRINITY_DN5517_c0_g1_i1_28... *
1       411aa, >TRINITY_DN5517_c0_g1_i2_35... at 100.00%
>Cluster 107
0       410aa, >TRINITY_DN9528_c0_g1_i1_30... *
1       410aa, >TRINITY_DN9528_c0_g1_i2_36... at 100.00%
2       404aa, >crgi_XP_011414097.1... at 100.00%
>Cluster 108
0       410aa, >TRINITY_DN11082_c0_g1_i1_69... *
1       410aa, >TRINITY_DN11082_c0_g1_i2_69... at 100.00%
>Cluster 109
0       410aa, >crgi_XP_011450995.2... *
>Cluster 110
0       407aa, >TRINITY_DN4674_c0_g1_i3_24... *
>Cluster 111
0       407aa, >TRINITY_DN11077_c0_g2_i1_64... *
>Cluster 112
0       407aa, >TRINITY_DN11838_c0_g1_i2_78... *
>Cluster 113
0       407aa, >TRINITY_DN11978_c0_g1_i1_33... *
>Cluster 114
0       406aa, >TRINITY_DN26600_c0_g1_i1_86... *
>Cluster 115
0       406aa, >TRINITY_DN50774_c0_g1_i1_65... *
>Cluster 116
0       406aa, >crgi_XP_034333283.1... *
>Cluster 117
0       405aa, >TRINITY_DN24741_c0_g1_i1_31... *
>Cluster 118
0       405aa, >crgi_XP_034330996.1... *
>Cluster 119
0       404aa, >TRINITY_DN55849_c0_g1_i1_21... *
>Cluster 120
0       403aa, >TRINITY_DN1565_c0_g1_i18_115... *
1       403aa, >TRINITY_DN1565_c0_g1_i8_114... at 100.00%
>Cluster 121
0       402aa, >TRINITY_DN31890_c0_g1_i1_66... *
>Cluster 122
0       401aa, >TRINITY_DN444_c1_g2_i1_94... *
1       401aa, >TRINITY_DN444_c1_g2_i2_93... at 100.00%
>Cluster 123
0       401aa, >TRINITY_DN9477_c0_g1_i1_50... *
1       401aa, >TRINITY_DN9477_c0_g1_i2_50... at 100.00%
>Cluster 124
0       401aa, >TRINITY_DN13005_c0_g1_i3_78... *
>Cluster 125
0       401aa, >TRINITY_DN15453_c0_g1_i3_151... *
>Cluster 126
0       401aa, >TRINITY_DN36984_c0_g1_i1_72... *
>Cluster 127
0       400aa, >TRINITY_DN4263_c0_g1_i1_28... *
>Cluster 128
0       399aa, >TRINITY_DN1565_c0_g1_i27_112... *
>Cluster 129
0       399aa, >TRINITY_DN1565_c0_g1_i39_125... *
>Cluster 130
0       399aa, >TRINITY_DN10913_c0_g1_i1_62... *
>Cluster 131
0       399aa, >TRINITY_DN15453_c0_g1_i3_127... *
>Cluster 132
0       399aa, >TRINITY_DN19187_c0_g1_i1_77... *
>Cluster 133
0       399aa, >TRINITY_DN24227_c0_g1_i1_70... *
>Cluster 134
0       398aa, >TRINITY_DN31271_c0_g1_i1_85... *
>Cluster 135
0       398aa, >crgi_XP_034303199.1... *
>Cluster 136
0       397aa, >TRINITY_DN1856_c0_g1_i1_195... *
1       397aa, >TRINITY_DN1856_c0_g1_i2_125... at 100.00%
2       397aa, >TRINITY_DN1856_c0_g1_i3_143... at 100.00%
3       397aa, >TRINITY_DN1856_c0_g1_i4_126... at 100.00%
>Cluster 137
0       397aa, >TRINITY_DN7611_c0_g1_i1_86... *
>Cluster 138
0       397aa, >TRINITY_DN16851_c0_g2_i1_78... *
>Cluster 139
0       397aa, >TRINITY_DN22688_c0_g1_i1_21... *
>Cluster 140
0       397aa, >crgi_XP_011418396.2... *
>Cluster 141
0       397aa, >crgi_XP_011412408.2... *
>Cluster 142
0       396aa, >TRINITY_DN12414_c0_g1_i2_20... *
1       396aa, >TRINITY_DN12414_c0_g1_i9_20... at 100.00%
>Cluster 143
0       396aa, >TRINITY_DN13115_c0_g2_i1_118... *
>Cluster 144
0       396aa, >TRINITY_DN27728_c0_g1_i1_30... *
>Cluster 145
0       396aa, >crgi_XP_034332840.1... *
>Cluster 146
0       395aa, >TRINITY_DN1496_c1_g2_i1_79... *
>Cluster 147
0       395aa, >TRINITY_DN1790_c0_g2_i9_100... *
>Cluster 148
0       395aa, >TRINITY_DN8122_c0_g1_i1_129... *

I want to write a bash script that can extract clusters containing a specific string in the ID, but only if the cluster has other sequences besides the one with the specified string. For example, if I input the string "crgi", the script should fetch only the clusters with this string in the ID, but not if it's the only sequence in the cluster.

Here's an example of the expected output for the input string "crgi":

Clusters in file1.ids containing 'crgi':
107

I've tried using grep, awk, and cut, but I'm having trouble extracting the desired clusters efficiently.

Could someone please provide guidance on how to write such a bash script to achieve this task efficiently? Any help would be greatly appreciated! Thank you.

I've tried with the following script but it didn't work:

#!/bin/bash

# Define the search string
search_string="crgi"

# Loop through each file
for file in *.ids; do
    # Extract cluster numbers containing the search string
    clusters=$(awk -v search="$search_string" '$0 ~ search {print $1}' "$file")

    # Initialize a flag for presence of other sequences
    other_sequences_found=false

    # Check if clusters contain other sequences
    while read -r cluster; do
        # Check if the cluster contains other sequences besides the search string
        if [ "$(grep -c ">$search_string" "$file")" -gt 1 ]; then
            other_sequences_found=true
            break
        fi
    done <<< "$clusters"

    # If other sequences found, print the cluster numbers
    if [ "$other_sequences_found" = true ]; then
        echo "Clusters in $file containing '$search_string':"
        echo "$clusters"
        echo
    fi
done

Solution

  • Here's a shell script that processes the input line by line:

    #!/bin/bash
    search=$1
    lines=()
    
    process_lines() {
        if [[ ${#lines[@]} -le 2 ]] ; then
            return
        fi
        found=""
        local line
        for line in "${lines[@]:1}" ; do
            if [[ $line = *"$search"* ]] ; then
                found=1
            fi
        done
        if [[ $found ]] ; then
            echo "${lines[0]#>Cluster }"
        fi
    }
    
    printf "Clusters containing '%s':\n" "$search"
    while read line ; do
        if [[ $line = '>'Cluster* ]]; then
            process_lines
            lines=()
        fi
        lines+=("$line")
    done
    process_lines