bash awk sed grep

How to Extract Clusters with Multiple Rows Using Bash Commands?


I'm trying to extract clusters from a text file using Bash commands. Each cluster starts with a header line beginning with >Cluster, followed by one or more data rows. I want to extract only the clusters that contain more than one data row. Here's a simplified example of my input file:

>Cluster 199
0       2599aa, >CAD5117741.1... *
>Cluster 200
0       2579aa, >CAD5112262.1... *
>Cluster 201
0       2578aa, >CAD5116287.1... *
>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%
>Cluster 203
0       2573aa, >CAD5110750.1... *
>Cluster 204
0       2571aa, >CAD5116249.1... *
>Cluster 205
0       2558aa, >CAD5122682.1... *
>Cluster 206
0       2553aa, >CAD5126525.1... *
>Cluster 207
0       2551aa, >CAD5115834.1... *

In this example, I want to extract only Cluster 202 because it has more than one row of data within it. The desired output would be:

>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%

I'm currently using awk to process the file but struggling to figure out how to extract these clusters properly. Can someone guide me in accomplishing this task efficiently using Bash commands?

I attempted to use the following awk command:

awk '/^>Cluster/ {cluster=$0; count=0; next} {count++} count > 1 {print cluster; print} count == 0 {print cluster}'

When applied to the provided data, it produced the following output:

>Cluster 202 2 2369aa, >CAD5122866.1... at 100.00%

This output is incomplete, as it should include all lines within Cluster 202.


Solution

  • Here is a simple working script (tested; it uses plain Bash rather than awk):

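    #!/bin/bash
    
    # Print every cluster that contains more than one data row.
    process_cluster() {
        local input_file=$1
        local current_cluster=""
        local data_rows=0
        local line
    
        # IFS= and -r keep each line exactly as written; the || test also
        # picks up a final line that lacks a trailing newline.
        while IFS= read -r line || [[ -n $line ]]; do
            if [[ $line == ">Cluster "* ]]; then
                # A new header ends the previous cluster: print it if it qualifies.
                if (( data_rows > 1 )); then
                    printf '%s\n' "$current_cluster"
                fi
    
                current_cluster=$line
                data_rows=0
            else
                # Buffer the data row together with its cluster header.
                data_rows=$((data_rows + 1))
                current_cluster+=$'\n'"$line"
            fi
        done < "$input_file"
    
        # The last cluster has no following header, so check it here.
        if (( data_rows > 1 )); then
            printf '%s\n' "$current_cluster"
        fi
    }
    
    process_cluster clusters.txt > output.txt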
    

    Save your input data in a file named clusters.txt, or change the file name in the last line of the script.
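
Since the question asked about awk, the same buffering idea can also be sketched there. The attempt in the question cannot print a whole cluster because, by the time count > 1 becomes true, the first data row has already gone by without being stored. The version below is a minimal sketch, not a definitive implementation: it keeps each cluster in a buffer and prints it only once the next header (or the end of the input) shows that it held more than one data row. The names extract_multi_row_clusters and emit are made up for illustration, and the sketch assumes every header line starts with >Cluster as in the sample data.

```shell
#!/bin/bash

# Hypothetical helper: print only the clusters with more than one data row.
# Reads the files given as arguments, or standard input if there are none.
extract_multi_row_clusters() {
    awk '
        # Print the buffered cluster if it collected more than one data row.
        function emit() { if (rows > 1) printf "%s", buf }

        # A header line closes the previous cluster and starts a new buffer.
        /^>Cluster/ { emit(); buf = $0 ORS; rows = 0; next }

        # Any other line is a data row of the current cluster.
        { buf = buf $0 ORS; rows++ }

        # The last cluster has no following header, so check it at the end.
        END { emit() }
    ' "$@"
}
```

It can be called the same way as the Bash function above, for example: extract_multi_row_clusters clusters.txt > output.txt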