I'm trying to extract clusters from a text file using Bash commands. Each cluster is delineated by a line starting with >Cluster. I want to extract only those clusters with more than one data row within them. Here's a simplified example of my input file:
>Cluster 199
0 2599aa, >CAD5117741.1... *
>Cluster 200
0 2579aa, >CAD5112262.1... *
>Cluster 201
0 2578aa, >CAD5116287.1... *
>Cluster 202
0 2578aa, >CAD5122864.1... *
1 1867aa, >CAD5122865.1... at 100.00%
2 2369aa, >CAD5122866.1... at 100.00%
>Cluster 203
0 2573aa, >CAD5110750.1... *
>Cluster 204
0 2571aa, >CAD5116249.1... *
>Cluster 205
0 2558aa, >CAD5122682.1... *
>Cluster 206
0 2553aa, >CAD5126525.1... *
>Cluster 207
0 2551aa, >CAD5115834.1... *
In this example, I want to extract only Cluster 202 because it has more than one row of data within it. The desired output would be:
>Cluster 202
0 2578aa, >CAD5122864.1... *
1 1867aa, >CAD5122865.1... at 100.00%
2 2369aa, >CAD5122866.1... at 100.00%
I'm currently using awk to process the file but struggling to figure out how to extract these clusters properly. Can someone guide me in accomplishing this task efficiently using Bash commands?
I attempted to use the following awk command:
awk '/^>Cluster/ {cluster=$0; count=0; next} {count++} count > 1 {print cluster; print} count == 0 {print cluster}'
When applied to the provided data, it produced the following output:
>Cluster 202 2 2369aa, >CAD5122866.1... at 100.00%
This output is incomplete, as it should include all lines within Cluster 202.
Here is a simple working script that I just tested; it uses plain Bash rather than awk:
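#!/bin/bash
# Print every cluster that contains more than one data row.
process_cluster() {
    local input_file=$1
    local current_cluster=""
    local data_rows=0
    local line
    # IFS= and -r keep leading whitespace and backslashes in each line intact.
    while IFS= read -r line; do
        if [[ $line == ">Cluster "* ]]; then
            # A new header: flush the previous cluster if it qualifies.
            if (( data_rows > 1 )); then
                printf '%s\n' "$current_cluster"
            fi
            current_cluster=$line
            data_rows=0
        else
            data_rows=$((data_rows + 1))
            current_cluster+=$'\n'$line
        fi
    done < "$input_file"
    # The last cluster has no following header, so check it here.
    if (( data_rows > 1 )); then
        printf '%s\n' "$current_cluster"
    fi
}
process_cluster clusters.txt > output.txt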
Remember to save your input data in a file named clusters.txt, or change the file name in the script above.
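Since the question asks about awk, the same buffering idea can also be sketched in awk: collect each cluster's lines in a variable, count the data rows, and print the buffer only when the count exceeds one. This is a minimal sketch rather than a definitive solution; the function name print_multi_row_clusters and the variables block and rows are names I made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print only clusters that contain more than one data row.
print_multi_row_clusters() {
    awk '
        /^>Cluster/ {
            # A header starts a new cluster: flush the previous one if it qualifies.
            if (rows > 1) printf "%s", block
            block = $0 "\n"
            rows = 0
            next
        }
        {
            # Buffer each data row and count it.
            block = block $0 "\n"
            rows++
        }
        END {
            # The last cluster has no following header, so check it here.
            if (rows > 1) printf "%s", block
        }
    ' "$1"
}

# Example call, assuming the input is saved as clusters.txt:
# print_multi_row_clusters clusters.txt > output.txt
```

The attempt in the question fails because it prints rows as they arrive, after the count already exceeds one, so the first row of the cluster is never printed. Buffering the whole cluster and deciding at the next header (or at the end of the file) avoids that.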