awk, text-processing

Count in specific way in awk


I have a problem. This is a small fragment of my input file:

SOL168 MGD750
SOL259 MGD11
SOL363 MGD38
SOL168 MGD142
SOL363 MGD784
SOL660 MGD752
SOL440 MGD38
SOL440 MGD38

I need to count a specific kind of repetition. A SOL counts when it appears in the first column of two different lines and the second column is in the range MGD1-225 on one of those lines and in the range MGD676-900 on the other. For example:

SOL115 MGD201
SOL115 MGD782

This counts as one. Another example:

SOL749 MGD751
SOL749 MGD111

For my input file, I expect the output:

2

because SOL363 has bonds with MGD38 (from the first layer) and also with MGD784 (from the second layer), which is the first vertical water bridge, and

SOL168 has bonds with MGD750 (second layer) and MGD142 (first layer).

It works now. My whole script:

#!/bin/bash
for index in {1..100} # I run this on 100 files, which is why I use a for loop
do
awk '
    BEGIN { FS = "MGD" }
    $2 >= 1 && $2 <= 225 { layer1[$1]++ }
    $2 >= 676 && $2 <= 900 { layer2[$1]++ }
    END {
        for (sql in layer1) {
            if (layer1[sql] == 1 && layer2[sql] == 1)
                ++total
        }
        print total + 0   # force a number so that a file with no matches prints 0
    }
' "eq5_15_333_lipid_sol_fragment_$index.ndx" >> vertical_water_bridges.txt
done

Solution

  • Using MGD as your field separator, $2 becomes the numerical layer indicator and awk can express your problem statement pretty directly:

    BEGIN { FS = "MGD" }
    $2 >= 1 && $2 <= 225 { layer1[$1]++ }
    $2 >= 676 && $2 <= 900 { layer2[$1]++ }
    END {
        total = 0
        for (sql in layer1) {
            if (sql in layer2)
                ++total
        }
        print total
    }
    
    
    $ awk -f a.awk file
    2
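To try this without creating a separate a.awk file, the same program can be wrapped in a small shell function and fed the sample input through a here-document. This is only a sketch, assuming a POSIX shell and awk. The function name `count_bridges` and the loop variable `sol` are made up for this illustration and are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: reads "SOLnnn MGDnnn" lines on standard input and
# prints how many SOL ids occur in both layers (MGD 1-225 and MGD 676-900).
count_bridges() {
    awk '
        BEGIN { FS = "MGD" }
        # With FS = "MGD", $1 is the SOL id (plus a trailing space)
        # and $2 is the MGD number, which awk compares numerically.
        $2 >= 1   && $2 <= 225 { layer1[$1]++ }
        $2 >= 676 && $2 <= 900 { layer2[$1]++ }
        END {
            total = 0
            for (sol in layer1)
                if (sol in layer2)
                    ++total
            print total
        }
    '
}

# Sample input from the question; this call prints 2.
count_bridges <<'EOF'
SOL168 MGD750
SOL259 MGD11
SOL363 MGD38
SOL168 MGD142
SOL363 MGD784
SOL660 MGD752
SOL440 MGD38
SOL440 MGD38
EOF
```

Note that this version counts a SOL as soon as it appears at least once in each layer, whereas the script in the question requires exactly one line per layer. Replacing the `in` test with `layer1[sol] == 1 && layer2[sol] == 1` would restore that stricter rule.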