Search code examples
awkash

reset NR in awk


cat file.txt

MNS GYPA*N  
MNS GYPA*M  c.59T>C;c.71A>G;c.72G>T
MNS GYPA*Mc c.71G>A;c.72T>G
MNS GYPA*Vw c.140C>T
MNS GYPA*Mg c.68C>A
MNS GYPA*Vr c.197C>A
MNS GYPB*Mta    c.230C>T
MNS GYPB*Ria    c.226G>A
MNS GYPB*Nya    c.138T>A
MNS GYPA*Hut    c.140C>A
.
.
.

the second column values could start with GYPA,GYPB,GYPC,GYPD, ... GYPZ. I would like to set a position count for each GYP* and split the third column as follows:

1   MNS  GYPA*N
2   MNS GYPA*M  c.59T>C
2   MNS GYPA*M  c.71A>G
2   MNS GYPA*M  c.72G>T
3   MNS GYPA*Mc c.71G>A
3   MNS GYPA*Mc c.72T>G
4   MNS GYPA*Vw .140C>T
5   MNS GYPA*Mg c.68C>A
6   MNS GYPA*Vr c.197C>A
1   MNS GYPB*Mta    c.230C>T
2   MNS GYPB*Ria    c.226G>A
3   MNS GYPB*Nya    c.138T>A
4   MNS GYPB*Hut    c.140C>A
.
.
.

cat format.awk

BEGIN {FS=OFS="\t"}

$2 ~ /GYPA/
   { num=split($3,arr,/;/);
      for (i=1;i<=num;i++)
         { print NR,$1,$2,arr[i]}}

$2 ~ /GYPB/
   { num=split($3,arr,/;/);
      for (i=1;i<=num;i++)
         { print NR,$1,$2,arr[i]} }
...

I am not sure how to reset NR when it reaches the the next ~ GYP. The GYP{A..Z} are in order from A to Z.


Solution

  • awk '
    {
      match($2,/[^*]*/)
      gy_value=substr($2,RSTART,RLENGTH)
    }
    gy_value!=prev_gy_value{
      count=0
    }
    !arr[$2]++{
      count++
    }
    {
      num=split($3,array,";")
      for(i=1;i<=num;i++){
        print count,$1,$2,array[i]
      }
    }
    NF<3;
    {
      prev_gy_value=gy_value
    }
    ' file.txt
    

    Explanation: Adding a detailed explanation for above code.

    awk '                                   ##Starting awk program from here.
    {
      match($2,/[^*]*/)                     ##Using match function to match till * in 2nd field.
      gy_value=substr($2,RSTART,RLENGTH)    ##Creating variable gy_value which has sub-string of 2nd field sub-string in it.
    }
    gy_value!=prev_gy_value{                
      count=0                               ##Creating variable count as 0 here.
    }
    {
      count++                               ##Increasing value of count with 1 here.
    }
    {
      num=split($3,array,";")               ##Splitting 3rd field into an array with delimiter ; and its count is stored into num variable.
      for(i=1;i<=num;i++){                  ##Starting for loop from i=1 to till value of num here.
        print count,$1,$2,array[i]                ##Printing value of $1,$2 and array with index variable i here.
      }
    }
    NF<3;                                   ##Checking condition if NF<3 then print the line here.
    {
      prev_gy_value=gy_value                ##Setting value of variable gy_value to variable named prev_gy_value here(which is used above code to make sure about values check).
    }
    '  Input_file                           ##Mentioning Input_file name here.