Search code examples
parsingcsvawkcriteria

AWK - loop through a large CSV file and print a new CSV containing records that pass criteria


I am trying to parse a large CSV (about 90 MB) that has a header section (130 rows, single column). Line 131 is the column header that has names for individual column (20 columns in total).

I don't need to parse the first 9 columns, the data beginning from column #10 through colum#20 are in the following format (reformatted as column for easier view, but each row below is actually in an individual column in the CSV)

0/0:3,0:9:0,9,90

./.

0/1:6,3:49:73,0,111

0/1:13,3:99:103,0,254

0/1:6,6:45:56,3,124

0/1:2,2:34:43,0,51

0/1:80,0:87:252,5,1882

0/1:25,12:99:139,286,3281

./.

0/1:6,8:64:64,0,130

0/1:4,5:65:109,0,114

Take this "0/0:3,0:9:0,9,90" as an example, the criteria is: if the two numbers (3,0) as highlighted add up to >=20 AND the second number (0) is greater or equal to 10, then print it out to new CSV. If does not pass the criteria, just print "./." in the new CSV. The cells that have "./." already will remain "./." in the new CSV.

I have asked a part of the question in this thread use awk to process a csv (tab delimited) line by line and, thanks to Kent, I now know how to apply the criteria to a single cell, but I need to loop through 10 columns and apply the same criteria. Below is what I have come up with, but apparently, I need more help with finishing this. Any help/suggestion will be appreciated.

BEGIN {
-F'\t'; -v OFS='\t'

for (i=10; i<=20; i++)   
 {
  t=$i; 
  split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
  }

}

Solution

  • It would help provide an exact answer if your question included example output from the given input. But overall, you're very close to your solution ... except ... (t,a,/[:,]/)}? where's the opening '{' char? You have an imbalance of { and }s in your code. You should be getting an error message and depending on which awk, it will most likely point you to what it can't parse.

    BEGIN { -F"\t"; OFS="\t" }
    # -----------missing ----^--
    # awk doesn't like '\t' quoting, use "\t"
    # the -v is only for the command line, not in a BEGIN block
    
    # awk executes whatever is inside a { } block like illustrated below
    # for each line of data in the file
    # 
    # not clear if this is what you want
    {
    # --- note, missing opening { char
    
        for (i=10; i<=20; i++) {
          t=$i; 
          #old ver split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
          # new
          split(t,a,/[:,]/)
           if ( (a[2]+a[3])>=20 && a[3]>=10 ) {
              printf("%s\t", $i)
           }  # end if
         }    # end for i
         printf("\n")
    }         # end awk loop 
              # that processes each line
              # of data from a a file    
    

    To print just part of a line, as I understand your question, you'll have to rely on awk more granular printf("%s", "string") functionality. Note that I'm passing $i as the parameter to print, you may need to use one of your array variables a[2] or whatever.

    IHTH