Search code examples
stringbashgziplarge-datatext-extraction

create new column containing a substring of an existing column in data using bash


I have a large tsv.gz file (40GB) for which I want to extract a string from an existing variable col3, store it in a new variable New_var (placed at the beginning) and save everything the new file. an example of the data "old_file.tsv.gz"

col1  col2  col3  col4
1  positive  12:1234A  100
2  negative  10:9638B  110
3  positive  5:0987A  100
4  positive  8:5678A  170

Desired data "new_file.tsv.gz"

New_var  col1  col2  col3  col4
12  1  positive  12:1234A  100
10  2  negative  10:9638B  110
5  3  positive  5:0987A  100
8  4  positive  8:5678A  170

I am new in bash so I have tried multiple things but I get stuck, I have tried

zcat old_file.tsv.gz | awk '{print New_var=$3,$0 }' | awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'

I think I have multiple problems. {print New_var=$3,$0 } do create a duplicate of col3 but doesn't rename it. Then when I add the last part of the code awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'...well nothing comes up (I tried to look if I forget a parenthesis but cannot find the problem). Also I am not sure if this way is the best way to do it. Any idea how to make it work?


Solution

  • Make an AWK script in a separate file (for readability), say 1.awk:

    { if (NR > 1) { 
        # all data lines 
        split($3, a, ":");  
        print a[1], $1, $3, $3, $4; 
      } else {
        # header line
        print "new_var", $1, $2, $3, $4;
      } 
    }
    

    Now process the input (say 1.csv.gz) with the AWK file:

    zcat 1.csv.gz | awk -f 1.awk | gzip -c > 1_new.csv.gz