Search code examples
linuxawkterminalgrep

Conditional Editing of Large TSV in Linux


I have many large TSV files that have 6 unnamed columns and around 10 million rows. I need to apply a change on a row-by-row basis and save as a new TSV. I am working on Linux and have access to awk, gawk, grep. Ideally the code would reference column numbers in its operators so it functions regardless of row content. TYIA!

The only changes that need to be made are to Columns 2 & 3, but the change is conditional on the value within Column 6:

If the value of column 6 is "+", then:  
New Column 2: [Column 2 Value]  
New Column 3: [Column 2 Value] + 1  
  
If the value of column 6 is "-", then:  
New Column 2: [Column 3 Value] - 1  
New Column 3: [Column 3 Value]  

Example TSV Data Structure (no | present in actual data)

| AI     | 828     | 878     | ABC4807:78485:5:79215       | 42 | - |
| AI     | 971     | 1021    | ABC248:78485:5:79215:46065  | 42 | + |
| AI     | 1104    | 1153    | X7481:78485:5:79215:40174   | 35 | - |
| XVDIII | 56939   | 56988   | 9478:78485:5:79215:30872    | 42 | - |
| XVDIII | 56971   | 57020   | 7841S:78485:5:79215:34301   | 42 | - |
| UTXV   | 1043196 | 1043246 | T885189:78485:5:79215:10036 | 42 | + |
| UTXV   | 1043198 | 1043248 | C74581:78485:5:79215:10792  | 42 | - |
| UTXV   | 1043201 | 1043250 | T75S17:78485:5:79215:30204  | 42 | - |
| UTXV   | 1043201 | 1043251 | B784W7:78485:5:79215:42548  | 42 | - |

Desired TSV Output

| AI     | 877     | 878     | ABC4807:78485:5:79215       | 42 | - |
| AI     | 971     | 972     | ABC248:78485:5:79215:46065  | 42 | + |
| AI     | 1152    | 1153    | X7481:78485:5:79215:40174   | 35 | - |
| XVDIII | 56987   | 56988   | 9478:78485:5:79215:30872    | 42 | - |
| XVDIII | 57019   | 57020   | 7841S:78485:5:79215:34301   | 42 | - |
| UTXV   | 1043196 | 1043197 | T885189:78485:5:79215:10036 | 42 | + |
| UTXV   | 1043247 | 1043248 | C74581:78485:5:79215:10792  | 42 | - |
| UTXV   | 1043249 | 1043250 | T75S17:78485:5:79215:30204  | 42 | - |
| UTXV   | 1043250 | 1043251 | B784W7:78485:5:79215:42548  | 42 | - |

Solution

  • The algorithm to replace the the values is not the problem. The interesting point is to maintain the fixed width columns.

    File: dataconvert.awk

    #!/usr/bin/gawk -f
    
    BEGIN {
        OFS = FS = "|"
    }
    {
        if ($7 ~ /+/) {
            len = length($4)
            $4 = $3
            gsub(/\s+/, "", $4)
            $4 = substr((" " ($4 + 1) "                       "), 0, len)
        } else {
            if ($7 ~ /-/) {
                len = length($3)
                $3 = $4
                gsub(/\s+/, "", $3)
                $3 = substr((" " ($3 - 1) "                       "), 0, len)
            }
        } 
        print
    }
    

    Mark the as executable by chmod +x dataconvert.awk an you can start it directly calling ./dataconvert.awk

    In this solution the width of the column's content is stored, the result filled with a save amount of spaces and finally cropped to the stored length.

    If the dataset really has no exceptions in providing the values + or - at column 6, then the second if clause can be omitted. At huge datasets this can save some time.

    As the dataset starts with a field delimiter the first column is always empty but exists. So the column count has to be +1 to the intuitive way to count.