
Using Unix tools (awk, sed, bash?) to truncate items in a column at the 4th underscore?


I have a series of files that look like this, each with thousands of lines. The second and third columns are duplicates of each other.

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2_i1.p1 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1_i2.p1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1_i1.p1 6.26e-66

I want to take the 3rd column and truncate it so that everything in the string after and including _i is deleted, like so:

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66

The numbers after each letter combination (DN, c, g, i, p) could be anything and could also be any length, so I can't just truncate to a certain length.

I've tried sed -i 's/_i.*//' file.txt, but this deleted everything from the first _i to the end of each line, not just the part in the column of interest.
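
For reference, the behavior of that sed command can be reproduced on a single sample line. This is only a small sketch: the variable names `line` and `result` are illustrative, and sed is run without -i so no file is modified.

```shell
# One sample line from the data above, space-separated
line='AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2_i1.p1 1.36e-115'

# sed works on the whole line, so the match starts at the first "_i"
# (which is inside column 2) and runs to the end of the line.
result=$(printf '%s\n' "$line" | sed 's/_i.*//')
printf '%s\n' "$result"
# prints: AT1G15820.1 TRINITY_DN96909_c1_g2
```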

Thanks so much!


Solution

  • awk '{sub(/_[^_]+$/,"",$3)}1' file
    
    AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
    AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
    AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66
    

    In the 3rd field, this deletes the last underscore and everything after it.
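
    Two caveats may be worth checking against the real files. First, assigning to $3 makes awk rebuild the line with its output separator (a single space by default), so tab-separated input would come out space-separated. Second, the command above removes whatever follows the last underscore, whether or not it starts with _i. The sketch below is not part of the original answer: it assumes tab-separated input, matches _i followed by digits explicitly, and the function name `truncate_col3_tsv` is hypothetical.

```shell
# Hypothetical helper: remove "_i<digits>" and everything after it from field 3 only.
# Assumes tab-separated input and keeps tabs in the output; drop the BEGIN block
# if the files are actually space-separated.
truncate_col3_tsv() {
    awk 'BEGIN { FS = OFS = "\t" } { sub(/_i[0-9]+.*$/, "", $3) } 1' "$@"
}

# Example on one tab-separated line built with printf
printf 'AT1G15820.1\tTRINITY_DN96909_c1_g2_i1.p1\tTRINITY_DN96909_c1_g2_i1.p1\t1.36e-115\n' |
    truncate_col3_tsv
```

    With file names as arguments, e.g. truncate_col3_tsv file.txt > file.truncated.txt, the result goes to a new file and the original is left untouched.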