
Splitting second column of a line to create multiple lines with a bash oneliner


I have tab-separated data that looks like this:

a   1a,2x,c1
b2  a4,4.6
3c  323

The second column has multiple comma-separated values. I want to get this output:

a   1a
a   2x
a   c1
b2  a4
b2  4.6
3c  323

I was able to do it with this Python code I wrote:

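import sys

path = sys.argv[1]

with open(path) as f:
    for line in f:
        # Split each row into its tab-separated columns.
        line = line.strip("\n").split("\t")
        # The second column holds the comma-separated values.
        genes = line[1].split(",")
        for gene in genes:
            print(line[0], gene, sep="\t")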

I know I can do the same with a bash script, but I would like to know how to do this with a cool bash one-liner using awk, sed, tr and/or cut, without using a for loop.

I couldn't go any further than this:

tr ',' '\n' < data


Solution

  • EDIT: As requested by the OP, here is a version without a loop (written and tested against the provided samples only). Fair warning: the OP asked for this gsub-plus-pipe version out of curiosity; it is both more fragile and slower than just using a for loop and keeping all processing inside awk:

    awk '{gsub(/,/,ORS $1 OFS)} 1'  Input_file | column -t
    

    Brief explanation: awk's gsub function globally substitutes every occurrence of , in each line with ORS (the output record separator, a newline by default), followed by $1 (the first field, as the OP requires) and OFS (the output field separator, a space by default). The trailing 1 is an always-true condition, so awk prints each line whether or not it was edited. The output is then piped to the column command, which aligns the columns with consistent spacing.

    Could you please try the following.

    awk '{num=split($2,array,",");for(i=1;i<=num;i++){print $1,array[i]}}' Input_file
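
Both commands above use awk's default separators, so the output is space-separated (or aligned by column -t) rather than tab-separated as in the question. A minimal sketch of tab-preserving variants, assuming the input always has exactly two tab-separated columns; the function names split_col2 and split_col2_gsub are made up for illustration:

```shell
# Loop variant: one output row per comma-separated value in column 2.
split_col2() {
  awk 'BEGIN { FS = OFS = "\t" }      # read and write tab-separated fields
       {
         n = split($2, vals, ",")      # n = number of values in column 2
         for (i = 1; i <= n; i++)
           print $1, vals[i]           # repeat column 1 for each value
       }' "$@"
}

# gsub variant: replace each comma in column 2 with newline + column 1 + tab.
# Still fragile: an & or backslash in column 1 is special in the replacement.
split_col2_gsub() {
  awk 'BEGIN { FS = OFS = "\t" }
       { gsub(/,/, ORS $1 OFS, $2) } 1' "$@"
}

# Usage with the sample data from the question:
printf 'a\t1a,2x,c1\nb2\ta4,4.6\n3c\t323\n' | split_col2
```

Either function reads from a file argument or from stdin. Restricting gsub to $2 keeps commas in the first column untouched, and no column -t is needed since the tabs are kept. Note that the loop variant prints nothing for a row whose second column is empty.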