I have tab-separated data that looks like this:
a 1a,2x,c1
b2 a4,4.6
3c 323
The second column holds multiple comma-separated values. I want to get this output:
a 1a
a 2x
a c1
b2 a4
b2 4.6
3c 323
I was able to do it with this Python code I wrote:
import sys

f = sys.argv[1]
with open(f) as f:
    for line in f:
        line = line.strip("\n").split("\t")
        genes = line[1].split(",")
        for gene in genes:
            print(line[0], gene, sep="\t")
I know I can do the same with a bash script, but I would like to know how to do this with a cool bash one-liner using awk, sed, tr and/or cut, without a for loop.
I couldn't get any further than this:
tr ',' '\n' < data
EDIT: As per the OP's request, here is a version without a loop (written and tested against the provided samples only). Fair warning: the gsub version with a pipe was asked for by the OP out of curiosity; it is both more fragile and slower than simply using a for loop and keeping all processing inside awk:
awk '{gsub(/,/,ORS $1 OFS)} 1' Input_file | column -t
Brief explanation: awk's gsub function globally substitutes every occurrence of , in each line with ORS (a newline by default), followed by $1 (the first field, as the OP requires) and OFS (a space by default). The trailing 1 then prints the line, whether it was edited or not. Finally, awk's output is passed to the column command, which aligns the columns with uniform spacing.
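To make the substitution trick easier to follow, here is a minimal Python sketch of the same idea, not the awk command itself. The function name expand_line and the tab separator are assumptions made for illustration; the sketch simply replaces each comma with a newline, the first field, and a separator, and it shares the fragility noted above (a comma inside the first field would also be replaced).

```python
def expand_line(line, sep="\t"):
    """Mimic the effect of awk's gsub(/,/, ORS $1 OFS) on one input line.

    expand_line is a hypothetical helper; sep stands in for OFS and is
    assumed to be a tab here so the output stays tab-separated.
    """
    key = line.split(sep, 1)[0]  # plays the role of $1
    # Each comma becomes: newline (ORS) + first field ($1) + separator (OFS)
    return line.replace(",", "\n" + key + sep)


# Sample rows taken from the question, with a real tab between the columns
sample = "a\t1a,2x,c1\nb2\ta4,4.6\n3c\t323"
for row in sample.split("\n"):
    print(expand_line(row))
```

Rows without a comma, such as the last one, pass through unchanged, which matches how the trailing 1 in the awk command prints unedited lines.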
Could you please try the following:
awk '{num=split($2,array,",");for(i=1;i<=num;i++){print $1,array[i]}}' Input_file