I have a file with almost 2 hundred columns, each with a varying number of rows similar to the following (but approximately 80 rows in length):
column1 column2 column3 column4....
abc def ghi jki
lmn opq rst uvw
xyz abc def
ghi jkl
what I need to do is effectively concatenate the columns with column1 being the name of the original column and column2 being the values that were in the rows, something like the following:
column1 abc
column1 lmn
column1 xyz
column1 ghi
column2 def
column2 opq
column3 ghi
column3 rst
column3 abc
column3 jkl
column4 ...
I have very limited skills and have tried to piece together the following, but I'm having no luck:
awk -F"\t" '{ for (i=1;i<=NF;i++) { l=length($i) ; if ( l > linesize[i] ) linesize[i]=l ; }} END \
{ for (l=1;l <= NF; l++) for (j=2;j<=NR;j++) printf "%d/t%d\n",NR==1($l),!(NR==$j&&NF==$l)="" ;; }'\
file_in.txt > file_out.txt
I've done extensive searching and have been unable to find anything that helps (I know for sure that some of the setting of the NR and NF is wrong, but I'm not sure how to fix it), so any help would be much appreciated. Thanks
It's long enough to do in pure awk
that it's easier to throw it all in a script file of its own instead of doing it as a one-liner:
#!/usr/bin/awk -f
BEGIN { FS = "\t" }
NR == 1 {
for (c = 1; c <= NF; c++)
headers[c] = $c
maxcol = NF
next
}
{
for (c = 1; c <= NF; c++)
if ($c != "")
results[c] = results[c] headers[c] "\t" $c "\n"
}
END {
for (c = 1; c <= maxcol; c++)
printf "%s", results[c]
}
Usage:
$ ./rowify.awk < input.tsv
column1 abc
column1 lmn
column1 xyz
column1 ghi
column2 def
column2 opq
column3 ghi
column3 rst
column3 abc
column3 jkl
column4 jki
column4 uvw
column4 def
This just appends each column name + column value as a new line in a string to the appropriate entry in an array, and then prints them all out at the end.
Alternative one liner with awk
in combination with GNU datamash:
$ datamash --no-strict transpose < input.tsv | awk '{ for (c = 2; c <= NF; c++) if ($c != "N/A") print $1 "\t" $c }'
column1 abc
column1 lmn
column1 xyz
column1 ghi
column2 def
column2 opq
column3 ghi
column3 rst
column3 abc
column3 jkl
column4 jki
column4 uvw
column4 def
The transpose
command switches rows and columns, so lines looks like column1 abc lnm ...
and then the awk just splits that up into individual lines.