Search code examples
bashawkdatatableconcatenation

rearranging a table with bash/awk


I have a file with almost 2 hundred columns, each with a varying number of rows similar to the following (but approximately 80 rows in length):

column1  column2 column3 column4....
abc      def     ghi     jki
lmn      opq     rst     uvw
xyz              abc     def
ghi              jkl

what I need to do is effectively concatenate the columns with column1 being the name of the original column and column2 being the values that were in the rows, something like the following:

column1  abc
column1  lmn
column1  xyz
column1  ghi
column2  def
column2  opq
column3  ghi
column3  rst
column3  abc
column3  jkl
column4  ...

I have very limited skills and have tried to piece together the following, but I'm having no luck:

awk -F"\t" '{ for (i=1;i<=NF;i++) { l=length($i) ; if ( l > linesize[i] ) linesize[i]=l ; }} END \
{ for (l=1;l <= NF; l++) for (j=2;j<=NR;j++) printf "%d/t%d\n",NR==1($l),!(NR==$j&&NF==$l)="" ;; }'\
 file_in.txt > file_out.txt

I've done extensive searching and have been unable to find anything that helps (I know for sure that some of the setting of the NR and NF is wrong, but I'm not sure how to fix it), so any help would be much appreciated. Thanks


Solution

  • It's long enough to do in pure awk that it's easier to throw it all in a script file of its own instead of doing it as a one-liner:

    #!/usr/bin/awk -f
    BEGIN { FS = "\t" }
    NR == 1 {
        for (c = 1; c <= NF; c++)
            headers[c] = $c
        maxcol = NF
        next
    }
    {
        for (c = 1; c <= NF; c++)
            if ($c != "")
                results[c] = results[c] headers[c] "\t" $c "\n"
    }
    END {
        for (c = 1; c <= maxcol; c++)
            printf "%s", results[c]
    }
    

    Usage:

    $ ./rowify.awk < input.tsv
    column1 abc
    column1 lmn
    column1 xyz
    column1 ghi
    column2 def
    column2 opq
    column3 ghi
    column3 rst
    column3 abc
    column3 jkl
    column4 jki
    column4 uvw
    column4 def
    

    This just appends each column name + column value as a new line in a string to the appropriate entry in an array, and then prints them all out at the end.

    Alternative one liner with awk in combination with GNU datamash:

    $ datamash --no-strict transpose < input.tsv | awk '{ for (c = 2; c <= NF; c++) if ($c != "N/A") print $1 "\t" $c }'
    column1 abc
    column1 lmn
    column1 xyz
    column1 ghi
    column2 def
    column2 opq
    column3 ghi
    column3 rst
    column3 abc
    column3 jkl
    column4 jki
    column4 uvw
    column4 def
    

    The transpose command switches rows and columns, so lines looks like column1 abc lnm ... and then the awk just splits that up into individual lines.