Search code examples
bashawk

How to separate column using character length with AWK


I have a data, which is not formatted according to proper field separator (which is awk good at). But, what I know is, the data is fixed width.

 NODE     S1           S2           S3           SINT         SEQV    
   1  0.14919     -0.58396E-001-0.71230      0.86149      0.77873     
   2  0.56037E-001 0.23261E-002-0.37154      0.42757      0.40341     
   3  0.52036E-001 0.19762E-001-0.27222      0.32426      0.30939     
   4  0.59765E-001 0.22059E-001-0.24529      0.30505      0.28806     
   5  0.70704E-001-0.51976E-002-0.13862      0.20932      0.18354     
   6  0.11906      0.44607E-001-0.17493      0.29399      0.26474     
   7  0.25540      0.95993E-002-0.43110      0.68650      0.60246     
   8  0.52246E-001-0.47008E-001-0.35167      0.40391      0.36456     
   9  0.32215E-001-0.62291E-001-0.28800      0.32021      0.28497     
  10  0.28072E-001-0.68269E-001-0.28304      0.31111      0.27586     
  11  0.25990E-001-0.78663E-001-0.28626      0.31225      0.27527     
  12  0.26657E-001-0.79217E-001-0.29507      0.32173      0.28400     

The expected output is something like this (the numbers in other columns can have format like in S2, in other parts of the data):

  NODE       S1           S2           S3           SINT         SEQV    
   1    0.14919       -0.58396E-001  -0.71230      0.86149      0.77873     
   2    0.56037E-001   0.23261E-002  -0.37154      0.42757      0.40341     
   3    0.52036E-001   0.19762E-001  -0.27222      0.32426      0.30939     
   4    0.59765E-001   0.22059E-001  -0.24529      0.30505      0.28806     
   5    0.70704E-001  -0.51976E-002  -0.13862      0.20932      0.18354     
   6    0.11906        0.44607E-001  -0.17493      0.29399      0.26474     
   7    0.25540        0.95993E-002  -0.43110      0.68650      0.60246     
   8    0.52246E-001  -0.47008E-001  -0.35167      0.40391      0.36456     
   9    0.32215E-001  -0.62291E-001  -0.28800      0.32021      0.28497     
  10    0.28072E-001  -0.68269E-001  -0.28304      0.31111      0.27586     
  11    0.25990E-001  -0.78663E-001  -0.28626      0.31225      0.27527     
  12    0.26657E-001  -0.79217E-001  -0.29507      0.32173      0.28400     

The main problem is, sometimes the column are separated by space, sometime no space ('-' sign occupies the space), that is, no proper field separator. I found a similar question here, but that data was consistent in terms of field separator. There might be two ways, according to my thinking,

  1. Use character length. For example, starting from the second line, the characters are either 74 or 75 per line.
  2. Use field width: Don't know if it's possible with AWK.

I'm new to AWK, I know it might be very easy with some other tools, but I would like to know if it's possible to separate/extract these columns using awk. I am using awk in terminal in MacOS.


Solution

  • Although the answers by JNevill & Andriy Makukha are fine in their own rights, I wanted to solve the problem specifically using NAWK (MacOS default AWK). I have found that "FIELDWIDTHS" is a gawk only feature Grymoire,machelp.

    The catch is to use

    FS=""
    

    which would consider each character as a field. As the fields are of equal width, it's possible to extract each column when one knows the number of characters in each column. For example, if I want to extract the first, second, third and the last column, I can use the following code (if the source data is named as: test_input.txt):

    awk 'BEGIN{FS=""}{print $1$2$3$4"\t"$5$6$7$8$9$10$11$12$13$14$15$16$17$18"\t"$19$20$21$22$23$24$25$26$27$28$29$30$31"\t"$59$60$61$62$63$64$65$66$67$68$69$70}' test_input.txt
    

    output:

    NODE         S1             S2             SEQV    
       1      0.14919       -0.58396E-001   0.77873     
       2      0.56037E-001   0.23261E-002   0.40341     
       3      0.52036E-001   0.19762E-001   0.30939     
       4      0.59765E-001   0.22059E-001   0.28806     
       5      0.70704E-001  -0.51976E-002   0.18354     
       6      0.11906        0.44607E-001   0.26474     
       7      0.25540        0.95993E-002   0.60246     
       8      0.52246E-001  -0.47008E-001   0.36456     
       9      0.32215E-001  -0.62291E-001   0.28497     
      10      0.28072E-001  -0.68269E-001   0.27586     
      11      0.25990E-001  -0.78663E-001   0.27527     
      12      0.26657E-001  -0.79217E-001   0.28400
    

    This is a simple but somewhat unsophisticated solution, but it works for me as of now, for large data. Any further refinement would be welcome...