Search code examples
csvawkfixed-width

Trouble converting a fixed-width file into a csv


sorry if this is a newbie question, but I didn't find the answer to this particular question on stackoverflow. I have a (very large) fixed-width data file that looks like this: simplefile.txt

ratno      fdate ratname                        typecode country        
12346 31/12/2010 HARTZ                              4    UNITED STATES
12444 31/12/2010 CHRISTIE                           5    UNITED STATES
12527 31/12/2010 HILL AIR                           4    UNITED STATES
15000 31/12/2010 TOKUGAVA  INC.                     5    JAPAN
37700 31/12/2010 HARTLAND                           1    UNITED KINGDOM
37700 31/12/2010 WILDER                             1    UNITED STATES  
18935 31/12/2010 FLOWERS FINAL SERVICES INC         5    UNITED STATES
37700 31/12/2010 MAPLE CORPORATION                  1    CANADA
48614 31/12/2010 SERIAL MGMT  L.P.                  5    UNITED STATES
 1373 31/12/2010 AMORE MGMT GROUP N A               1    UNITED STATES

I am trying to convert it into a csv file using the terminal (the file is too big for Excel) that would look like this:

ratno,fdate,ratname,typecode,country        
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA  INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES 
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT  L.P.,5,UNITED STATES
 1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES

I dug a bit around on this site and found a possible solution that relies on the awk shell command:

awk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}'   "simpletestfile.txt"

However, when I execute the above command in the terminal, it inadvertently also inserts commas in all white spaces, inside the separate words of what is supposed to remain a single field. The result of the above execution is as follows:

ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED,STATES
12444,31/12/2010,CHRISTIE,5,UNITED,STATES
12527,31/12/2010,HILL,AIR,4,UNITED,STATES
15000,31/12/2010,TOKUGAVA,INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED,KINGDOM
37700,31/12/2010,WILDER,1,UNITED,STATES
18935,31/12/2010,FLOWERS,FINAL,SERVICES,INC,5,UNITED,STATES
37700,31/12/2010,MAPLE,CORPORATION,1,CANADA
48614,31/12/2010,SERIAL,MGMT,L.P.,5,UNITED,STATES
1373,31/12/2010,AMORE,MGMT,GROUP,N,A,1,UNITED,STATES

How can I avoid inserting commas in white spaces outside of delineated fieldwidths? Thank you!


Solution

  • Your attempt was good, but requires gawk (gnu awk) for the FIELDWIDTHS built-in variable. With gawk:

    $ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' file
    
    ratno,      fdate, ratname                       , typecode, country
    12346, 31/12/2010, HARTZ                         ,     4   , UNITED STATES
    12444, 31/12/2010, CHRISTIE                      ,     5   , UNITED STATES
    12527, 31/12/2010, HILL AIR                      ,     4   , UNITED STATES
    

    Assuming you don't want the extra spaces, you can do instead:

    $ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{for (i=1; i<=NF; ++i) gsub(/^ *| *$/, "", $i)}1' file
    ratno,fdate,ratname,typecode,country
    12346,31/12/2010,HARTZ,4,UNITED STATES
    12444,31/12/2010,CHRISTIE,5,UNITED STATES
    12527,31/12/2010,HILL AIR,4,UNITED STATES
    

    If you don't have gnu awk, you can achieve the same results with:

    $ awk -v fieldwidths="5 11 31 9 16" '
    BEGIN { OFS=","; split(fieldwidths, widths) }
    {
        rec = $0
        $0 = ""
        start = 1;
        for (i=1; i<=length(widths); ++i) {
            $i = substr(rec, start, widths[i])
            gsub(/^ *| *$/, "", $i)
            start += widths[i]
        }
    }1' file
    
    ratno,fdate,ratname,typecode,country
    12346,31/12/2010,HARTZ,4,UNITED STATES
    12444,31/12/2010,CHRISTIE,5,UNITED STATES
    12527,31/12/2010,HILL AIR,4,UNITED STATES