perl, awk, data-manipulation

Split large csv file into multiple files based on column(s)


I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a unix machine.

#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z

I now want multiple files like this:

# contents of 1.csv
1,2
a,b
q,w
a,s
z,x

# contents of 2.csv
1,3
a,c
q,e
a,d
z,c

# contents of 3.csv
1,4
a,d
q,r
a,f
z,v

and so on...

I can do this currently with awk on small files (say 30 columns) like this:

awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv

The above takes a very long time on large files, and I was wondering whether there is a faster, more efficient way of doing the same.

Thanks in advance.


Solution

  • With your shown samples and attempts, please try the following awk code. Opening thousands of output files at once can fail with the infamous "too many open files" error, so to avoid that this code collects all values into an array and, in the END block, prints them one output file at a time, closing each file as soon as its contents have been written.

    awk '
    BEGIN{ FS=OFS="," }
    {
      # For each row, append "col1,col(i+1)" to the in-memory buffer
      # for output file i.csv.
      for(i=1;i<NF;i++){
        value[i]=(i in value ? value[i] ORS : "") ($1 OFS $(i+1))
      }
    }
    END{
      # NF still holds the field count of the last record read, so the
      # buffers run from 1 to NF-1 (note: i<NF, not i<=NF, to avoid
      # writing an empty extra file).
      for(i=1;i<NF;i++){
        outFile=i".csv"
        print value[i] > (outFile)
        close(outFile)
      }
    }
    ' large_file.csv
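If the file is too large to buffer entirely in memory, another option is to trade memory for extra passes: process the columns in batches, keeping the batch size below the open-file limit (`ulimit -n`). The sketch below assumes the sample input from the question and a hypothetical batch size of 500; with 10k columns that means about 20 passes over the file, but only ~500 output files open at any moment and no in-memory buffering.

```shell
# Recreate the sample input from the question (stands in for the real file).
cat > large_file.csv <<'EOF'
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
EOF

BATCH=500   # hypothetical tunable; keep below `ulimit -n`
ncols=$(head -n 1 large_file.csv | awk -F, '{ print NF }')

# One awk pass per batch of columns; files are closed when each awk exits.
for s in $(seq 2 "$BATCH" "$ncols"); do
  awk -F, -v s="$s" -v e=$((s + BATCH - 1)) '{
    # pair column 1 with each column in the current batch
    for (i = s; i <= e && i <= NF; i++)
      print $1 "," $i > ((i - 1) ".csv")
  }' large_file.csv
done
```

Since each batch is a separate awk process, the open files are released between passes, so neither the file-descriptor limit nor total memory grows with the column count.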