
Split a file into multiple gzip files in one line


Is it possible to split a file into multiple gzip files in one line?

Let's say I have a very large file, data.txt, containing:

A somedata 1
B somedata 1
A somedata 2
C somedata 1
B somedata 2

I would like to split it into separate directories of gz files.

For example, if I didn't care about separating, I would do

cat data.txt | gzip -5 -c | split -d -a 3 -b 100000000 - one_dir/one_dir.gz.

This generates gz files in 100 MB chunks under the one_dir directory.

But what I want is to separate the lines based on the first column. So I would like to have, say, 3 different directories, containing gz files in 100 MB chunks for A, B and C respectively.

So the final directory layout will look like

A/
  A.gz.000
  A.gz.001
  ...
B/
  B.gz.000
  B.gz.001
  ...
C/
  C.gz.000
  C.gz.001
  ...

Can I do this in a one-liner using cat/awk/gzip/split? Can I also have it create the directory if it doesn't exist yet?


Solution

  • With awk:

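    awk '
       # first time this key is seen: create its directory and build its pipeline
       !d[$1]++ {
          system("mkdir -p "$1)
          c[$1] = "gzip -5 -c|split -d -a 3 -b 100000000 - "$1"/"$1".gz."
       }
       # send every line to the pipeline for its key (pipes are closed when awk exits)
       { print | c[$1] }
    ' data.txt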
    

    Assumes:

    • sufficiently few distinct $1 (there is an implementation-specific limit on how many pipes can be active simultaneously, e.g. popen() on my machine seems to allow 1020 pipes per process)
    • no problematic characters in $1
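
To judge whether the first assumption holds for a given input, one rough check is to count the distinct first-column values and compare that with the per-process open-file limit. The following is a minimal sketch, not part of the original answer: the temporary sample file stands in for data.txt, and `ulimit -n` is assumed to be supported by your shell (it is in bash and dash). The real ceiling may be somewhat lower than the reported limit, since awk also needs descriptors for its own input and output.

```shell
# Hypothetical sample input standing in for data.txt
tmp=$(mktemp -d)
printf '%s\n' 'A somedata 1' 'B somedata 1' 'A somedata 2' \
              'C somedata 1' 'B somedata 2' > "$tmp/data.txt"

# Each distinct first-column value costs one open pipe in the awk script
keys=$(awk '!seen[$1]++ { n++ } END { print n+0 }' "$tmp/data.txt")

# Soft limit on open file descriptors for this shell (may print "unlimited")
limit=$(ulimit -n)

echo "distinct keys: $keys, open-file limit: $limit"
```

If the key count is anywhere near the limit, the sorted, single-pipe approach is the safer choice.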

    Incorporating improvements suggested by @EdMorton:

    • If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above as only a single pipe will need to be active.
    • You can remove the second limit by suitable testing and quoting before you use $1. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)

    sort -s -k1,1 data.txt | awk '
       $1 ~ "/" {
          print "Warning: unsafe character(s). Ignoring line",FNR >"/dev/stderr"
          next
       }
       $1 != prev {
          close(cmd)
          prev = $1
    
          # escape single-quote (\047) for use below
          s = $1
          gsub(/\047/,"\047\\\047\047",s)
    
          system("mkdir -p -- \047"s"\047")
          cmd = "gzip -5 -c|split -d -a 3 -b 100000000 -- - \047"s"/"s".gz.\047"
       }
       { print | cmd }
    '
    

    Note that the code above still has gotchas:

    • for a path d1/d2/f:
      • the total length can't exceed getconf PATH_MAX d1/d2; and
      • the name part (f) can't exceed getconf NAME_MAX d1/d2
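
These limits can be queried before running the split. The snippet below is only a sketch under assumptions: `dir` and `key` are hypothetical placeholders for a target directory and a first-column value, and `getconf` is assumed to return a number for both variables (on some systems it can report the value as undefined). Note that NAME_MAX is measured in bytes, while `${#file}` counts characters, so the comparison is only exact for single-byte names.

```shell
dir=.                 # hypothetical: directory the chunks will be written under
key='A'               # hypothetical: one first-column value
file="$key.gz.000"    # name of the first chunk split would create

name_max=$(getconf NAME_MAX "$dir")
path_max=$(getconf PATH_MAX "$dir")
echo "NAME_MAX=$name_max PATH_MAX=$path_max"

# Compare the chunk name length with the per-component limit
if [ "${#file}" -gt "$name_max" ]; then
    echo "name too long: $file"
else
    echo "name fits: $file"
fi
```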

    Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
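
One more point worth checking in practice: because the data is compressed first and split afterwards, each chunk is a slice of a single gzip stream rather than a standalone .gz file, so the chunks have to be concatenated in order before decompressing. The sketch below illustrates this on made-up data, with a deliberately tiny chunk size (20 bytes, an assumption chosen only to force several pieces) and GNU split's `-d` option as used in the question.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/A"

# Hypothetical sample lines for key A
printf 'A somedata %s\n' 1 2 3 4 5 > "$tmp/expected.txt"

# Same pipeline shape as above, but with 20-byte chunks to get several pieces
gzip -5 -c < "$tmp/expected.txt" | split -d -a 3 -b 20 - "$tmp/A/A.gz."

# The glob expands in lexical order (000, 001, ...), which is the original order
cat "$tmp"/A/A.gz.* | gunzip -c > "$tmp/roundtrip.txt"

cmp "$tmp/expected.txt" "$tmp/roundtrip.txt" && echo "round trip OK"
```

Decompressing any chunk other than the first on its own will fail, since it lacks the gzip header.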