Is it possible to split a file into multiple gzip files in one line?
Let's say I have a very large file, data.txt, containing:
A somedata 1
B somedata 1
A somedata 2
C somedata 1
B somedata 2
I would like to split it into separate directories of gz files.
For example, if I didn't care about separating, I would do:
cat data.txt | gzip -5 -c | split -d -a 3 -b 100000000 - one_dir/one_dir.gz.
This generates gz files in 100 MB chunks under the one_dir directory.
But what I want is to separate the lines based on the first column. So I would like to have, say, 3 different directories, containing gz files in 100 MB chunks for A, B and C respectively.
So the final directory layout will look like:
A/
    A.gz.000
    A.gz.001
    ...
B/
    B.gz.000
    B.gz.001
    ...
C/
    C.gz.000
    C.gz.001
    ...
Can I do this in a one-liner using cat/awk/gzip/split? Can I also have it create the directories (if they don't exist yet)?
With awk:
awk '
    !d[$1]++ {
        system("mkdir -p "$1)
        c[$1] = "gzip -5 -c|split -d -a 3 -b 100000000 - "$1"/"$1".gz."
    }
    { print | c[$1] }
' data.txt
Assumes:
- sufficiently few distinct values of $1 (there is an implementation-specific limit on how many pipes can be active simultaneously; e.g. popen() on my machine seems to allow 1020 pipes per process)
- no problematic characters in $1
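To gauge whether the pipe limit is a concern, one could count the distinct first-column values before running the script. This is only a rough sketch: count_keys is a helper name made up for illustration, and comparing against ulimit -n assumes the pipe limit follows the per-process open-file limit, which may not hold for every awk implementation.

```shell
# count_keys FILE: print the number of distinct first-column values.
# (Hypothetical helper, not part of the original answer.)
count_keys() {
    awk '{ print $1 }' "$1" | sort -u | wc -l
}

# Usage sketch: show the key count next to the open-file limit.
# Assumption: each open pipe costs awk at least one file descriptor.
if [ -f data.txt ]; then
    echo "distinct keys: $(count_keys data.txt), open-file limit: $(ulimit -n)"
fi
```

If the key count is anywhere near the limit, the sorted single-pipe variant is probably the safer choice.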
Incorporating improvements suggested by @EdMorton:
- If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above, as only a single pipe will need to be active.
- You can remove the second limit by suitable testing and quoting of $1 before use. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)
sort -s -k1,1 data.txt | awk '
    $1 ~ "/" {
        print "Warning: unsafe character(s). Ignoring line",FNR >"/dev/stderr"
        next
    }
    $1 != prev {
        close(cmd)
        prev = $1
        # escape single-quote (\047) for use below
        s = $1
        gsub(/\047/,"\047\\\047\047",s)
        system("mkdir -p -- \047"s"\047")
        cmd = "gzip -5 -c|split -d -a 3 -b 100000000 -- - \047"s"/"s".gz.\047"
    }
    { print | cmd }
'
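Because split cuts one continuous gzip stream per key at fixed byte offsets, an individual chunk is generally not decompressible on its own; the chunks need to be concatenated in order first. A minimal sketch of reading one key back, assuming the KEY/KEY.gz.NNN layout from the question (read_key is a hypothetical helper name):

```shell
# read_key KEY: write the original lines for KEY to stdout.
# Assumes chunks named KEY/KEY.gz.000, KEY/KEY.gz.001, ...
read_key() {
    # The glob sorts lexically, which matches split's fixed-width
    # numeric suffixes, so the chunks are concatenated in order.
    cat -- "$1"/"$1".gz.* | gzip -dc
}
```

For example, read_key A | head would show the first few lines that had A in the first column.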
Note that the code above still has gotchas:
- Paths have length limits. For a path d1/d2/f:
    - the total length can't exceed getconf PATH_MAX d1/d2; and
    - the name part (f) can't exceed getconf NAME_MAX d1/d2
Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
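As a rough illustration of the NAME_MAX check, the sketch below tests whether the chunk names generated for a given key would fit in a target directory. name_fits is a hypothetical helper, and the extra 7 bytes assume the ".gz." infix plus the 3-digit suffix used in the commands above.

```shell
# name_fits KEY DIR: succeed if "KEY.gz.NNN" fits within NAME_MAX for DIR.
# (Hypothetical helper; DIR must already exist.)
name_fits() {
    max=$(getconf NAME_MAX "$2") || return 2
    # Length of the key in bytes, plus 7 for ".gz." and a 3-digit suffix.
    len=$(printf '%s' "$1" | wc -c)
    [ $((len + 7)) -le "$max" ]
}
```

A call such as name_fits "$key" . || echo "key too long" could be run over the distinct keys before starting the split.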