I want to split a huge file (big.txt) at given line numbers. For example, if the given numbers are 10 15 30, I will get 4 files: lines 1-10, 11-15, 16-30, and 31-EOF of the big.txt.
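To make the expected output concrete, here is a tiny illustration (my own, with a 35-line file instead of the real one):

seq 35 > small.txt
# splitting small.txt at 10 15 30 should produce:
#   file1.txt = lines 1-10
#   file2.txt = lines 11-15
#   file3.txt = lines 16-30
#   file4.txt = lines 31-35 (the 31-EOF part)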
Solving the problem is not a challenge for me; I wrote 3 different solutions. However, I cannot explain the performance: why is the awk script the slowest? (GNU Awk)
For big.txt, I just did seq 1500000000 > big.txt (1.5 billion lines, ~15 GB).
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=0 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
# Extract the lines
head -n $i "$INPUT_FILE" | tail -n +$START > "file$IDX.txt"
#
(( IDX++ ))
START=$(( i+1 ))
done
# Extract the last given line - last line in the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=1 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
T=$(( i+1 ))
# Extract the lines using sed command
sed -n -e " $START, $i p" -e "$T q" "$INPUT_FILE" > "file$IDX.txt"
(( IDX++ ))
START=$T
done
# Extract the last given line - last line in the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"
Solution 3: awk

awk -v nums="400000 700000 1200000" 'BEGIN{ c = split(nums, a) }
{
    # Find the first cut point this line falls under and write it there
    for (i = 1; i <= c; i++) {
        if (NR <= a[i]) {
            print > ("file" i ".txt")
            next
        }
    }
    # Past every cut point: the line belongs to the last chunk
    print > ("file" (c+1) ".txt")
}' big.txt
From my testing (using the time command), the head+tail one is the fastest:
real 73.48
user 1.42
sys 17.62
the sed one:
real 144.75
user 105.68
sys 15.58
the awk one:
real 234.21
user 187.92
sys 3.98
The awk script went through the file only once, so why is it much slower than the other two? Also, I thought head and tail would be the slowest solution; how come they are so fast? I guess it might have something to do with awk's redirection (print > file)?
Can someone explain it to me? Thank you.
Can awk be faster than head and tail for this?
No, it will be slower, at least for a reasonable number of chunks for a large input file, because awk has to read every line and do some work with it. On the other hand, head and tail do little more than scan for newline characters, without doing anything else, until they reach the line number provided as an argument. Past that point they don't have to read line by line and decide what to do; they just dump the content, similar to cat.
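A quick way to observe that per-line overhead in isolation (my own sketch, not part of the benchmark above; both commands produce identical output, and the cut point 400000 is just the first one from the question):

# head only scans for the 400000th newline, copying bytes along the way
time head -n 400000 big.txt > /dev/null
# awk must read each record and evaluate its rules once per line
time awk 'NR > 400000 { exit } { print }' big.txt > /dev/null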
If we increase the number of chunks, i.e. if the array of splitting line numbers gets larger and larger, then we will reach a point where the cost of spawning many head and tail processes overcomes the cost of one awk process, and from that point on awk would be faster.
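There is a second cost hiding in that loop, worth spelling out (my own back-of-the-envelope arithmetic, not a measurement): each head -n CUT starts reading from line 1 again, so with k evenly spaced cut points over an N-line file the i-th pass scans about i*N/k lines, roughly N*(k+1)/2 lines in total, while awk scans exactly N lines in a single process.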
awk script improvement

This awk is slow because of that loop! Just think: for every line that ends up in the last output file, we run 4 iterations until we print the line. Of course the time complexity still remains linear in the input size, but all these checks and assignments have costs that become noticeable as the input grows. It can be much improved, e.g. like this:
> cat tst.awk
BEGIN {             # referencing an index is enough to create it:
    a[1]            # a holds the starting line of every chunk
    a[40000]
    a[70000]
    a[120000]
}
NR in a {           # constant-time membership test, once per line
    close(out)      # close the previous chunk; awk limits open files
    out = "file" ++i ".txt"
}
{ print > out }
Here we do only one test on NR per line; essentially, we almost only print.
awk -f tst.awk big.txt
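If you prefer to keep the -v interface of the original one-liner, the same idea can take the cut points as a variable and build the lookup table in BEGIN (a sketch along the same lines, untimed):

awk -v nums="400000 700000 1200000" '
BEGIN {
    n = split(nums, tmp)
    a[1]                  # the first chunk starts at line 1
    for (j = 1; j <= n; j++)
        a[tmp[j] + 1]     # every later chunk starts right after a cut point
}
NR in a {                 # constant-time lookup, no per-line loop
    close(out)
    out = "file" ++i ".txt"
}
{ print > out }
' big.txt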
Testing
Here is some basic testing. I made a file, not huge, with 5.2M lines.
> wc -l big.txt
5288558 big.txt
Now, with that loop, it really matters where you split the file! If you have to write most of the rows into the last chunks, that means more iterations per line, and it is slower:
> head -1 test.sh
awk -v nums="100000 200000 300000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m10.960s
user 0m10.823s
sys 0m0.066s
If most rows go into the first file (that means one iteration and a next per line), it becomes faster!
> head -1 test.sh
awk -v nums="5000000 5100000 5200000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m6.914s
user 0m6.838s
sys 0m0.043s
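To make that iteration count visible, here is a small diagnostic of my own (it counts the comparisons the original loop performs; swap in either nums string from above):

awk -v nums="100000 200000 300000" '
BEGIN { c = split(nums, a) }
{
    for (i = 1; i <= c; i++) {
        cmp++             # one NR <= a[i] test per inner-loop pass
        if (NR <= a[i]) next
    }
}
END { printf "%d comparisons for %d lines\n", cmp, NR }
' big.txt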
With the above modification it should be fast enough regardless of the cut points.
> time awk -f tst.awk big.txt
real 0m4.270s
user 0m4.185s
sys 0m0.048s