I have a text file that might contain very long lines. How can I delete every line that is longer than 1GB from that file, keeping only the lines that are smaller than 1GB? Thanks
I believe any solution to your question will either require reading the file multiple times or reading lines into a buffer that is at least 1GB in size.
A naïve solution in bash does the latter and is likely to crash:
#!/bin/bash
# Read one whole line at a time and keep it only if it is at most 1GB long.
while IFS= read -r line; do
    if [ ${#line} -le 1000000000 ]; then
        echo "$line"
    fi
done <infile >tmpfile
mv tmpfile infile
It will run very slowly, and from a quick test I think it will need something like 3x as much RAM as the longest line.
We can read into a smaller buffer to avoid this, but the code is much more complicated and still runs extremely slowly. For example:
#!/bin/bash
max=1000000000      # longest line length we are willing to keep
buflen=33554432     # read the input in 32MiB chunks
len=0               # length of the line accumulated so far
data="$(tempfile)"  # on-disk buffer holding the current line

# Append a chunk to the on-disk buffer and update the running length.
savedata(){
    printf "%s" "$1" >>"$data"
    (( len+=${#1} ))
}

# Empty the on-disk buffer ready for the next line.
cleardata(){
    cat /dev/null >"$data"
    len=0
}

# Print the buffered line only if it is shorter than $max.
maybeprintdata(){
    if (( len<max )); then
        cat "$data"
        (( noecho )) || echo
    fi
}

(
    while IFS= read -n $buflen -r line || [ -n "$line" ]; do
        savedata "$line"
        # A chunk shorter than $buflen means we have reached the end of a line.
        if (( ${#line}!=buflen )); then
            maybeprintdata
            cleardata
        fi
    done
    # Handle a final line that has no trailing newline.
    (( len )) && noecho=1 maybeprintdata
) <infile >tmpfile
mv tmpfile infile
rm "$data"
If you are not limited to bash, much faster programs are possible.
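One such option (just a sketch, reusing the threshold and filenames from above) is awk, which does the same length test with far less per-line overhead than the bash loop, although it too reads each whole line into memory:
# Keep only lines of at most 1000000000 characters, then replace the original file.
awk 'length($0) <= 1000000000' infile >tmpfile && mv tmpfile infile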
A "one-liner" Perl equivalent of the naïve bash solution might be:
perl -i -nlE 'length>1e9 || say' file
-i makes changes to file in-place
-n wraps an implicit iterate-over-lines loop around the program
-l removes the trailing newline from each input line, so it is not counted by length
1e9 is a short form for 1000000000
say is like bash's echo
Note that unlike the "complicated" bash program above, this simple Perl program outputs a final newline even if the input didn't have one.
Note also that it needs as much RAM as the longest line in the file (which could be a problem if a line's length can exceed the available memory).
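To convince yourself the one-liner does what you want, you could try it on a small file with the threshold lowered (the demo filename and the limit of 10 are arbitrary):
# Build a test file with a 5-, a 20- and a 10-character line.
{ echo short; printf 'X%.0s' {1..20}; echo; echo 'also short'; } >demo
perl -i -nlE 'length>10 || say' demo
cat demo      # expect only "short" and "also short" to remain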