How to delete all lines bigger than 1GB in length in a text file?

I have a text file that might contain long lines. How can I delete all the lines whose length is more than 1GB in that file and keep only the lines that are smaller than 1GB? Thanks

Solution

I believe any solution to your question will either require reading the file multiple times or reading lines into a buffer that is at least 1GB large.

A naïve solution in bash does the latter and is likely to crash:

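#!/bin/bash
# Keep only lines of at most 1000000000 characters.
# The "|| [ -n ... ]" test also handles a final line with no trailing newline.
while IFS= read -r line || [ -n "$line" ]; do
    if [ "${#line}" -le 1000000000 ]; then
        printf '%s\n' "$line"    # unlike echo, safe for lines such as "-n"
    fi
done <infile >tmpfile
mv tmpfile infile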

It runs very slowly, and a quick test suggests it needs roughly three times as much RAM as the longest line.
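
A quick test of this kind needs an input containing one very long line. The following is a minimal sketch of how such a file could be generated at a reduced scale. The helper name make_sample and the sizes are illustrative assumptions, not part of the original answer, and head -c is assumed to be available (it is in GNU and BSD userlands).

```shell
#!/bin/bash
# Hypothetical helper: write a three-line file whose middle line is N characters long.
make_sample() {
    local n=$1 out=$2
    {
        printf 'first\n'
        head -c "$n" /dev/zero | tr '\0' 'x'   # N copies of "x", no newline yet
        printf '\nlast\n'
    } >"$out"
}

sample=$(mktemp)
make_sample 1000000 "$sample"   # scale N up once the script behaves as expected
wc -l -c "$sample"              # line count and byte count of the sample
rm -f "$sample"
```

Where GNU time is installed, running a script under /usr/bin/time -v reports its peak memory use as "Maximum resident set size".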

We can read into a smaller buffer to avoid this, but the code is much more complicated and still runs extremely slowly. For example:

#!/bin/bash

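max=1000000000      # longest line to keep, in characters
buflen=33554432     # chunk size for each read (32 MiB)

len=0               # length so far of the line being assembled
data="$(mktemp)"    # scratch file holding the current line's chunks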

savedata(){
    printf "%s" "$1" >>"$data"
    (( len+=${#1} ))
}

cleardata(){
    cat /dev/null >"$data"
    len=0
}

# Emit the assembled line if it is short enough; noecho=1 suppresses the newline.
maybeprintdata(){
    if (( len<=max )); then
        cat "$data"
        (( noecho )) || echo
    fi
}

(
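    # A failed read that still returned data means the input ended without a newline.
    while IFS= read -n "$buflen" -r line || { [ -n "$line" ] && noecho=1; }; do
        savedata "$line"
        # A chunk shorter than buflen ends the line; a full chunk means more may follow.
        if (( ${#line}!=buflen || noecho )); then
            maybeprintdata
            cleardata
        fi
    done
    # The input ended right after a full chunk, with no trailing newline.
    (( len )) && noecho=1 maybeprintdata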

) <infile >tmpfile
mv tmpfile infile

rm "$data"
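
The comparison of each chunk's length with buflen relies on how bash's read -n behaves: it returns after N characters or at a newline, whichever comes first. A small sketch with a chunk size of 3 makes this visible. The function name show_chunks is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print every chunk that "read -n N" returns, in brackets.
show_chunks() {
    local n=$1 chunk
    while IFS= read -r -n "$n" chunk || [ -n "$chunk" ]; do
        printf '[%s]\n' "$chunk"
    done
}

printf 'abcdefgh\n' | show_chunks 3   # two full chunks, then a short one that ends the line
printf 'abcdef\n'   | show_chunks 3   # line length is an exact multiple of the chunk size
```

In the second case the newline is only consumed by a further read that returns an empty chunk, so a chunk shorter than the chunk size (possibly empty) is what marks the end of a line.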

If you are not limited to bash, much faster programs are possible.

A "one-liner" Perl equivalent of the naïve bash solution might be:

perl -i -nlE 'length>1e9 || say' file

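-i makes changes to file in-place
-n wraps an implicit iterate-over-lines loop around the program
-l strips the trailing newline from each line before its length is measured
-E enables newer features such as say
1e9 is a short form for 1000000000
say is like bash's echo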

Note that unlike the "complicated" bash program above, this simple Perl program outputs a final newline even if the input didn't have one.

Note also that it needs at least as much RAM as the longest line in the file, which could be a problem if a line is larger than available memory.