
How to delete all lines bigger than 1GB in length in a text file?


I have a text file that might contain long lines. How can I delete all the lines whose length is more than 1GB in that file and keep only the lines that are smaller than 1GB? Thanks


Solution

  • I believe any solution to your question will either require reading the file multiple times or reading lines into a buffer that is at least 1GB large.

    A naïve solution in bash does the latter and is likely to crash:

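    #!/bin/bash
    # Keep only lines of at most 1000000000 characters.
    # "|| [ -n "$line" ]" also processes a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "${#line}" -le 1000000000 ]; then
            printf '%s\n' "$line"    # printf, because echo mangles lines such as "-n"
        fi
    done <infile >tmpfile
    mv tmpfile infile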
    

    It runs very slowly, and a quick test suggests it needs roughly three times as much RAM as the longest line.
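
    To try the loop without a multi-gigabyte file, the same logic can be run against a scaled-down input. The sketch below is only an illustration: the function name drop_long_lines, the 10-character limit and the generated sample file are assumptions made for the demo, not part of the original answer.

```shell
#!/bin/bash
# Hypothetical helper: copy stdin to stdout, dropping lines longer than $1.
# Note: ${#line} counts characters, which equals bytes only in a single-byte locale.
drop_long_lines() {
    local max=$1 line
    # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "${#line}" -le "$max" ]; then
            printf '%s\n' "$line"
        fi
    done
}

# Scaled-down demo: a 10-character limit stands in for 1000000000.
sample=$(mktemp)
printf 'short\n%s\nalso short\n' "$(head -c 50 /dev/zero | tr '\0' x)" >"$sample"
drop_long_lines 10 <"$sample"    # the 50-character line of x's is dropped
rm -f "$sample"
```

    Raising the limit and the size passed to head -c, and running the script under a tool that reports peak memory, is one way to repeat the memory measurement on a given machine.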


    We can read into a smaller buffer to avoid this, but the code is much more complicated and still runs extremely slowly. For example:

    #!/bin/bash
    
    max=1000000000
    buflen=33554432
    
    len=0
    data="$(mktemp)"    # mktemp replaces the deprecated, Debian-only tempfile
    
    savedata(){
        printf "%s" "$1" >>"$data"
        (( len+=${#1} ))
    }
    
    cleardata(){
        cat /dev/null >"$data"
        len=0
    }
    
    maybeprintdata(){
        if (( len<=max )); then
            cat "$data"
            (( noecho )) || echo
        fi
    }
    
    (
        while IFS= read -n $buflen -r line || [ -n "$line" ]; do
            savedata "$line"
            if (( ${#line}!=buflen )); then
                maybeprintdata
                cleardata
            fi
        done 
        (( len )) && noecho=1 maybeprintdata
    
    ) <infile >tmpfile
    mv tmpfile infile
    
    rm "$data"
    

    If you are not limited to bash, much faster programs are possible.

    A "one-liner" Perl equivalent of the naïve bash solution might be:

    perl -i -nlE 'length>1e9 || say' file
    
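    • -i makes changes to file in-place
    • -n wraps an implicit iterate-over-lines loop around the program
    • -l strips the newline from each input line and adds one back on output
    • -E is like -e but also enables optional features such as say
    • length with no argument measures the current line ($_)
    • 1e9 is a short form for 1000000000
    • say is like bash's echo: it prints the current line followed by a newline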

    Note that unlike the "complicated" bash program above, this simple Perl program outputs a final newline even if the input didn't have one.

    Note also that it needs at least as much RAM as the longest line in the file, which is a problem if a line can be longer than the available memory.