Tags: bash, command-line, random, line-processing

Get random lines from large files in bash


How can I get n random lines from very large files that don't fit in memory?

Also, it would be great if I could add filters before or after the randomization.


Update 1

In my case the specs are:

  • > 100 million lines
  • > 10 GB files
  • usual random batch size: 10,000-30,000 lines
  • 512 MB RAM, hosted Ubuntu Server 14.10

So losing a few lines from the file won't be a big problem, since any given line only has about a 1 in 10,000 chance of being picked anyway (10,000 samples out of 100 million lines), but performance and resource consumption would be a problem.
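
For reference, GNU coreutils already ships a sampler: shuf -n draws a fixed-size uniform sample, and the filters can simply be grep stages around it. Whether it fits the memory budget is version-dependent (older shuf releases load the whole input into memory; newer ones reservoir-sample when -n is given), so treat this as a sketch to verify on the target box, with a placeholder path:

    # draw 10000 uniformly random lines (GNU shuf)
    shuf -n 10000 /path/to/largefile.txt

    # filters go before (or after) the sampling step
    grep "[0-9]" /path/to/largefile.txt | shuf -n 10000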


Solution

  • #!/bin/bash
    # contents of bashScript.sh

    file="$1"
    lineCnt="$2"
    filter="${3:-}"                        # an empty pattern matches every line
    nfilter="$4"
    [ -n "$nfilter" ] || nfilter=$'\001'   # sentinel assumed absent from the data, so nothing is excluded by default
    echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2

    totalLineCnt=$(grep -E -- "$filter" "$file" | grep -Ev -- "$nfilter" | wc -l)
    echo "filtered count : $totalLineCnt" 1>&2

    chances=$(echo "$lineCnt/$totalLineCnt" | bc -l)
    echo "chances : $chances" 1>&2

    # the probability has to be handed to awk with -v: inside the single-quoted
    # program, a shell variable like $chances would never be expanded
    awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances' "$file" \
        | grep -E -- "$filter" | grep -Ev -- "$nfilter" | head -n "$lineCnt"
    

    Usage:

    get a 1000-line random sample

    bashScript.sh /path/to/largefile.txt 1000  
    

    only lines that contain digits

    bashScript.sh /path/to/largefile.txt 1000 "[0-9]"
    

    exclude lines containing mike or jane

    bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
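
The script keeps each line with probability lineCnt/totalLineCnt, so the sample size is only approximate: head trims any overshoot, but an undershoot simply yields fewer lines. If an exact count matters, a one-pass reservoir sample (Algorithm R) returns exactly n lines while holding only n lines in memory. A minimal sketch under the same assumptions (placeholder path, n=1000, the usual grep filters applied upstream):

    # one-pass uniform sample of exactly n lines, O(n) memory
    grep -E "[0-9]" /path/to/largefile.txt | grep -Ev "mike|jane" \
        | awk -v n=1000 '
            BEGIN { srand() }
            NR <= n { pool[NR] = $0; next }     # fill the reservoir first
            {
                i = int(rand() * NR) + 1        # random slot in 1..NR
                if (i <= n) pool[i] = $0        # replace with probability n/NR
            }
            END { for (j = 1; j <= n; j++) if (j in pool) print pool[j] }'

If the input has fewer than n matching lines, the sketch simply prints them all; note the output comes out in reservoir order, not shuffled order.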