How can I get n random lines from very large files that can't fit in memory?
Also it would be great if I could add filters before or after the randomization.
In my case the specs are such that losing a few lines from the file won't be a big problem (each line only has about a 1 in 10000 chance of being picked anyway), but performance and resource consumption would be.
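The script below makes two streaming passes over the file (one to count the lines that pass the filters, one to sample), so nothing beyond the current line has to fit in memory, and it keeps each line independently with probability lineCnt/totalLineCnt. A minimal sketch of just that core sampling step, where the 10,000,000 total is only an assumed figure implied by the 1 in 10000 chance mentioned above:

# keep each line with probability p = wanted/total; both numbers here are assumed for illustration
awk -v p="$(echo "1000/10000000" | bc -l)" 'BEGIN { srand() } rand() <= p' /path/to/largefile.txt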
#!/bin/bash
# contents of bashScript.sh
file="$1"
lineCnt="$2"
filter="${3:-}"   # optional: keep only lines matching this pattern (empty matches everything)
nfilter="$4"      # optional: drop lines matching this pattern

# Use extended regexes (-E) so patterns like "mike|jane" work; skip the exclude step when
# no pattern was given, because `grep -v ""` would discard every line.
applyFilters() {
    if [ -n "$nfilter" ]; then grep -E "$filter" | grep -vE "$nfilter"; else grep -E "$filter"; fi
}

echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2
totalLineCnt=$(applyFilters < "$file" | wc -l)
echo "filtered count : $totalLineCnt" 1>&2
# probability of keeping any one line so that about lineCnt filtered lines survive
chances=$(echo "$lineCnt/$totalLineCnt" | bc -l)
echo "chances : $chances" 1>&2
# pass the probability in with -v: inside single quotes, $chances would never be expanded by the shell
awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances' "$file" | applyFilters | head -n "$lineCnt"
get a random sample of 1000 lines
bashScript.sh /path/to/largefile.txt 1000
only lines that contain numbers
bashScript.sh /path/to/largefile.txt 1000 "[0-9]"
exclude lines mentioning mike or jane
bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"