Tags: bash, command-line, random, line-processing

Get random lines from large files in bash


How can I get n random lines from very large files that don't fit in memory?

Also, it would be great if I could add filters before or after the randomization.


Update 1

In my case the specs are:

  • > 100 million lines
  • > 10 GB files
  • usual random batch size: 10,000-30,000 lines
  • 512 MB RAM, hosted Ubuntu Server 14.10

So losing a few lines from the file won't be a big problem, since any given line only has about a 1 in 10,000 chance of being picked anyway (10,000 samples out of 100 million lines), but performance and resource consumption would be a problem.
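
For reference, GNU coreutils already ships a sampler: shuf -n draws a fixed-size uniform sample, and the filters can simply be grep stages around it. Whether it fits the memory budget is version-dependent (older shuf releases load the whole input into memory; newer ones reservoir-sample when -n is given), so treat this as a sketch to verify on the target box, with a placeholder path:

    # draw 10000 uniformly random lines (GNU shuf)
    shuf -n 10000 /path/to/largefile.txt

    # filters go before (or after) the sampling step
    grep "[0-9]" /path/to/largefile.txt | shuf -n 10000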


Solution

  • #!/bin/bash
    # contents of bashScript.sh

    file="$1"
    lineCnt="$2"
    filter="${3:-}"                        # an empty pattern matches every line
    nfilter="$4"
    [ -n "$nfilter" ] || nfilter=$'\001'   # sentinel assumed absent from the data, so nothing is excluded by default
    echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2

    totalLineCnt=$(grep -E -- "$filter" "$file" | grep -Ev -- "$nfilter" | wc -l)
    echo "filtered count : $totalLineCnt" 1>&2

    chances=$(echo "$lineCnt/$totalLineCnt" | bc -l)
    echo "chances : $chances" 1>&2

    # the probability has to be handed to awk with -v: inside the single-quoted
    # program, a shell variable like $chances would never be expanded
    awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances' "$file" \
        | grep -E -- "$filter" | grep -Ev -- "$nfilter" | head -n "$lineCnt"
    

    Usage:

    get a 1000-line random sample

    bashScript.sh /path/to/largefile.txt 1000  
    

    only lines that contain digits

    bashScript.sh /path/to/largefile.txt 1000 "[0-9]"
    

    exclude lines containing mike or jane

    bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
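
The script keeps each line with probability lineCnt/totalLineCnt, so the sample size is only approximate: head trims any overshoot, but an undershoot simply yields fewer lines. If an exact count matters, a one-pass reservoir sample (Algorithm R) returns exactly n lines while holding only n lines in memory. A minimal sketch under the same assumptions (placeholder path, n=1000, the usual grep filters applied upstream):

    # one-pass uniform sample of exactly n lines, O(n) memory
    grep -E "[0-9]" /path/to/largefile.txt | grep -Ev "mike|jane" \
        | awk -v n=1000 '
            BEGIN { srand() }
            NR <= n { pool[NR] = $0; next }     # fill the reservoir first
            {
                i = int(rand() * NR) + 1        # random slot in 1..NR
                if (i <= n) pool[i] = $0        # replace with probability n/NR
            }
            END { for (j = 1; j <= n; j++) if (j in pool) print pool[j] }'

If the input has fewer than n matching lines, the sketch simply prints them all; note the output comes out in reservoir order, not shuffled order.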