Tags: bash, shuffle, sampling, command-line-tool

How do I sample 50 random files from my dataset in a shell script, with each file having the same probability of being chosen?


Does find /mnt/Dataset/ -type f | shuf -n 50 do the trick? Does shuf wait until it has read all the lines and then make a random selection? Does shuf give each line the same probability? Or should I use another tool?
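
For reference, a runnable form of the command in question, plus a sketch of a variant that survives filenames containing newlines (assuming GNU find and shuf, which support NUL-terminated records via -print0 and -z; the trailing ls -l is just a placeholder action):

    # The command as asked: list every file, keep 50 uniformly at random
    find /mnt/Dataset/ -type f | shuf -n 50

    # Sketch of a newline-safe variant using NUL-terminated records
    find /mnt/Dataset/ -type f -print0 | shuf -z -n 50 | xargs -0 ls -l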


Solution

  • When you are wondering how shuf interacts with a pipeline (does it wait for the pipeline to finish, or does it process data as soon as it is available?), you can write a test. The test looks like this:

    for ((i=0; i<20; i++)); do
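      # 1..9 arrive immediately; 10 arrives only after the 0.1 s sleep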
      (printf "%s\n" {1..9}; sleep 0.1; echo 10) | shuf | tr '\n' ' '
      echo
    done
    

    This test runs without the -n option, and you want a larger sample so you can look at the averages. If shuf emitted each line as soon as it arrived, the late 10 would end up last on every line; if it waits and shuffles everything, 10 should land in the last position only about 10% of the time. The next loop is better for testing:

    for ((i=0; i<10000; i++)); do
      (printf "%s\n" {1..9}; sleep 0.01; echo 10) | shuf | tr '\n' ' '
      echo
    done > sample.txt
    # How often is 10 the last number on a line? Expect about 1000 (10% of 10000)
    grep -c "10 $" sample.txt
    
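    The grep above only counts how often 10 lands in the last position. A sketch that tallies every position the late-arriving 10 ends up in (pos is just an awk array name chosen here); each of the 10 positions should get roughly 1000 hits if shuf waits for all of its input:

    awk '{ for (f = 1; f <= NF; f++) if ($f == 10) pos[f]++ }
         END { for (p = 1; p <= 10; p++) printf "position %d: %d\n", p, pos[p] + 0 }' sample.txt
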

    I also checked the distribution of the first number on each line:

    cut -d " " -f1 sample.txt | sort | uniq -c
       1040 1
        985 10
        976 2
       1012 3
        981 4
        999 5
       1043 6
        974 7
        979 8
       1011 9
    

    I did not run a formal test of the counts against the sample size, but it looks like a good uniform distribution.
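
    For a more formal check, a minimal chi-square goodness-of-fit sketch over those counts: each of the 10 buckets expects 10000/10 = 1000 hits, and for 9 degrees of freedom the 5% critical value is about 16.9. The counts above come out around 6, consistent with a uniform distribution:

    cut -d " " -f1 sample.txt | sort | uniq -c |
      awk '{ chi += ($1 - 1000)^2 / 1000 }
           END { printf "chi-square: %.2f (5%% critical value for df=9: ~16.9)\n", chi }'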