Does
find /mnt/Dataset/ -type f | shuf -n 50
do the trick? Does shuf wait to read all the lines and then make a random selection? Does shuf give each line the same probability? Or should I use another tool?
When you are wondering how shuf behaves in a pipeline (does it wait for the whole input to arrive, or does it process data as it becomes available?), you can write a test. The test will look like:
for ((i=0; i<20; i++)); do
(printf "%s\n" {1..9}; sleep 0.1; echo 10) | shuf | tr '\n' ' '
echo
done
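An even quicker way to see the buffering is to time a single run. This is a sketch; the 2-second sleep is an arbitrary choice, long enough to notice the pause:

```shell
# shuf cannot print anything until it has read all of its input, so this
# pipeline is silent for about 2 seconds and then emits all 10 numbers
# at once; a streaming tool would print 1-9 immediately.
time { (printf "%s\n" {1..9}; sleep 2; echo 10) | shuf; }
```

If the numbers appeared before the timer reached 2 seconds, shuf would be processing data as it arrives; in practice nothing is printed until the input is closed.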
The loop above omits the -n option, and to look at averages you want a larger sample. The next loop is better for testing:
for ((i=0; i<10000; i++)); do
(printf "%s\n" {1..9}; sleep 0.01; echo 10) | shuf | tr '\n' ' '
echo
done > sample.txt
# Look for how often 10 is the last number on a line
grep -c "10 $" sample.txt
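The grep only checks the last position. Reusing the same generator (with a smaller sample of 300 lines to keep the runtime short), this sketch tallies every position where the delayed value 10 lands:

```shell
# Build 300 shuffled lines where the value 10 arrives late, then tally
# which field it lands in. A buffering shuf spreads it roughly evenly
# (about 30 hits per position); a streaming tool would put all 300 hits
# in position 10.
for ((i = 0; i < 300; i++)); do
    (printf "%s\n" {1..9}; sleep 0.01; echo 10) | shuf | tr '\n' ' '
    echo
done |
awk '{ for (i = 1; i <= NF; i++) if ($i == 10) pos[i]++ }
     END { for (i = 1; i <= 10; i++) printf "position %2d: %d\n", i, pos[i] + 0 }'
```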
I also tested the distribution of the first number on each line:
cut -d " " -f1 sample.txt | sort | uniq -c
1040 1
985 10
976 2
1012 3
981 4
999 5
1043 6
974 7
979 8
1011 9
I did not formally check the deviations against the sample size, but the counts look consistent with a uniform distribution.
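A rough formal check can be done with a chi-squared statistic. This is a sketch with the counts above hard-coded; for 10 equally likely values the expected count is 1000 each, and with 9 degrees of freedom a statistic below about 16.92 is consistent with uniformity at the 5% level:

```shell
# First-field counts from the 10,000-line sample, in the order printed
# by sort | uniq -c above (values 1, 10, 2, 3, ..., 9).
counts="1040 985 976 1012 981 999 1043 974 979 1011"

# Chi-squared statistic against a uniform expectation: sum of
# (observed - expected)^2 / expected over all 10 values.
echo "$counts" | awk '{
    n = NF; total = 0
    for (i = 1; i <= n; i++) total += $i
    expected = total / n
    chi2 = 0
    for (i = 1; i <= n; i++) {
        d = $i - expected
        chi2 += d * d / expected
    }
    printf "chi2 = %.2f (df = %d)\n", chi2, n - 1
}'
# prints: chi2 = 5.99 (df = 9)
```

5.99 is well below 16.92, so these counts give no reason to doubt that shuf is uniform.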