xpath xquery exist-db

XQuery: randomly selecting files without duplicating the selection


In XQuery 3.1 (in eXist 4.7) I have 40 XML files, and I need to select 4 of them at random. However, I would like the four files to be different.

My files are all in the same collection ($data). I currently count the files, then call a randomising function (util:random($max as xs:integer)) four times, each time using the result as a position in the sequence of files:

let $filecount := count($data)
for $cnt in 1 to 4
let $pos := util:random($filecount)
return $data[position()=$pos]

But this often results in the same files being selected multiple times by chance.

Each file has a distinct @xml:id (in the root node of each file) which can allow me, if possible, to use that as some sort of predicate in recursion. But I'm unable to identify a method for somehow accruing the @xml:ids into a cumulative, recursive sequence.

Thanks for any help.


Solution

  • I think the standardized random-number-generator function and its permute function (https://www.w3.org/TR/xpath-functions/#func-random-number-generator) should give you better "randomness" and more diverse results, e.g.

    let $file-count := count($data)
    return $data[position() = random-number-generator(current-dateTime())?permute(1 to $file-count)[position() le 4]]
    

    I haven't tried that with your db/XQuery implementation, and there may also be ways to do it with the functions you currently use.

    For eXist-db, I guess one strategy is to call the random-number function until you have a sequence of distinct values of the wanted length. The following returns (at least in some tests with eXide) four distinct numbers on each call:
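
    A variation on the same idea, as an untested sketch: per the XPath 3.1 specification, permute accepts any sequence, so the file nodes themselves can be shuffled and the first four kept, which avoids the positional lookup. The collection path below is hypothetical, and whether eXist 4.7 implements random-number-generator is an assumption to verify.

    ```xquery
    xquery version "3.1";

    (: Hypothetical collection path; substitute the one that holds your 40 files. :)
    let $data := collection("/db/apps/example/data")/*

    (: Seeded generator; current-dateTime() is stable within one query execution,
       so the permutation is fixed for that execution. :)
    let $rng := random-number-generator(current-dateTime())

    (: Shuffle the nodes themselves, then keep the first four.
       The result comes back in shuffled order, not document order. :)
    return subsequence($rng?permute($data), 1, 4)
    ```

    Because a permutation contains each item exactly once, the four results cannot repeat.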

    declare function local:random-sequence($max as xs:integer, $length as xs:integer) as xs:integer+ {
        local:random-sequence((), $max, $length)
    };
    
    declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
        if (count($seq) = $length and count(distinct-values($seq)) = $length)
        then $seq
        else local:random-sequence((distinct-values($seq), util:random($max)), $max, $length)
    };
    
    let $file-count := 40
    return local:random-sequence($file-count, 4)
    

    Integrating that in the previous attempt would result in

    let $file-count := count($data)
    return $data[position() = local:random-sequence($file-count, 4)]
    

    As for your comment: I hadn't noticed that eXist's util:random function can return 0 and excludes the max value. Based on your comment and a further test, I guess you would rather want the function I posted above to be implemented as

    declare function local:random-sequence($seq as xs:integer*, $max as xs:integer, $length as xs:integer) as xs:integer+ {
        if (count($seq) = $length)
        then $seq
        else
            let $new-number := util:random($max + 1)
            return if ($seq = $new-number or $new-number = 0)
                   then local:random-sequence($seq, $max, $length)
                   else local:random-sequence(($seq, $new-number), $max, $length)
    };
    

    That way it hopefully now returns $length distinct values between 1 and the $max argument (inclusive). Note that $length must not exceed $max, otherwise the recursion never terminates.
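
    If the retry loop is a concern, one alternative is to remove each picked item from the pool before drawing again, so a duplicate cannot occur and the recursion always terminates. This is a minimal, untested sketch: local:sample is a hypothetical helper, the collection path is made up, and util:random($max) is assumed to return an integer from 0 to $max - 1 as described above (eXist binds the util prefix by default, so no import is shown).

    ```xquery
    xquery version "3.1";

    (: Hypothetical helper: draws up to $n distinct items from $pool
       by removing each pick from the pool before the next draw. :)
    declare function local:sample($pool as item()*, $n as xs:integer) as item()* {
        if ($n le 0 or empty($pool))
        then ()
        else
            (: Assumption: util:random($max) yields 0 .. $max - 1, so add 1
               to get a valid 1-based position in the pool. :)
            let $pos := util:random(count($pool)) + 1
            return ($pool[$pos], local:sample(remove($pool, $pos), $n - 1))
    };

    (: Hypothetical collection path; substitute your own. :)
    let $data := collection("/db/apps/example/data")/*
    return local:sample($data, 4)
    ```

    Each call shrinks the pool by one item, so at most $n draws are made, and asking for more items than exist simply returns the whole pool in random order.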