
Sampling from an array in Android (Kotlin)


I need an idea on how to do this. I'm not good at math. Maybe there is a built-in function which I haven't found yet.

I have an array which consists of 2048 data points. I need to get 250 values out of it.

I'm thinking of

2048 / 250 ≈ 8.19

which means I take a value at roughly every 8th position in the array.

Is there a function to do this?


Solution

  • Not that I'm aware of. I think the problem is balancing the number of iterations against the randomness of the sampling.

    So the naive approach

    dataSet.mapIndexedNotNull { i, data ->
        if (i % 8 == 0) data else null
    }
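
    A quick sanity check of what that filter produces — a minimal sketch, assuming a plain List<Int> as a stand-in for your DataType:

    fun main() {
        // Hypothetical data set of 2048 values, standing in for the real DataType.
        val dataSet = List(2048) { it }

        val every8th = dataSet.mapIndexedNotNull { i, data ->
            if (i % 8 == 0) data else null
        }

        println(every8th.size) // 256 — every 8th index of 2048 elements, a bit more than the 250 you want
    }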
    

    That would run through the whole array, even though you only need 250 iterations, not dataSet.size iterations. So what if we iterate 250 times and, on each iteration, jump 8 positions further into the data set?

    val sample = mutableListOf<DataType>()
    for (i in 1..250) {
        val positionInDataSet = (i * 8) - 1 // minus one adjusts for zero-based indexing (7, 15, ..., 1999)
        val case = dataSet[positionInDataSet]
        sample.add(case)
    }
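
    A minimal runnable version of that loop — a sketch assuming a plain IntArray as a stand-in for DataType:

    fun main() {
        // Hypothetical data set of 2048 values, standing in for the real DataType.
        val dataSet = IntArray(2048) { it }

        val sample = mutableListOf<Int>()
        for (i in 1..250) {
            val positionInDataSet = (i * 8) - 1 // 7, 15, ..., 1999 — always within 0..2047
            sample.add(dataSet[positionInDataSet])
        }

        println(sample.size)    // 250
        println(sample.take(3)) // [7, 15, 23]
    }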
    
    

    Another alternative would be to simply use the copy methods from the collections/arrays API, but the problem is that you lose the sampling:

    dataSet.copyOfRange(0, 250)
    

    copyOfRange doesn't sample the data in a pseudo-random way; it just takes the first 250 elements, so the result is biased. The upside is that array copy methods are usually very fast, since they are implemented as bulk memory copies.
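
    To make the bias concrete, a small sketch of that copy on a hypothetical IntArray:

    fun main() {
        // Hypothetical data set of 2048 values, standing in for the real DataType.
        val dataSet = IntArray(2048) { it }

        // copyOfRange is a plain prefix copy: fast, but not a sample.
        val firstChunk = dataSet.copyOfRange(0, 250)

        println(firstChunk.size)   // 250
        println(firstChunk.last()) // 249 — everything after index 249 is never seen
    }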

    Another option would be to randomize things even more: instead of taking every 8th element, pick a random position each time until we hit the desired sample size.

    import kotlin.random.Random

    val sample = mutableSetOf<DataType>()

    while (sample.size != 250) {
        val randomPosition = Random.nextInt(0, dataSet.size)
        val randomSelection = dataSet[randomPosition]
        sample.add(randomSelection)
    }
    

    Here we use a Set, because a Set guarantees unique elements, so you end up with 250 distinct, completely random elements from your data set. The problem with randomizing the position is that the same randomPosition can come up more than once, so you iterate over the data set more than 250 times; on larger data sets the number of wasted iterations grows quickly (and in the worst case is unbounded), which makes this the lowest-performing option.
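
    One way to avoid drawing the same position twice is to shuffle the index range once and take the first 250 — a sketch of that alternative, again with a hypothetical IntArray in place of DataType:

    fun main() {
        // Hypothetical data set of 2048 values, standing in for the real DataType.
        val dataSet = IntArray(2048) { it }

        // Shuffle the indices once, keep 250 of them, then read those elements:
        // no position can repeat, so the amount of work is fixed.
        val sample = dataSet.indices.shuffled().take(250).map { dataSet[it] }

        println(sample.size) // 250
    }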