
Preferentially Sampling Based upon Value Size


So, this is something I think I'm complicating far too much but it also has some of my other colleagues stumped as well.

I've got a set of areas represented by polygons and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right-skewed. Essentially I want to randomly sample them with sampling probabilities that are inversely related to their area. Rescaling the values to between zero and one (using the {x-min(x)}/{max(x)-min(x)} method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest areas are almost always the ones sampled.

I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure on how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either as that would introduce arbitrary bounds on the probability allocations.

Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?

# Data
n <- 500
df <- data.frame("ID" = 1:n, "AREA" = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Generate the sampling probability somehow based upon the AREA values, with smaller areas having higher sample probability:
prob_vector <- ??????

# Sampling:
s <- sample(df$ID, size = 1, prob = prob_vector)

Solution

  • There is no single best solution to this question, because a wide range of probability vectors is possible: you can give the curve any slope and curvature. In this small script, I simulate an extremely right-skewed distribution of areas (0-100 units) so that you can define and directly visualize any probability vector you want.

    # Simulate a strongly right-skewed set of areas, capped at 100 units
    area.dist <- rgamma(1000, shape = 1, rate = 3) * 40
    area.dist[area.dist > 100] <- 100
    hist(area.dist, main = "Probability functions")

    # Candidate probability curves over the area range, built on areas rescaled to [0, 1]
    area <- seq(0, 100, 0.1)
    scaled <- (area - min(area)) / (max(area) - min(area))
    prob_vector1 <- 1 - scaled             ## linear
    prob_vector2 <- 0.8 - 0.6 * scaled     ## low slope
    prob_vector3 <- 1 / (1 + scaled)^4     ## strong curve
    prob_vector4 <- 0.4 / (0.4 + scaled)   ## low curve
    legend("topright", c("linear", "low slope", "strong curve", "low curve"), col = c("red", "green", "blue", "orange"), lwd = 1)

    # Overlay the curves; the factor 500 only stretches them onto the histogram's count axis
    lines(area, prob_vector1 * 500, col = "red")
    lines(area, prob_vector2 * 500, col = "green")
    lines(area, prob_vector3 * 500, col = "blue")
    lines(area, prob_vector4 * 500, col = "orange")
    

    The output is: [figure: histogram of area.dist with the four probability curves overlaid]

    The red line is your solution (the linear rescaling); the other curves are adjustments that weaken the preference for small areas. Just change the numbers in the probability function until you get a curve that fits your expectations.
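
To connect these curves back to the data frame in the question, here is a minimal sketch that evaluates the "low curve" on the actual AREA values and passes the result to sample(). The helper name area_weight, its parameter k (0.4 is simply the constant from the low curve) and the fixed seed are assumptions made for illustration, not part of the original question or answer.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

# Data, generated as in the question
n <- 500
df <- data.frame(ID = 1:n, AREA = replicate(n, sum(rexp(n = 8, rate = 0.1))))

# Hypothetical helper: the "low curve" applied to real areas.
# Smaller k makes the weights fall off more steeply with area;
# larger k flattens them towards uniform.
area_weight <- function(area, k = 0.4) {
  scaled <- (area - min(area)) / (max(area) - min(area))  # rescale to [0, 1]
  k / (k + scaled)
}

# sample() treats prob as weights, so they do not need to sum to one
prob_vector <- area_weight(df$AREA)
s <- sample(df$ID, size = 1, prob = prob_vector)

# Rough check: repeated draws should favour smaller areas on average
draws <- sample(df$ID, size = 10000, replace = TRUE, prob = prob_vector)
mean_sampled <- mean(df$AREA[draws])
mean_all <- mean(df$AREA)
```

With this curve the largest area still keeps a weight of k / (k + 1) relative to the smallest, so no polygon is excluded; comparing mean_sampled with mean_all is one quick way to judge how strong the preference is before tuning k.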