I have a large dataset of 100MB and want to take a random sample of 500 data points. I tried the following, but the data is being repeated:
di = sorted(random.sample(current,s))
data.append(di)
If I've understood your question, you're trying to take a random sample of size 500 from a 100MB dataset containing repeats, and you would like the values in your sample to be unique.
The docs for random.sample() state:
If the population contains repeats, then each occurrence is a possible selection in the sample.
This means that to avoid repeats in the sample of size 500, we need to do something more than simply call random.sample().
One possibility is to create a collection of the unique values in the dataset and take our sample from that:
import random

# example dataset with many repeats: 100 million values drawn from 0..999
current = [random.randrange(1000) for _ in range(100_000_000)]
print('number of values in original data is', len(current))
s = 500
# sampling directly from the data with repeats
di = sorted(random.sample(current, s))
print('number of unique values in the sample of size', s, 'is', len(set(di)))
# sampling from the unique values instead
uniq = list(set(current))
print('number of unique values in original data is', len(uniq))
di = sorted(random.sample(uniq, s))
print('number of unique values in the sample of size', s, 'is', len(set(di)))
Output:
number of values in original data is 100000000
number of unique values in the sample of size 500 is 395
number of unique values in original data is 1000
number of unique values in the sample of size 500 is 500
This shows that if we sample from the original dataset current with repeats, there are fewer than 500 unique values in the sample of size 500. However, if we create a collection uniq containing only the unique values from the original dataset, we can reliably take a sample of size 500 with all unique values, assuming the number of unique values in the original dataset is at least 500; otherwise, as stated in the docs:
If the sample size is larger than the population size, a ValueError is raised.
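If you can't be sure your data contains at least 500 unique values, a small guard avoids that surprise ValueError. A minimal sketch, reusing the current/uniq/s names from above (the much smaller dataset size here is just an assumption to keep the example fast):

```python
import random

# stand-in dataset with repeats (smaller than the 100MB original for speed)
current = [random.randrange(1000) for _ in range(10_000)]
s = 500

# deduplicate, then check we actually have enough unique values to sample
uniq = list(set(current))
if len(uniq) < s:
    raise ValueError(f'only {len(uniq)} unique values available, cannot sample {s}')

di = sorted(random.sample(uniq, s))
print('sample size:', len(di), '- unique values in sample:', len(set(di)))
```

Alternatively, you could sample min(s, len(uniq)) values instead of raising, if a smaller sample is acceptable for your use case.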