I am trying to create a list which contains a random sample (more specifically, a stratified sample).
My data consists of a list of several million telephone numbers (one string each), which I split into a list containing 2 strings per number. The first string is the city code, by which the sample has to be stratified. I used
    unique = list(set(citycode))
to get all unique city codes from the main list (the first string of each entry, mainlist[i][0]).
Suppose 'unique' has ~1000 elements. For each of them I am trying to pick 5 random entries of 'mainlist' whose city code (the first string of the entry) equals that unique element. For each match, both strings of the 'mainlist' entry should be appended to a new list, 'randomlist', so the final list should contain 5000 telephone numbers.
I thought of using nested loops for this, but I am a beginner in Python teaching myself from online tutorials, and I haven't found a function or approach that solves this.
Any ideas or input would be greatly appreciated. Thank you!
Assuming two lists like:
    main = [(123, 'xxxxxxx'), ...]
    unique = [123, ...]
Then you can do something like:
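    from random import shuffle
    # shuffle in place, so the first matches found for each code are a random pick
    shuffle(main)
    out = []
    for u in unique:
        i = 0
        # lazily yields the entries of main whose area code equals u
        it = (x for x in main if x[0] == u)
        while i < 5:
            try:
                # main is left untouched: removing items from a list while
                # iterating over it makes the iterator skip elements
                out.append(next(it))
            except StopIteration:
                # fewer than 5 entries exist for this area code
                break
            i += 1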
out will contain a list of tuples like those found in main, up to 5 per unique area code (fewer than 5 if main contains fewer than 5 entries for that code), picked at random.
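With several million numbers and ~1000 codes, scanning main once per code can get slow. A possible alternative, sketched here on the assumption that main is a list of (code, number) tuples, is to group the entries by code in a single pass and then draw from each group with random.sample. The names stratified_sample, per_code and groups, and the toy data, are made up for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(main, per_code=5):
    """Return up to per_code randomly chosen entries for each city code."""
    # One pass over the data: collect the entries of each code in its own list.
    groups = defaultdict(list)
    for entry in main:
        groups[entry[0]].append(entry)

    out = []
    for entries in groups.values():
        # random.sample raises ValueError if asked for more items than exist,
        # so cap the sample size at the size of the group.
        k = min(per_code, len(entries))
        out.extend(random.sample(entries, k))
    return out

# Toy data (hypothetical): ten numbers for code "030", three for code "089".
main = [("030", str(n)) for n in range(10)] + [("089", str(n)) for n in range(3)]
sample = stratified_sample(main)
print(len(sample))  # 5 entries for "030" plus all 3 entries for "089"
```

This does not need the list of unique codes at all, because the dictionary keys play that role.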
UPDATE
Since you want to exclude area codes with too few numbers altogether, here's how you do that:
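    from random import shuffle
    from collections import Counter
    # count how many numbers each area code has
    c = Counter(x[0] for x in main)
    # keep only the entries whose area code occurs at least 5 times
    main = [x for x in main if c[x[0]] >= 5]
    shuffle(main)
    out = []
    for u in unique:
        if c[u] < 5:
            # this area code was filtered out above
            continue
        i = 0
        it = (x for x in main if x[0] == u)
        while i < 5:
            # safe: every remaining area code has at least 5 entries
            out.append(next(it))
            i += 1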
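To convince yourself that the result is what you asked for, you could count the area codes in out and compare them with what main can supply. This is only a sketch: check_sample and the toy data are hypothetical, and it assumes both lists hold (code, number) tuples.

```python
from collections import Counter

def check_sample(main, out, per_code=5):
    """True if out has exactly per_code entries for every area code that
    occurs at least per_code times in main, and no other area codes."""
    available = Counter(code for code, _ in main)
    drawn = Counter(code for code, _ in out)
    expected = {code: per_code
                for code, n in available.items() if n >= per_code}
    return dict(drawn) == expected

# Toy data (hypothetical): seven numbers for "030", only two for "089".
main = [("030", str(n)) for n in range(7)] + [("089", str(n)) for n in range(2)]
good = main[:5]            # five "030" entries, no "089"
bad = main[:5] + main[7:]  # also contains the under-represented "089"
print(check_sample(main, good))  # True
print(check_sample(main, bad))   # False
```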