python random sublist resampling

Select one random item from each nested list, with all selected items different from each other, and generate such a list n times (RESAMPLING)


I have a text file of tab-separated gene codes with a structure similar to this:

ENSG00000111111 ENSG00000111111 ENSG00000111111 ENSG00000111555 
ENSG00000111111 ENSG00000111111 ENSG00000111111 ENSG00000111222  
ENSG00000111111 ENSG00000111111 ENSG00000111111 ENSG00000333555 

and I want to build a list by selecting one item at random from each row, where the selected items must all be DIFFERENT from each other. I then want to repeat the process n times to obtain an output file with this structure:

ENSG00000111111 ENSG00000111222 ENSG00000333555
ENSG00000111555 ENSG00000111222 ENSG00000333555
ENSG00000111555 ENSG00000111222 ENSG00000111111
...

(Each row corresponds to one generated list of random items.) At the moment I have this script, where all_cand is the input text file:

#!/usr/bin/python
import sys
import os
import random
import itertools
import numpy as np
def rand_cand (all_cand):
    cand_list= []
    main_list = []
    cand_file= open(all_cand, "r")
    for _ in itertools.repeat(None, 10):
        for line in cand_file:
            cand_rows = line.split()
            cand_list.append(cand_rows)

        for item in cand_list:
            aux_old = np.random.choice(item, replace=False)
            if not aux_old in main_list:
                main_list.append(aux_old)
            else:
                aux_new = np.random.choice(item, replace=False)
                main_list.append(aux_new)
    print(main_list)

With my script, every generated list contains repetitions, and I think that is due to the if statement. I try to compare each item that is about to be appended against those already stored in the list, but it fails. Some of my wrong outputs are:

ENSG00000111111 ENSG00000111111 ENSG00000111111
ENSG00000111111 ENSG00000111111 ENSG00000111222
ENSG00000111111 ENSG00000111111 ENSG00000111111
ENSG00000111555 ENSG00000111111 ENSG00000111111
...

Thanks in advance! I hope my explanation of the problem is clear.


Solution

  • I haven't tested it yet, but this should do it

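    import csv
    import itertools
    import random

    n = 10  # number of lists to generate (the question's script uses 10); adjust as needed

    with open('file.txt') as infile:
        # Each line of the file becomes a list of gene codes
        data = list(csv.reader(infile, delimiter='\t'))

    answer = []
    # Column-index triples; assumes 4 columns per row and 3 picks per list
    combos = list(itertools.combinations_with_replacement([0,1,2,3], 3))
    for _ in range(n):
        selection = []
        # Redraw until the 3 selected items are all different
        while len(set(selection)) != 3:
            inds = random.choice(combos)
            rows = random.sample(data, 3)  # 3 distinct rows, in random order
            selection = [row[i] for i,row in zip(inds, rows)]
        answer.append(selection)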
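
If the output columns should stay in the same order as the input rows, a simpler variant is to draw one item per row and redraw the whole list whenever two picks coincide. The following is only a minimal sketch under assumptions: `resample_distinct`, its `max_tries` safety cap, and the inline `rows` sample are names I made up for illustration, and the sample rows are just the three example lines from the question rather than a real input file.

```python
import random


def resample_distinct(rows, n, max_tries=1000):
    """Return n lists, each holding one item per row with no repeats.

    rows: list of lists of candidate items (one inner list per input line).
    max_tries: hypothetical cap on redraws for a single output list.
    """
    results = []
    for _ in range(n):
        for _attempt in range(max_tries):
            # One independent draw per row, kept in row order
            picks = [random.choice(row) for row in rows]
            # Accept only if every pick is different from the others
            if len(set(picks)) == len(picks):
                results.append(picks)
                break
        else:
            # Reached only when no attempt produced distinct picks
            raise ValueError("no all-distinct selection found; "
                             "the rows may not allow one")
    return results


# Sample rows copied from the question's example input
rows = [
    ["ENSG00000111111", "ENSG00000111111", "ENSG00000111111", "ENSG00000111555"],
    ["ENSG00000111111", "ENSG00000111111", "ENSG00000111111", "ENSG00000111222"],
    ["ENSG00000111111", "ENSG00000111111", "ENSG00000111111", "ENSG00000333555"],
]

for picks in resample_distinct(rows, n=5):
    print("\t".join(picks))
```

To produce the output file, the same tab-joined lines can be written to an open file handle instead of printed. Redrawing like this is fine when distinct selections are reasonably likely, as in the example rows, but it can get slow if most rows share the same few gene codes.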