I have two .txt files. Both files are lists of strings with one string per row, e.g.
GRIM1
PHOXA2
SLITRK4
Both text files are ~ 20,000 rows long. I want to randomly sample 500 strings from file 1 and 700 strings from file 2.
Then I want to count the number of strings that overlap both these subsets.
Then I want to repeat this process 100 times and calculate the min, max and mean number of strings that overlap these subsets from the 100 resamplings.
I was trying to adapt some code that used to work for similar tasks, but I get an error:
Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'
This code was:
listA <- read.csv(file="file1.txt", header=F)
listB <- read.csv(file="file2.txt", header=F)
listA <- as.character(listA) # intended to give a character vector of gene names
listB <- as.character(listB)
res <- rep(NA, 100)
genesToDraw <- 500 # how many to select
genesToDraw2 <- 700 # if you want to take different number from second list
for(i in 1:length(res)){
  drawA <- sample(x=listA, size=genesToDraw, replace=FALSE)
  drawB <- sample(x=listB, size=genesToDraw2, replace=FALSE)
  res[i] <- length(intersect(drawA, drawB))
}
hist(res, breaks=20)
table(res)
max(res)
sum(res > 5) # how often i
Thanks in advance for your help and please let me know if I should clarify.
In response to comments: when I run dput(listA) and dput(listB) after the as.character part of the code, I get a bunch of comma-separated numbers as output. Here is a subset:
1100, 4576, 7394, 1343, 4997, 13807, 1233, 9580, 15254, 10466, 3333, 622, 11177, 4067, 4800, 7592, 5363, 9646, 11213, 14314, 2475, 8389, \n12559, 12808, 5248, 10423, 7856, 12976, 9695, 1674, 2090, 9369, 12089, 13952, 1218, 7966, 6949, 4088, 623, 4768, 2002, 11776, 14710, 5502, 6212, 7300, 2123, 7194, 2128, 1683, 14987, 4491, 2672, 10275, 9424, 997, 15506, 14307, 2644, 11508, 9272, 5107, 10146, 11693, 1802, 652, 13073, 4268, 5435, 718, 4845
Best regards,
Rubal
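As we discussed, since you are expecting strings, first set stringsAsFactors=FALSE in the read.csv calls so the gene names are not converted to factors. The numbers in your dput output look like factor codes, and as.character() on a whole data frame returns one string per column rather than one per row, so sample() was most likely being handed a vector of length 1, which would explain the "larger than the population" error.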
listA <- read.csv(file="file1.txt", header=FALSE, stringsAsFactors=FALSE)
listB <- read.csv(file="file2.txt", header=FALSE, stringsAsFactors=FALSE)
Now you will have two data frames, each with a single column of character strings. The sample function requires vectors, so we can convert our one-column data frames to vectors via
listA<-listA[,1]
listB<-listB[,1]
and that should get your code to run!
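For the min, max and mean over the 100 resamplings that you asked about, here is a minimal sketch of the whole procedure. Since I don't have your files, listA and listB are simulated here with made-up gene names (the pool, the seed and the names nA, nB, nRep and overlaps are my own assumptions, not something from your code); with your real data you would use the vectors built above, or read them with readLines() as shown in the comments.

```r
# Simulated stand-ins for the two files (hypothetical data, for illustration only).
set.seed(1)
pool  <- paste0("GENE", seq_len(30000))
listA <- sample(pool, 20000)
listB <- sample(pool, 20000)

# With the real files, readLines() should give plain character vectors directly:
# listA <- readLines("file1.txt")
# listB <- readLines("file2.txt")

nA   <- 500   # strings to draw from list A
nB   <- 700   # strings to draw from list B
nRep <- 100   # number of resamplings

# Each replicate draws both subsets without replacement and counts the overlap.
overlaps <- replicate(nRep, {
  drawA <- sample(listA, nA)
  drawB <- sample(listB, nB)
  length(intersect(drawA, drawB))
})

# Summary of the overlap counts across all resamplings.
c(min = min(overlaps), max = max(overlaps), mean = mean(overlaps))
```

replicate() just replaces the explicit for loop and the preallocated res vector; the loop in your question would work equally well once listA and listB are real character vectors.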