Resampling overlap between two lists

I have two .txt files. Each file is a list of strings, one string per row, e.g.:

GRIM1
PHOXA2
SLITRK4

Both text files are about 20,000 rows long. I want to randomly sample 500 strings from file 1 and 700 strings from file 2, then count the number of strings that appear in both subsets.
I then want to repeat this process 100 times and calculate the min, max and mean number of overlapping strings across the 100 resamplings.

I was trying to adapt some code that used to work for similar tasks, but I get an error:

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

This code was:

listA <- read.csv(file="file1.txt", header=F)
listB <- read.csv(file="file2.txt", header=F)

listA <- as.character(listA) # to check that you really have a vector of gene names #maybe you have to do: listA <- as.character(listA)
listB <- as.character(listB) 

res <- rep(NA, 100) 
genesToDraw <- 500 # how many to select 
genesToDraw2 <- 700 # if you want to take different number from second list

for(i in 1:length(res)){

drawA <- sample(x=listA, size=genesToDraw, replace=FALSE)
drawB <- sample(x=listB, size=genesToDraw2, replace=FALSE) # or size=genesToDraw2

res[i] <- length(intersect(drawA, drawB))
}

hist(res, breaks=20)
table(res)
max(res)
sum(res > 5) # how often i

Thanks in advance for your help and please let me know if I should clarify.

In response to comments: when I run dput(listA) and dput(listB) after the as.character part of the code, I get a bunch of comma-separated numbers as output. Here is a subset:

1100, 4576, 7394, 1343, 4997, 13807, 1233, 9580, 15254, 10466, 3333, 622, 11177, 4067, 4800, 7592, 5363, 9646, 11213, 14314, 2475, 8389, \n12559, 12808, 5248, 10423, 7856, 12976, 9695, 1674, 2090, 9369, 12089, 13952, 1218, 7966, 6949, 4088, 623, 4768, 2002, 11776, 14710, 5502, 6212, 7300, 2123, 7194, 2128, 1683, 14987, 4491, 2672, 10275, 9424, 997, 15506, 14307, 2644, 11508, 9272, 5107, 10146, 11693, 1802, 652, 13073, 4268, 5435, 718, 4845

Best regards,

Rubal

Solution

As we discussed, since you are expecting strings, first set the stringsAsFactors flag to FALSE in the read.csv calls so that the strings are read as characters rather than converted to factors:

listA <- read.csv(file="file1.txt", header=FALSE, stringsAsFactors=FALSE)
listB <- read.csv(file="file2.txt", header=FALSE, stringsAsFactors=FALSE)
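
To see why the original code failed, here is a small sketch using a made-up three-row data frame (the names df, collapsed and genes are illustrative, not from your files). Calling as.character on a whole data frame returns one element per column rather than one per row, so a one-column data frame collapses to a vector of length 1, and sample cannot draw 500 items from that without replacement.

```r
# Hypothetical stand-in for what read.csv returns for a one-column file
df <- data.frame(V1 = c("GRIM1", "PHOXA2", "SLITRK4"), stringsAsFactors = FALSE)

# as.character on the data frame gives one element per COLUMN
collapsed <- as.character(df)
length(collapsed)

# Extracting the column gives one element per ROW, which is what sample needs
genes <- df[, 1]
length(genes)
```

Comparing the two lengths on your own listA should show the same thing: the collapsed version is much shorter than the number of rows in the file.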

Now you will have two data frames, each with a single column of character strings. The sample function requires vectors, so we can convert the one-column data frames to vectors via

listA <- listA[, 1]
listB <- listB[, 1]

and that should get your code to run!
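
For the min, max and mean you asked about, here is a minimal sketch of the whole resampling. It assumes listA and listB are already character vectors; the simulated gene names below are made up so the example runs on its own, and nA, nB and nRep are just illustrative names for your 500, 700 and 100. I used replicate in place of the for loop, but the loop from your question should work equally well.

```r
set.seed(1)  # only so this sketch is reproducible

# Made-up gene lists standing in for the contents of file1.txt and file2.txt
listA <- paste0("GENE", 1:20000)
listB <- paste0("GENE", 10001:30000)

nA <- 500    # strings drawn from list A
nB <- 700    # strings drawn from list B
nRep <- 100  # number of resamplings

# Each replicate draws both subsets without replacement
# and counts the strings present in both
res <- replicate(nRep, {
  drawA <- sample(listA, size = nA, replace = FALSE)
  drawB <- sample(listB, size = nB, replace = FALSE)
  length(intersect(drawA, drawB))
})

# Summary of the overlap counts across all resamplings
c(min = min(res), max = max(res), mean = mean(res))
```

With your real data, drop the two paste0 lines and use the vectors you extracted from the data frames instead.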