I have a set of say, 100, genomic features for which I've created a fasta file with a 500 bp window around each. I've searched these windows for a DNA sequence and found an average of 1.5 sequences per individual 500 bp window in the feature set. By chance, I expect the sequence to be present once every 1024 bp, or on average ~0.49 of my sequence per 500 bp window.
My question is how can I determine whether the 1.5 binding sites per individual feature I've uncovered is significant or not, and obtain a p-value?
And as a follow up, if I use the same set of 100 windows and search for a different sequence with the same probability (1/1024) and determine that there are now an average of 0.9 sequences per individual window, how can I determine whether this is significantly different than the 1.5 for the sequence for which I searched above?
As a second follow up, if I search for the same two sequences above (both found on average 1/1024 base pairs) in a different set of 500 bp windows for a different feature type (say, n=50), how can I determine if the results of this search are significantly different than the results above (particularly if the difference between sequence A and sequence B in feature set 1 and feature set 2 is significant)?
Thank you in advance.
I ended up using simulations to answer all of the above questions. Generate windows of desired size, 500 bp in this case, of random genomic sequence. Search for motifs in X windows (where X = number of individuals in a feature set) and compare with results obtained searching for motifs in the features of interest. To Repeat with sample size equal to that of the second feature set being analyzed. To compare features with one another, do the analogous simulation and compare results.