Search code examples
rcorrelationdna-sequence

Correlation between groups and ranks over different samples with R


I'm studying some scores on DNA for which each position has a score. I would like to find a method to know whether some samples are more often likely to have a high score, not in general, but position per position. Some positions are not defined on all samples, and some samples don't have score for a given position.

data.frame('pos'=c(1,2,3,1,2,3,1,2,5), 'sample'=c('A','A','A','B','B','B','C','C','C'), 'score'=c(1,10,5,20,40,10,0.1,5,4))

I'd like to know using a spearman correlation (I'm looking for rankings as there is no real biological reasons to compare position 1 and 2 for instance) whether some samples are more likely to have the "top" scoring values. My difficulty is that I have actually two qualitative values : the sample ID and the position and only one quantitative. I don't manage to indicate to R that I want somehow to group the data by position and then have a ranking on each position to study the correlation of rankings.

Finally I'd like to have a spearman correlation score assessing in that dataset that sample B is the top-scorer on most of the positions.

Any idea on how to achieve that?

Thanks a lot !


Solution

  • Maybe this points in the a helpful direction.

    library(tidyverse)
    df = data.frame('pos'=c(1,2,3,1,2,3,1,2,3), # Using 3 as the last position
    'sample'=c('A','A','A','B','B','B','C','C','C'), 
    'score'=c(1,10,5,20,40,10,0.1,5,4))
    
    # Compute rank of each sample within each position
    ranked = df %>% group_by(pos) %>% 
      mutate(rank=rank(score, ties.method='min')) %>% 
      ungroup()
    
    # B seems to consistently score higher
    ggplot(ranked, aes(pos, rank, color=sample)) +
      geom_point(size=5)
    

    
    # Kruskal-Wallis rank sum test of the null hypothesis that the rankings
    # are from the same distribution for all samples.
    kruskal.test(ranked$rank, ranked$sample)
    #> 
    #>  Kruskal-Wallis rank sum test
    #> 
    #> data:  ranked$rank and ranked$sample
    #> Kruskal-Wallis chi-squared = 8, df = 2, p-value = 0.01832
    
    # Pairwise Wilcoxon test for B vs C
    df %>% filter(sample!='A') %>% 
      group_by(pos) %>% 
      mutate(rank=rank(score, ties.method='min')) %>% 
      ungroup() %>% 
      pivot_wider(id_cols='pos', names_from='sample', values_from='rank') %>% 
      {wilcox.test(.$B, .$C, paired=TRUE)}
    #> Warning in wilcox.test.default(.$B, .$C, paired = TRUE): cannot compute exact p-
    #> value with ties
    #> 
    #>  Wilcoxon signed rank test with continuity correction
    #> 
    #> data:  .$B and .$C
    #> V = 6, p-value = 0.1489
    #> alternative hypothesis: true location shift is not equal to 0
    

    If the scores are all drawn from the same distribution, I think you could do these same tests on the scores directly, without ranking.

    Created on 2020-01-10 by the reprex package (v0.3.0)