Expand.grid p-value matrix fill equal variables with NA


I had to run a large number of chi-square-style Fisher tests on the categorical variables in a dataset. Testing every pair of variables by hand would have taken far too long, so I found a function on here and adapted it to my needs. The first rows of the data:

> HRchi
# A tibble: 6 x 13
  Position   State Sex   MaritalDesc CitizenDesc HispanicLatino RaceDesc TermReason  EmploymentStatus  Department ManagerName RecruitmentSour~
  <chr>      <chr> <chr> <chr>       <chr>       <chr>          <chr>    <chr>       <chr>             <chr>      <chr>       <chr>           
1 Productio~ MA    "M "  Single      US Citizen  No             White    N/A-StillE~ Active            "Producti~ Michael Al~ LinkedIn        
2 Sr. DBA    MA    "M "  Married     US Citizen  No             White    career cha~ Voluntarily Term~ "IT/IS"    Simon Roup  Indeed          
3 Productio~ MA    "F"   Married     US Citizen  No             White    hours       Voluntarily Term~ "Producti~ Kissy Sull~ LinkedIn        
4 Productio~ MA    "F"   Married     US Citizen  No             White    N/A-StillE~ Active            "Producti~ Elijiah Gr~ Indeed          
5 Productio~ MA    "F"   Divorced    US Citizen  No             White    return to ~ Voluntarily Term~ "Producti~ Webster Bu~ Google Search   
6 Productio~ MA    "F"   Single      US Citizen  No             White    N/A-StillE~ Active            "Producti~ Amy Dunn    LinkedIn        
# ... with 1 more variable: PerformanceScore <chr>

The code I used to run the tests is as follows:

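# All ordered pairs of column names, kept as character rather than factor
col_combinations <- expand.grid(names(HRchi), names(HRchi), stringsAsFactors = FALSE)
# Fisher test for one pair of columns; returns the simulated p-value
# (1e6 Monte Carlo replicates) as a fixed-notation string
cor_test_wrapper <- function(col_name1, col_name2, data_frame) {
  format(fisher.test(data_frame[[col_name1]], data_frame[[col_name2]],
                     simulate.p.value = TRUE, B = 1e6)$p.value, scientific = FALSE)
}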

p_vals <- mapply(cor_test_wrapper, 
                col_name1 = col_combinations[[1]], 
                col_name2 = col_combinations[[2]], 
                MoreArgs = list(data_frame = HRchi))

Ficher.pvalue.matrix <- matrix(p_vals, ncol(HRchi), ncol(HRchi), dimnames = list(names(HRchi), names(HRchi)))
Ficher.pvalue.matrix

This returns a matrix of the p-values:

   rowname Position State Sex   MaritalDesc CitizenDesc HispanicLatino RaceDesc TermReason EmploymentStatus Department ManagerName RecruitmentSour~
   <chr>   <chr>    <chr> <chr> <chr>       <chr>       <chr>          <chr>    <chr>      <chr>            <chr>      <chr>       <chr>           
 1 Positi~ 0.00000~ 0.00~ 0.31~ 0.8194522   0.6830553   0.03777396     0.16237~ 0.9216931  0.01563398       0.0000009~ 0.00000099~ 0.000002999997  
 2 State   0.00000~ 0.00~ 0.14~ 0.5327625   0.4954165   0.4240866      0.00748~ 0.980687   0.8377042        0.0000009~ 0.00000099~ 0.02947497      
 3 Sex     0.31226~ 0.14~ 0.00~ 0.6979593   0.6987973   0.8145132      0.94932~ 0.6053784  0.959038         0.2443258  0.06263294  0.1271179       
 4 Marita~ 0.81893~ 0.53~ 0.69~ 0.00000099~ 0.9265121   0.5331945      0.48005~ 0.0059059~ 0.008646991      0.7705712  0.8863871   0.2533087       
 5 Citize~ 0.68347~ 0.49~ 0.70~ 0.9270521   0.00000099~ 1              0.05806~ 0.1407349  0.2222708        0.4063666  0.8475872   0.1891118       
 6 Hispan~ 0.03778~ 0.42~ 0.81~ 0.5330425   1           0.000000999999 0.04130~ 0.8368642  1                0.05423295 0.1162419   0.06414394      
 7 RaceDe~ 0.16164~ 0.00~ 0.94~ 0.4804555   0.05764794  0.04088996     0.00000~ 0.972402   0.8328322        0.08990291 0.01743098  0.000000999999  
 8 TermRe~ 0.92143~ 0.98~ 0.60~ 0.005702994 0.1414139   0.8366842      0.97238  0.0000009~ 0.000000999999   0.2481378  0.7842482   0.0002929997    
 9 Employ~ 0.01571~ 0.83~ 0.95~ 0.008722991 0.2230458   1              0.83268~ 0.0000009~ 0.000000999999   0.0025569~ 0.001606998 0.000000999999  
10 Depart~ 0.00000~ 0.00~ 0.24~ 0.7694292   0.4063906   0.05454395     0.09036~ 0.2486848  0.002619997      0.0000009~ 0.00000099~ 0.000000999999  
11 Manage~ 0.00000~ 0.00~ 0.06~ 0.8851031   0.8472942   0.1168469      0.01726~ 0.7852542  0.001648998      0.0000009~ 0.00000099~ 0.000001999998  
12 Recrui~ 0.00000~ 0.02~ 0.12~ 0.2529637   0.1878758   0.06357094     0.00000~ 0.0003429~ 0.000002999997   0.0000009~ 0.00000099~ 0.000000999999  
13 Perfor~ 0.76044~ 0.56~ 0.47~ 0.9184571   0.7584852   1              0.15887~ 0.06789893 0.003164997      0.6032454  0.2900097   0.3136187       
# ... with 1 more variable: PerformanceScore <chr>

What I want to know is whether everything above the diagonal (the cells where Position meets Position, State meets State, and so on) can be set to NA. Those cells repeat the comparisons already shown below the diagonal, so blanking them would make the result less confusing.


Solution

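  • You could use upper.tri, which returns a logical matrix that is TRUE for every cell above the diagonal, and assign NA to those cells:

    Ficher.pvalue.matrix[upper.tri(Ficher.pvalue.matrix)] <- NA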
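
To try the same idea without the HR data, here is a minimal sketch on a toy 3 x 3 character matrix. The names vars, m, lower_only and no_diag, and the values themselves, are made up for illustration and are not from the question. The sketch also uses the diag = TRUE argument of upper.tri, in case the diagonal (each variable tested against itself) should be blanked as well.

```r
# Toy symmetric matrix standing in for Ficher.pvalue.matrix.
# The values are invented; they are strings because format() returns strings.
vars <- c("A", "B", "C")
m <- matrix(c("0.001", "0.20",  "0.35",
              "0.20",  "0.001", "0.04",
              "0.35",  "0.04",  "0.001"),
            nrow = 3, dimnames = list(vars, vars))

# Set everything strictly above the diagonal to NA.
lower_only <- m
lower_only[upper.tri(lower_only)] <- NA

# Variant: also set the diagonal itself to NA.
no_diag <- m
no_diag[upper.tri(no_diag, diag = TRUE)] <- NA

# na.print = "" shows the NA cells as blanks when printing.
print(lower_only, na.print = "", quote = FALSE)
```

Working on a copy (lower_only, no_diag) keeps the full matrix available in case the complete set of p-values is needed later.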