Search code examples
rdplyrdatatablecontingencystatistical-test

Create 2x2 Conditional Table from Counting Instances in Dataframe


I have a dataframe that is broken down by the gender of an author, their role in a project, and an identifier (PMID) (see below).

What I need is to create a 2x2 contingency table so I can calculate an odds ratio for the association between being a female first author and having a female senior author. To get that, I need to calculate the following:

  • A: Number of times first author is female AND senior author is female
  • B: Number of times first author is female AND senior author is male
  • C: Number of times first author is male AND senior author is male
  • D: Number of times first author is male AND senior author is female (and obviously drop cases where there is only a senior or only a first author per PMID)

I have grouped the table (see below) by PMID, so I really just need to figure out how to count each instance of the above. Having a hard time, would greatly appreciate any help!


# A tibble: 178,056 x 3
# Groups:   pmid [101,907]
    gender authorship pmid    
    <chr>  <chr>      <chr>   
  1 male   First      18958667
  2 male   Senior     18958667
  3 male   First      18958651
  4 male   First      18751818
  5 male   Senior     18751818
  6 male   First      18751811
  7 male   Senior     18751811
  8 female First      18751810
  9 female Senior     18751810
 10 male   First      18088800
 11 male   Senior     18088800
 12 male   First      17710072
 13 female First      17977216
 14 male   Senior     17762065
 15 male   First      17611457
 16 male   First      17611433
 17 male   First      17532688
 18 male   Senior     17532688
 19 female First      17405310
 20 male   Senior     17386862
 21 female First      17319096
 22 male   Senior     17319096
 23 female First      17300028
 24 male   First      17282480
 25 female First      17177771
 26 male   First      17124681
 27 female First      17093906
 28 female First      17042011
 29 male   Senior     17042011
 30 female First      17042010
 31 male   Senior     17042010
 32 female First      17042006
 33 male   Senior     17042006
 34 female First      17042003
 35 female First      17042002
 36 male   Senior     17042002
 37 male   First      17042001
 38 female First      17041999
 39 male   Senior     17041997
 40 female First      17041995
 41 female First      17041994
 42 female First      17041993
 43 female Senior     17041993
 44 female First      17041992
 45 female Senior     17041992
 46 female First      17041991
 47 male   First      17041990
 48 male   Senior     17041990
 49 male   First      17041989
 50 male   Senior     17041989


Solution

  • Tidy solution with golf-coding:

    library(tidyr)
    contig_table <- mydf %>% 
      spread(authorship, gender) %>% 
      #Only need drop_na() if data is incomplete
      #Given that you have a lot more rows I assume this will not be needed
      drop_na() %$% 
      table(First, Senior)