Search code examples
rmatrixdata-cleaningadjacency-matrix

Creating adjacency matrix with dirty dataset


This is the second time I am posting this question having deleted the first due to lack of reproducibility.

I referenced previously answered questions (Creating Adjacency Matrix and Social Network Graph, Creating adjacency matrix from raw data for centrality, Clean one column from long and big data set) but am struggling between the data cleaning and then creating the matrix.

Here is part of the df I am working from -

Species     Association              
1 RC          SKS/BW                   
2 BW          Sykes, rc                
3 SKS         Babo/bw                  
4 RC          baboon, mangabey         
5 Mang        red colobus, bw, sykes   
6 SKS         babo/red duiker

I am working on creating a simple social network matrix to answer "who is associating with whom at what frequency".

To clean the data I have selected the needed columns (Species and Association) and created a column to indicate the specific site where this data was collected

df.clean <-  mutate(df, Association=fct_collapse(Association, 
  BW=c("SKS/BW" ,"Babo/bw", "red colobus, bw, sykes"), 
  RC=c("Sykes, rc" ,"red colobus, bw, sykes"), 
  SKS=c("SKS/BW", "Sykes, rc", "red colobus, bw, sykes"), 
  Mang=c("baboon, mangabey"), 
  BABO=c("Babo/bw", "baboon, mangabey", "babo/red duiker"), 
  RD=c("babo/red duiker")) %>% 
select(Species, Association) %>% 
add_column(Site = "Protected") %>% 
filter(Species!= "RD", Association!= "RD") %>% 
mutate(Species = factor(as.character(Species)))

However, when I look at the column "Association" after this step, I am only seeing one species value (i.e. bw instead of bw,rc) in the entire column.

I assume I have buggered up my dataset by using the 'fct_collapse()' function while cleaning? I am looking for an output dataframe like this -

Species     Association              Site
1 RC          SKS, BW                  Protected
2 BW          SKS, RC                  Protected
3 SKS         BABO, BW                 Protected
4 RC          BABO, Mang               Protected
5 Mang        RC, BW, SKS              Protected
6 SKS         BABO                     Protected

This brings me to my first question - what's the best way to clean dirty data like this while retaining the multiple values of information in the column? I am trying to create a dataframe like the above example assuming I will then need to code both the species and association column to numeric values to create my matrix. Will this work as written, or do I need to extract data from the column and create new columns?

I'm relatively new to r, so please let me know if I'm not making any sense. Any advice is greatly appreciated, and I apologize if there's any confusion.

Upon running code from the kind commenter, I am running into coding issues with the association column. Everything runs great, except, there are "NA"s for association anytime there was a "R/c monkeys" or "B/W colobus", essentially anytime there was a '/' in the naming of the association. Here

troubleshooting dput sample dataframe

structure(list(Species = structure(c(2L, 4L, 5L, 2L, 4L, 1L, 
5L, 4L, 5L, 4L), .Label = c("BABO", "BW", "Mang", "RC", "SKS"
), class = "factor"), Association = c("r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(12, 
12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Dput output looks like -

Species    Association                  Year
<fctr>     <chr>                        <dbl>
BW         r/c monkeys                  12
RC         b/w colobus                  12
SKS        b/w colobus/R/c monkeys      12
BW         sykes/R/c monkeys            12
RC         sykes/b/w colobus            12
BABO       .                            12
SKS        .                            12
RC         .                            12
SKS        r/c monkeys                  12
RC         sykes monkeys                12

Ideal dput output -

Species    Association                  Year
<fctr>     <chr>                        <dbl>
BW         RC                           12
RC         BW                           12
SKS        BW, RC                       12
BW         SKS, RC                      12
RC         SKS, BW                      12
BABO       NA                           12
SKS        NA                           12
RC         NA                           12
SKS        RC                           12
RC         SKS                          12

Solution

  • Using strsplit() and toString(). Just use a regular expression like '(?<=\\w{2})\\/|,\\s' that covers all cases. Enjoy the playground*.

    *Note that only single escapes \ are required there, while double escapes \\ are required in R.

    regex <- '(?<=\\w{2})\\/|,\\s'
    

    To fix the different name versions, we first use toupper() which already takes away some of the pain. Then use a dictionary that you neatly can paste together in your script using read.table(),

    dict <- read.table(header=TRUE, text='
                from  to
                 "."  .
       "B/W COLOBUS"  BW
              "BABO"  BABO
            "BABOON"  BABO
                "BW"  BW
          "MANGABEY"  MANG
       "R/C MONKEYS"  RC
                "RC"  RC
       "RED COLOBUS"  COLO
        "RED DUIKER"  DUIK
               "SKS"  SKS
             "SYKES"  SKS
     "SYKES MONKEYS"  SKS
    ') 
    

    where you might find that one helpful:

    spf <- '"%s"'
    # spf <- '%s'  ## for the playground (see above)
    
    data.frame(from=
                 sprintf(spf, 
                         sort(unique(unlist(
                           strsplit(toupper(dat$Association), regex, perl=TRUE)
                         )))
                 )
    ) |> print(row.names=FALSE)
    

    Then strsplit, use the dictionary to replace names, and clean it a little:

    res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
      lapply(\(x) dict[match(x, dict$from), ]$to) |>
      sapply(toString) |>
      {\(.) replace(., . == ".", NA)}() |>
      data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
      setNames(c('association', 'site', 'species', 'year')) |>
      subset(select=c(3, 1, 2, 4))
    
    res
    #    species   association      site year
    # 1       RC       SKS, BW Protected   NA
    # 2       BW       SKS, RC Protected   NA
    # 3      SKS      BABO, BW Protected   NA
    # 4       RC    BABO, MANG Protected   NA
    # 5     MANG COLO, BW, SKS Protected   NA
    # 6      SKS    BABO, DUIK Protected   NA
    # 7       BW            RC Protected   12
    # 8       RC            BW Protected   12
    # 9      SKS        BW, RC Protected   12
    # 10      BW       SKS, RC Protected   12
    # 11      RC       SKS, BW Protected   12
    # 12    BABO          <NA> Protected   12
    # 13     SKS          <NA> Protected   12
    # 14      RC          <NA> Protected   12
    # 15     SKS            RC Protected   12
    # 16      RC           SKS Protected   12
    

    Note: R >= 4.1 used.


    Data:

    dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS", 
    "BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC"
    ), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey", 
    "red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus", 
    "b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
    ".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(NA, 
    NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", 
    "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", 
    "7", "8", "9", "10"), class = "data.frame")