This is the second time I am posting this question having deleted the first due to lack of reproducibility.
I referenced previously answered questions (Creating Adjacency Matrix and Social Network Graph, Creating adjacency matrix from raw data for centrality, Clean one column from long and big data set) but am struggling between the data cleaning and then creating the matrix.
Here is part of the df I am working from -
Species Association
1 RC SKS/BW
2 BW Sykes, rc
3 SKS Babo/bw
4 RC baboon, mangabey
5 Mang red colobus, bw, sykes
6 SKS babo/red duiker
I am working on creating a simple social network matrix to answer "who is associating with whom at what frequency".
To clean the data I have selected the needed columns (Species and Association) and created a column to indicate the specific site where this data was collected
df.clean <- mutate(df, Association=fct_collapse(Association,
BW=c("SKS/BW" ,"Babo/bw", "red colobus, bw, sykes"),
RC=c("Sykes, rc" ,"red colobus, bw, sykes"),
SKS=c("SKS/BW", "Sykes, rc", "red colobus, bw, sykes"),
Mang=c("baboon, mangabey"),
BABO=c("Babo/bw", "baboon, mangabey", "babo/red duiker"),
RD=c("babo/red duiker")) %>%
select(Species, Association) %>%
add_column(Site = "Protected") %>%
filter(Species!= "RD", Association!= "RD") %>%
mutate(Species = factor(as.character(Species)))
However, when I look at the column "Association" after this step, I am only seeing one species value (i.e. bw instead of bw,rc) in the entire column.
I assume I have buggered up my dataset by using the 'fct_collapse()' function while cleaning? I am looking for an output dataframe like this -
Species Association Site
1 RC SKS, BW Protected
2 BW SKS, RC Protected
3 SKS BABO, BW Protected
4 RC BABO, Mang Protected
5 Mang RC, BW, SKS Protected
6 SKS BABO Protected
This brings me to my first question - what's the best way to clean dirty data like this while retaining the multiple values of information in the column? I am trying to create a dataframe like the above example assuming I will then need to code both the species and association column to numeric values to create my matrix. Will this work as written, or do I need to extract data from the column and create new columns?
I'm relatively new to r, so please let me know if I'm not making any sense. Any advice is greatly appreciated, and I apologize if there's any confusion.
Upon running code from the kind commenter, I am running into coding issues with the association column. Everything runs great, except, there are "NA"s for association anytime there was a "R/c monkeys" or "B/W colobus", essentially anytime there was a '/' in the naming of the association. Here
troubleshooting dput sample dataframe
structure(list(Species = structure(c(2L, 4L, 5L, 2L, 4L, 1L,
5L, 4L, 5L, 4L), .Label = c("BABO", "BW", "Mang", "RC", "SKS"
), class = "factor"), Association = c("r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(12,
12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Dput output looks like -
Species Association Year
<fctr> <chr> <dbl>
BW r/c monkeys 12
RC b/w colobus 12
SKS b/w colobus/R/c monkeys 12
BW sykes/R/c monkeys 12
RC sykes/b/w colobus 12
BABO . 12
SKS . 12
RC . 12
SKS r/c monkeys 12
RC sykes monkeys 12
Ideal dput output -
Species Association Year
<fctr> <chr> <dbl>
BW RC 12
RC BW 12
SKS BW, RC 12
BW SKS, RC 12
RC SKS, BW 12
BABO NA 12
SKS NA 12
RC NA 12
SKS RC 12
RC SKS 12
Using strsplit()
and toString()
. Just use a regular expression like '(?<=\\w{2})\\/|,\\s'
that covers all cases. Enjoy the playground*.
*Note that only single escapes \
are required there, while double escapes \\
are required in R.
regex <- '(?<=\\w{2})\\/|,\\s'
To fix the different name versions, we first use toupper()
which already takes away some of the pain. Then use a dict
ionary that you neatly can paste together in your script using read.table()
,
dict <- read.table(header=TRUE, text='
from to
"." .
"B/W COLOBUS" BW
"BABO" BABO
"BABOON" BABO
"BW" BW
"MANGABEY" MANG
"R/C MONKEYS" RC
"RC" RC
"RED COLOBUS" COLO
"RED DUIKER" DUIK
"SKS" SKS
"SYKES" SKS
"SYKES MONKEYS" SKS
')
where you might find that one helpful:
spf <- '"%s"'
# spf <- '%s' ## for the playground (see above)
data.frame(from=
sprintf(spf,
sort(unique(unlist(
strsplit(toupper(dat$Association), regex, perl=TRUE)
)))
)
) |> print(row.names=FALSE)
Then strsplit
, use the dictionary to replace names, and clean it a little:
res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
lapply(\(x) dict[match(x, dict$from), ]$to) |>
sapply(toString) |>
{\(.) replace(., . == ".", NA)}() |>
data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
setNames(c('association', 'site', 'species', 'year')) |>
subset(select=c(3, 1, 2, 4))
res
# species association site year
# 1 RC SKS, BW Protected NA
# 2 BW SKS, RC Protected NA
# 3 SKS BABO, BW Protected NA
# 4 RC BABO, MANG Protected NA
# 5 MANG COLO, BW, SKS Protected NA
# 6 SKS BABO, DUIK Protected NA
# 7 BW RC Protected 12
# 8 RC BW Protected 12
# 9 SKS BW, RC Protected 12
# 10 BW SKS, RC Protected 12
# 11 RC SKS, BW Protected 12
# 12 BABO <NA> Protected 12
# 13 SKS <NA> Protected 12
# 14 RC <NA> Protected 12
# 15 SKS RC Protected 12
# 16 RC SKS Protected 12
Note: R >= 4.1 used.
Data:
dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS",
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey",
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(NA,
NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1",
"2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61",
"7", "8", "9", "10"), class = "data.frame")