Search code examples
rdataframedplyrtidyversedata-wrangling

Expanding dataframe to include non existing values


I have a dataframe that looks like this:

Family Order Class Presence Year Site Location Lat Long
Aeshnidae Odonata Insecta 0 2021 KAV01 NASS -17.4 18.5
Aeshnidae Odonata Insecta 0 2023 KAV01 NASS -17.4 18.5
Aeshnidae Odonata Insecta 1 2021 KAV02 NASS -17.7 18.7
Aeshnidae Odonata Insecta 0 2023 KAV02 NASS -17.7 18.7
Aeshnidae Odonata Insecta 0 2021 KAV03 NASS -17.8 19.1
Aeshnidae Odonata Insecta 0 2023 KAV03 NASS -17.8 19.1

There are a number of unique families, and I have presence values of 0 and 1, but the family value is not fully covered. This is, not all family values have assigned a value of 0 or 1 of presence. Let me explain more, in all the dataset there is a unique number of families = 100, yet the site KAV03 has values of presence/absence for 90. What I want is to make sure that the 10 remaining families are also present, with, of course, a presence value of 0. With this expansion of the data I also want to keep the remaining variables as they are.

I hope I explained well enough, let me know if you need more info. Here is a code I have tried but failed:

MorphoData <- expand.grid(
  Site = unique(MorphoData$Site), 
  Family = unique(MorphoData$Family), 
  Year = unique(MorphoData$Year)
) %>%
  left_join(MorphoData, by = c("Site", "Family", "Year")) %>%
  group_by(Site, Family, Year) %>%
  mutate(
    Presence = replace_na(Presence, 0)  # Ensure missing Presence values are 0
  ) %>%
  group_by(Site, Year) %>%
  fill(everything(.), .direction = "downup") %>%  # Fill missing taxonomy/spatial data
  ungroup()

Here is an hypothetical example of what I have:

data <- data_frame(Site = c("KAV01", "KAV01", "KAV01", "KAV01", "KAV01", "KAV01", "KAV01",
                            "KAV02", "KAV02", "KAV02", "KAV02", "KAV02", "KAV02", "KAV02",
                            "KAV03", "KAV03", "KAV03", "KAV03", "KAV03", "KAV03", "KAV03"),
                   Family = sample(c("Fam1", "Fam2", "Fam3", "Fam4", "Fam5", "Fam6"), 21, replace = TRUE),
                   Year = sample(c(2021, 2022, 2023), 21, replace = TRUE),
                   Presence = sample(c(0, 1), 21, , replace = TRUE),
                   Lon = rnorm(n = 21, mean = 5, sd = 1),
                   Lat = rnorm(n = 21, mean = 2, sd = 0.3))

You will see that the sites will not have values for all families. I want the families that are not included for a given site to be so with a presence value of 0.

Thank you,


Solution

  • Try complete

    library(tidyr)
    
    complete(data, Site, Family, Year, fill = list(Presence = 0))