How to create a loop with multiple condition statements and groupings in R?


I have a large dataset (df), which looks something like this (simplified to not include all bioclimatic variables, but just imagine six columns rather than two beginning with "bio"):

                        name   longitude  latitude time_bp       bio01     bio12
1           Species A          24.633301 -27.61670       0  18.5594425  467.2066
2           Species A          25.549999 -28.66670       0  17.1183109  487.5667
3           Species A          29.083300  25.80000       0  21.7980595   28.1200
4           Species B          23.033300 -28.20560       0  17.4872379  398.5200
5           Species B          25.500000  23.00000       0  21.3069401   72.9600
6           Species C          24.633301 -27.61670       0  18.5594425  467.2066
7           Species B          29.600000  -0.13500       0  23.3799686  973.3167
8           Species C          33.750000  16.25000       0  28.5797062  137.5567
9           Species A          33.750000  16.25000       0  28.5797062  137.5567
10          Species C          33.750000  16.25000       0  28.5797062  137.5567
11          Species D          33.750000  16.25000       0  28.5797062  137.5567
12          Species D          33.750000  16.25000       0  28.5797062  137.5567
13          Species A          33.750000  16.25000       0  28.5797062  137.5567
14          Species B          33.750000  16.25000       0  28.5797062  137.5567
15          Species E          33.750000  16.25000       0  28.5797062  137.5567

The full dataset has multiple time stamps, species, and locations. I am interested in measuring climatic niche breadth. Because the temporal resolution of my data is coarser than that of my climatic data, I am averaging the climatic data across multiple time stamps. Most of the time, if there is data for one time stamp, there is data for all of them, but that is not always the case. There are also several cases where a species has no data for all or most time stamps, or where multiple points fall within the same grid cell and thus have the same bioclimatic information. I am excluding species that do not have at least five points with unique bioclimatic data. Thus far I have been going through each species manually to see whether there are five unique points, and then doing the following:

Splitdf <- split(df, df$name)

SpeciesADat <- Splitdf$`Species A`
SpeciesADat <- split(SpeciesADat, SpeciesADat$latitude)
SpeciesADatBio01List <- c(mean(SpeciesADat$`42.611382`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.457764`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.450432`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.223461`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.185169`$bio01, na.rm = TRUE))
SpeciesADatBio01List <- na.omit(SpeciesADatBio01List)
SpeciesADatB01Breadth <- max(SpeciesADatBio01List) - min(SpeciesADatBio01List)
SpeciesADatBi012List <- c(mean(SpeciesADat$`42.611382`$bio12, na.rm = TRUE),
                                mean(SpeciesADat$`-44.457764`$bio12, na.rm = TRUE),
                                mean(SpeciesADat$`-44.450432`$bio12, na.rm = TRUE),
                                mean(SpeciesADat$`-44.223461`$bio12, na.rm = TRUE),
                                mean(SpeciesADat$`-44.185169`$bio12, na.rm = TRUE))
SpeciesADatBi012List <- na.omit(SpeciesADatBi012List)
SpeciesADatB012Breadth <- max(SpeciesADatBi012List)-min(SpeciesADatBi012List)
SpeciesAData <- data.frame(Bio01Breadth=c(SpeciesADatB01Breadth),
                           Bio12Breadth=c(SpeciesADatB012Breadth),
                           Species=c("Species A"))

The plan is to merge the SpeciesXData data frames that I create for each species into a final product that looks like this:

  Bio01Breadth Bio12Breadth            Species
1      32.9588     1912.312          Species A
2      3.878775    248.6758          Species B
3      29.51849    840.4629          Species C

Is there a way to automate this process, so I can go through all species with one loop while also cleaning the data in that loop?


Solution

  • You can use the tidyverse to get the desired output. The main steps are: 1) transform your data into a long format, 2) identify the species that have at least x different latitudes, and 3) average the values from the same latitude and then calculate max - min.

    library(dplyr)
    library(tidyr)
    
    df <- df |>
      pivot_longer(cols = starts_with("bio"),
                   names_prefix = "bio",
                   names_to = "biovar",
                   values_to = "val")
    
    check5 <- df |>
      # Group by species, latitude, and bio variable
      group_by(name, latitude, biovar) |>
      # Get first value just to count number of different latitudes
      slice_head(n = 1) |>
      ungroup() |>
      group_by(name, biovar) |>
      # count
      count() |>
      # Keep species with at least x different latitudes
      # (x = 2 here to suit the small sample; the question asks for 5)
      filter(n >= 2) |>
      # Get species' names
      pull(name) |>
      unique()
    
    df |>
      # Remove species that do not have the minimum number of different latitudes
      filter(name %in% check5) |>
      # Group by name, latitude and biovar
      group_by(name, latitude, biovar) |>
      # Average values by latitude
      summarise(mean = mean(val, na.rm = TRUE),
                .groups = "drop") |>
      # Group by name and biovar
      group_by(name, biovar) |>
      # Calculate breadth
      summarise(breadth = max(mean)-min(mean))
    
    # A tibble: 6 × 3
    # Groups:   name [3]
    #  name      biovar breadth
    #  <chr>     <chr>    <dbl>
    #1 Species A 01        11.5
    #2 Species A 12       459. 
    #3 Species B 01        11.1
    #4 Species B 12       900. 
    #5 Species C 01        10.0
    #6 Species C 12       330.
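
To get the one-row-per-species layout shown in the question, the long result can be pivoted back to wide. The following is a minimal sketch on made-up toy data, not the asker's dataset: `toy`, `min_sites`, `site_mean`, and `breadth_wide` are names introduced here for illustration, and the threshold of 2 is an assumption chosen so that the toy data returns rows (the question asks for 5). It also folds the species filter into the same pipeline instead of building a separate vector of names.

```r
library(dplyr)
library(tidyr)

# Toy data in the same wide layout as the question (values are made up)
toy <- data.frame(
  name     = c("Species A", "Species A", "Species A",
               "Species B", "Species B", "Species C"),
  latitude = c(-27.6, -28.7, -28.7, -28.2, 23.0, 16.25),
  bio01    = c(18.5, 17.0, 17.2, 17.5, 21.3, 28.6),
  bio12    = c(467, 487, 489, 398, 73, 138)
)

# Hypothetical threshold for this sketch; the question asks for 5
min_sites <- 2

breadth_wide <- toy |>
  # One row per species, latitude, and bio variable
  pivot_longer(starts_with("bio"), names_to = "biovar", values_to = "val") |>
  # Average values that share a latitude
  group_by(name, biovar, latitude) |>
  summarise(site_mean = mean(val, na.rm = TRUE), .groups = "drop") |>
  # After averaging, each group has one row per latitude
  group_by(name, biovar) |>
  filter(n() >= min_sites) |>
  # Breadth = range of the per-latitude means
  summarise(breadth = max(site_mean) - min(site_mean), .groups = "drop") |>
  # Back to one row per species, one breadth column per bio variable
  pivot_wider(names_from = biovar, values_from = breadth,
              names_glue = "{biovar}_breadth")

print(breadth_wide)
```

Because each species and variable group holds one row per latitude after the first `summarise()`, `n()` counts distinct latitudes directly. Like the answer above, this treats latitude as a stand-in for a unique location; if uniqueness should be judged on the bioclimatic values themselves, the grouping would need to change accordingly.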