I have a large dataset (df), which looks something like this (simplified to not include all bioclimatic variables, but just imagine six columns rather than two beginning with "bio"):
        name longitude  latitude time_bp      bio01    bio12
 1 Species A 24.633301 -27.61670       0 18.5594425 467.2066
 2 Species A 25.549999 -28.66670       0 17.1183109 487.5667
 3 Species A 29.083300  25.80000       0 21.7980595  28.1200
 4 Species B 23.033300 -28.20560       0 17.4872379 398.5200
 5 Species B 25.500000  23.00000       0 21.3069401  72.9600
 6 Species C 24.633301 -27.61670       0 18.5594425 467.2066
 7 Species B 29.600000  -0.13500       0 23.3799686 973.3167
 8 Species C 33.750000  16.25000       0 28.5797062 137.5567
 9 Species A 33.750000  16.25000       0 28.5797062 137.5567
10 Species C 33.750000  16.25000       0 28.5797062 137.5567
11 Species D 33.750000  16.25000       0 28.5797062 137.5567
12 Species D 33.750000  16.25000       0 28.5797062 137.5567
13 Species A 33.750000  16.25000       0 28.5797062 137.5567
14 Species B 33.750000  16.25000       0 28.5797062 137.5567
15 Species E 33.750000  16.25000       0 28.5797062 137.5567
The real data contain multiple time stamps, species, and locations. I am interested in measuring climatic niche breadth. Because the temporal resolution of my data is coarser than that of my climatic data, I am averaging the climatic data over multiple time stamps. Most of the time, if there is data for one time stamp, there is data for all of them, but that is not always the case. There are also several cases where a species has no data at all, or none for most time stamps, or where multiple points fall within the same grid cell and thus have the same bioclimatic information. I am excluding species that do not have at least five points with unique bioclimatic data. Thus far I have been going through each species manually to check whether there are five unique points, and then doing the following:
# Split the full data set by species, then split one species by latitude
Splitdf <- split(df, df$name)
SpeciesADat <- Splitdf$`Species A`
SpeciesADat <- split(SpeciesADat, SpeciesADat$latitude)
# Mean bio01 at each latitude, then breadth as max - min
SpeciesADatBio01List <- c(mean(SpeciesADat$`42.611382`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.457764`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.450432`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.223461`$bio01, na.rm = TRUE),
                          mean(SpeciesADat$`-44.185169`$bio01, na.rm = TRUE))
SpeciesADatBio01List <- na.omit(SpeciesADatBio01List)
SpeciesADatB01Breadth <- max(SpeciesADatBio01List) - min(SpeciesADatBio01List)
# Same for bio12
SpeciesADatBi012List <- c(mean(SpeciesADat$`42.611382`$bio12, na.rm = TRUE),
                          mean(SpeciesADat$`-44.457764`$bio12, na.rm = TRUE),
                          mean(SpeciesADat$`-44.450432`$bio12, na.rm = TRUE),
                          mean(SpeciesADat$`-44.223461`$bio12, na.rm = TRUE),
                          mean(SpeciesADat$`-44.185169`$bio12, na.rm = TRUE))
SpeciesADatBi012List <- na.omit(SpeciesADatBi012List)
SpeciesADatB012Breadth <- max(SpeciesADatBi012List) - min(SpeciesADatBi012List)
# One-row summary for this species
SpeciesAData <- data.frame(Bio01Breadth = c(SpeciesADatB01Breadth),
                           Bio12Breadth = c(SpeciesADatB012Breadth),
                           Species = c("Species A"))
With the plan of merging the various SpeciesXData data frames which I create for each species to have a final product that looks like this:
  Bio01Breadth Bio12Breadth   Species
1      32.9588     1912.312 Species A
2     3.878775     248.6758 Species B
3     29.51849     840.4629 Species C
Is there a way to automate this process, so I can go through all species with one loop while also cleaning the data in that loop?
You can use the tidyverse to get the desired output. The main steps are: 1) transform your data into a long format, 2) identify the species that have at least x different latitudes, and 3) average the values from the same latitude and then calculate max - min across latitudes for each species and variable.
library(dplyr)
library(tidyr)
df <- df |>
pivot_longer(cols = starts_with("bio"),
names_prefix = "bio",
names_to = "biovar",
values_to = "val")
check5 <- df |>
# Group by species latitude and bio variable
group_by(name, latitude, biovar) |>
# Get first value just to count number of different latitudes
slice_head(n = 1) |>
ungroup() |>
group_by(name, biovar) |>
# count
count() |>
# Keep species with at least x different latitudes
# (x = 2 here to suit the sample data; use 5 for the real data)
filter(n >= 2) |>
# Get species' names
pull(name) |>
unique()
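The check above counts distinct latitudes, whereas the question asks for points with unique bioclimatic data. As a rough sketch of that alternative criterion, the snippet below counts distinct combinations of the `bio*` columns per species on data still in the wide layout. The data frame `toy`, the threshold `min_points`, and the result `keep` are hypothetical names invented for illustration, and the values are made up.

```r
library(dplyr)

# Hypothetical toy data in the wide layout used in the question
toy <- data.frame(
  name     = c("Species A", "Species A", "Species A", "Species B", "Species B"),
  latitude = c(-27.6, -28.7, 16.25, 16.25, 16.25),
  bio01    = c(18.6, 17.1, 28.6, 28.6, 28.6),
  bio12    = c(467.2, 487.6, 137.6, 137.6, 137.6)
)

min_points <- 2  # assumed threshold for the toy data; the question uses 5

keep <- toy |>
  select(name, starts_with("bio")) |>  # species plus climate columns only
  na.omit() |>                         # drop rows with missing climate values
  distinct() |>                        # one row per unique climate combination
  count(name) |>                       # number of unique combinations per species
  filter(n >= min_points) |>
  pull(name)

keep  # only Species A has at least 2 unique climate points in the toy data
```

Whether latitude or the climate values themselves are the better key depends on how the grid cells map to coordinates in the real data, so treat this as one possible reading of the criterion.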
df |>
# Remove species that do not meet the minimum number of different latitudes
filter(name %in% check5) |>
# Group by name, latitude and biovar
group_by(name, latitude, biovar) |>
# Average values by latitude
summarise(mean = mean(val),
.groups = "drop") |>
# Group by name and biovar
group_by(name, biovar) |>
# Calculate breadth
summarise(breadth = max(mean)-min(mean))
# A tibble: 6 × 3
# Groups: name [3]
# name biovar breadth
# <chr> <chr> <dbl>
#1 Species A 01 11.5
#2 Species A 12 459.
#3 Species B 01 11.1
#4 Species B 12 900.
#5 Species C 01 10.0
#6 Species C 12 330.
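The result above is in long format, while the question asks for one row per species with a column per variable. A minimal sketch of that reshaping step, assuming the summarised result is stored in an object (here the hypothetical `breadth_long`, filled with rounded values from the output above for two species):

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format result shaped like the tibble above
breadth_long <- tibble(
  name    = c("Species A", "Species A", "Species B", "Species B"),
  biovar  = c("01", "12", "01", "12"),
  breadth = c(11.5, 459, 11.1, 900)
)

breadth_wide <- breadth_long |>
  # One column per bioclimatic variable, named like Bio01Breadth
  pivot_wider(names_from = biovar,
              values_from = breadth,
              names_glue = "Bio{biovar}Breadth") |>
  # Match the column name and order used in the question
  rename(Species = name) |>
  relocate(Species, .after = last_col())

breadth_wide
```

This gives the columns `Bio01Breadth`, `Bio12Breadth`, and `Species`. If the summarised tibble is still grouped, calling `ungroup()` before `pivot_wider()` keeps the result free of leftover grouping.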