I have an ordered data frame with a categorical variable (bird families) in one of its columns. The rows are in a specific order, one row per species, so some families have all of their species in consecutive rows while others are interrupted by species of other families. I want to subset each chunk of consecutive rows from the same family, one at a time and in the order they appear, inside a for loop, so that I can do some further processing on it before moving on to the next chunk. I cannot use unique() for this, because it ignores the interruptions and lumps all rows of a family into a single subset. Here's an example subset of the dataset:
structure(list(jetzspp = c("Acanthisitta_chloris", "Xenicus_gilviventris",
"Ampelioides_tschudii", "Pipreola_aureopectus", "Pipreola_chlorolepidota",
"Xenopipo_holochlora", "Xenopipo_uniformis", "Tityra_cayana",
"Tityra_inquisitor", "Tachuris_rubrigastra", "Conopias_parvus"
), iocorder = c("Passeriformes", "Passeriformes", "Passeriformes",
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes",
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes"
), bird_family = c("Acanthisittidae", "Acanthisittidae", "Cotingidae",
"Cotingidae", "Cotingidae", "Pipridae", "Pipridae", "Cotingidae",
"Cotingidae", "Tyrannidae", "Tyrannidae")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
Here I want to subset Acanthisittidae, then Cotingidae (only the first three consecutive rows), then Pipridae, then Cotingidae again (the next two consecutive rows), and so on.
Further, I want to isolate the names of the bird families that have an interrupted appearance in the dataset.
I have not found a function to track transitions yet. Making each interrupted family chunk's name unique and then looping through the names might work, but it's not efficient for a data frame with 3k+ rows and many interruptions, as I would have to change the names back afterwards.
This may help with part of your question. The consecutive_id() and mutate() functions from dplyr will give you a new column in which each run of consecutive identical values in bird_family has a unique id. Then, you can use those values for your loop. If you provide more information about your final desired output, it may be possible to do this without using a loop. Either way, here's a solution to get you started.
Update based on comment from OP:
To return an alternating binary identifier, you can apply the modulo operator %% to the values created previously by the consecutive_id() function. I have created two columns for illustrative purposes. If you only require a single id column, assign the result to id instead of binary_id and the values in id will be overwritten:
library(dplyr)
df <- structure(list(jetzspp = c("Acanthisitta_chloris", "Xenicus_gilviventris",
"Ampelioides_tschudii", "Pipreola_aureopectus", "Pipreola_chlorolepidota",
"Xenopipo_holochlora", "Xenopipo_uniformis", "Tityra_cayana",
"Tityra_inquisitor", "Tachuris_rubrigastra", "Conopias_parvus"
), iocorder = c("Passeriformes", "Passeriformes", "Passeriformes",
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes",
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes"
), bird_family = c("Acanthisittidae", "Acanthisittidae", "Cotingidae",
"Cotingidae", "Cotingidae", "Pipridae", "Pipridae", "Cotingidae",
"Cotingidae", "Tyrannidae", "Tyrannidae")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
# Add a unique id and an alternating binary id to each 'run' of values
df <- df %>%
  mutate(id = consecutive_id(bird_family),
         binary_id = ifelse(id %% 2 == 0, 1, 0)) %>%
  data.frame()
df
                   jetzspp      iocorder     bird_family id binary_id
1     Acanthisitta_chloris Passeriformes Acanthisittidae  1         0
2     Xenicus_gilviventris Passeriformes Acanthisittidae  1         0
3     Ampelioides_tschudii Passeriformes      Cotingidae  2         1
4     Pipreola_aureopectus Passeriformes      Cotingidae  2         1
5  Pipreola_chlorolepidota Passeriformes      Cotingidae  2         1
6      Xenopipo_holochlora Passeriformes        Pipridae  3         0
7       Xenopipo_uniformis Passeriformes        Pipridae  3         0
8            Tityra_cayana Passeriformes      Cotingidae  4         1
9        Tityra_inquisitor Passeriformes      Cotingidae  4         1
10    Tachuris_rubrigastra Passeriformes      Tyrannidae  5         0
11         Conopias_parvus Passeriformes      Tyrannidae  5         0
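If consecutive_id() is not available (it was added in dplyr 1.1.0), the same run ids can be sketched in base R with rle(). This is only an illustration on the bird_family values from the example; the names runs and run_id are placeholders of mine, not part of the solution above.

```r
# Family column from the example data, in row order
bird_family <- c("Acanthisittidae", "Acanthisittidae", "Cotingidae",
                 "Cotingidae", "Cotingidae", "Pipridae", "Pipridae",
                 "Cotingidae", "Cotingidae", "Tyrannidae", "Tyrannidae")

# rle() collapses the vector into runs of identical consecutive values
runs <- rle(bird_family)

# Repeat each run's index by the run's length to get one id per row
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# run_id is 1 1 2 2 2 3 3 4 4 5 5, matching the id column above
print(run_id)
```

Assigning run_id to a column of the data frame should then let the rest of the approach work unchanged.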
# Process the runs one at a time, in order of appearance
for (i in unique(df$id)) {
  print(subset(df, id == i))
}
               jetzspp      iocorder     bird_family id binary_id
1 Acanthisitta_chloris Passeriformes Acanthisittidae  1         0
2 Xenicus_gilviventris Passeriformes Acanthisittidae  1         0
                  jetzspp      iocorder bird_family id binary_id
3    Ampelioides_tschudii Passeriformes  Cotingidae  2         1
4    Pipreola_aureopectus Passeriformes  Cotingidae  2         1
5 Pipreola_chlorolepidota Passeriformes  Cotingidae  2         1
              jetzspp      iocorder bird_family id binary_id
6 Xenopipo_holochlora Passeriformes    Pipridae  3         0
7  Xenopipo_uniformis Passeriformes    Pipridae  3         0
            jetzspp      iocorder bird_family id binary_id
8     Tityra_cayana Passeriformes  Cotingidae  4         1
9 Tityra_inquisitor Passeriformes  Cotingidae  4         1
                jetzspp      iocorder bird_family id binary_id
10 Tachuris_rubrigastra Passeriformes  Tyrannidae  5         0
11      Conopias_parvus Passeriformes  Tyrannidae  5         0
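For the second part of the question (the names of the families with an interrupted appearance), one possible approach is to count how many separate runs each family has: any family with more than one run is interrupted. Below is a minimal base R sketch on the example values; run_counts and interrupted are names I made up for illustration.

```r
# Family column from the example data, in row order
bird_family <- c("Acanthisittidae", "Acanthisittidae", "Cotingidae",
                 "Cotingidae", "Cotingidae", "Pipridae", "Pipridae",
                 "Cotingidae", "Cotingidae", "Tyrannidae", "Tyrannidae")

# One entry per run of consecutive identical values
runs <- rle(bird_family)

# Number of separate runs in which each family appears
run_counts <- table(runs$values)

# Families that appear in more than one run are interrupted
interrupted <- names(run_counts)[run_counts > 1]
print(interrupted)
```

On the example data this gives "Cotingidae". If you already have the id column, something like df %>% distinct(bird_family, id) %>% count(bird_family) %>% filter(n > 1) should give the same families in dplyr.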