r dataframe dplyr subset

How do I subset values squished together (by rows) in a for loop and isolate values not squished together


I have an ordered dataframe with a categorical variable (specifically, bird families) in one of the columns. Because the rows are ordered in a specific manner, with one row per species, some bird families have all their species squished together in the dataframe, while others are interrupted by species of other families. What I am trying to do is subset each chunk (consecutive rows of the same family) one by one as they appear, in a for loop, so that I can do some further processing before moving on to the next chunk. I cannot use unique() on the family column, because that ignores the interruptions and puts all rows of an interrupted family into a single subset. Here's an example subset of the dataset:

structure(list(jetzspp = c("Acanthisitta_chloris", "Xenicus_gilviventris", 
"Ampelioides_tschudii", "Pipreola_aureopectus", "Pipreola_chlorolepidota", 
"Xenopipo_holochlora", "Xenopipo_uniformis", "Tityra_cayana", 
"Tityra_inquisitor", "Tachuris_rubrigastra", "Conopias_parvus"
), iocorder = c("Passeriformes", "Passeriformes", "Passeriformes", 
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes", 
"Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes"
), bird_family = c("Acanthisittidae", "Acanthisittidae", "Cotingidae", 
"Cotingidae", "Cotingidae", "Pipridae", "Pipridae", "Cotingidae", 
"Cotingidae", "Tyrannidae", "Tyrannidae")), row.names = c(NA, 
-11L), class = c("tbl_df", "tbl", "data.frame"))

Here I want to subset Acanthisittidae, then Cotingidae (only the three rows squished together), then Pipridae, then Cotingidae again (the next two rows squished together), and so on.

Further, I want to isolate the names of the bird families that have an interrupted appearance in the dataset.

I have not found a function that tracks transitions yet. Making the name of each interrupted family chunk unique and then looping through them might be possible, but it's not efficient for a dataset of 3k+ rows with many interruptions, as I would have to change the names back afterwards.


Solution

  • This may help with part of your question. The consecutive_id() function from dplyr (available from dplyr 1.1.0), used inside mutate(), gives you a new column in which each consecutive run of bird_family has a unique id. You can then loop over those ids. If you provide more information about your final desired output, it may be possible to do this without a loop. Either way, here's a solution to get you started.

    Update based on comment from OP:

    To return an alternating binary identifier, apply the modulo operator %% to the values created by consecutive_id(). I have created two columns for illustrative purposes. If you only need a single id column, name the new column id instead of binary_id and the run ids in id will be overwritten:

    library(dplyr)
    
    df <- structure(list(jetzspp = c("Acanthisitta_chloris", "Xenicus_gilviventris", 
    "Ampelioides_tschudii", "Pipreola_aureopectus", "Pipreola_chlorolepidota", 
    "Xenopipo_holochlora", "Xenopipo_uniformis", "Tityra_cayana", 
    "Tityra_inquisitor", "Tachuris_rubrigastra", "Conopias_parvus"
    ), iocorder = c("Passeriformes", "Passeriformes", "Passeriformes", 
    "Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes", 
    "Passeriformes", "Passeriformes", "Passeriformes", "Passeriformes"
    ), bird_family = c("Acanthisittidae", "Acanthisittidae", "Cotingidae", 
    "Cotingidae", "Cotingidae", "Pipridae", "Pipridae", "Cotingidae", 
    "Cotingidae", "Tyrannidae", "Tyrannidae")), row.names = c(NA, 
    -11L), class = c("tbl_df", "tbl", "data.frame"))
    
    # Add unique id and binary id to each 'run' of values
    df <- df %>%  
      mutate(id = consecutive_id(bird_family),
             binary_id = ifelse(id %% 2 == 0, 1, 0)) %>%
      data.frame()
    
    df
                       jetzspp      iocorder     bird_family id binary_id
    1     Acanthisitta_chloris Passeriformes Acanthisittidae  1         0
    2     Xenicus_gilviventris Passeriformes Acanthisittidae  1         0
    3     Ampelioides_tschudii Passeriformes      Cotingidae  2         1
    4     Pipreola_aureopectus Passeriformes      Cotingidae  2         1
    5  Pipreola_chlorolepidota Passeriformes      Cotingidae  2         1
    6      Xenopipo_holochlora Passeriformes        Pipridae  3         0
    7       Xenopipo_uniformis Passeriformes        Pipridae  3         0
    8            Tityra_cayana Passeriformes      Cotingidae  4         1
    9        Tityra_inquisitor Passeriformes      Cotingidae  4         1
    10    Tachuris_rubrigastra Passeriformes      Tyrannidae  5         0
    11         Conopias_parvus Passeriformes      Tyrannidae  5         0
    
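    # Loop over the run ids in order of appearance and handle one chunk at a time
    for (i in unique(df$id)) {
      chunk <- subset(df, id == i)
      print(chunk)  # replace print() with your own processing
    }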
    
                   jetzspp      iocorder     bird_family id binary_id
    1 Acanthisitta_chloris Passeriformes Acanthisittidae  1         0
    2 Xenicus_gilviventris Passeriformes Acanthisittidae  1         0
                      jetzspp      iocorder bird_family id binary_id
    3    Ampelioides_tschudii Passeriformes  Cotingidae  2         1
    4    Pipreola_aureopectus Passeriformes  Cotingidae  2         1
    5 Pipreola_chlorolepidota Passeriformes  Cotingidae  2         1
                  jetzspp      iocorder bird_family id binary_id
    6 Xenopipo_holochlora Passeriformes    Pipridae  3         0
    7  Xenopipo_uniformis Passeriformes    Pipridae  3         0
                jetzspp      iocorder bird_family id binary_id
    8     Tityra_cayana Passeriformes  Cotingidae  4         1
    9 Tityra_inquisitor Passeriformes  Cotingidae  4         1
                    jetzspp      iocorder bird_family id binary_id
    10 Tachuris_rubrigastra Passeriformes  Tyrannidae  5         0
    11      Conopias_parvus Passeriformes  Tyrannidae  5         0
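
    For the second part of the question (the names of the families whose appearance is interrupted), one possible approach is to count how many distinct runs each family has: any family with more than one run is interrupted. The following is a minimal sketch that rebuilds only the bird_family column from the example. The names n_runs and interrupted are made up for illustration, and consecutive_id() again assumes dplyr 1.1.0 or later:

    ```r
    library(dplyr)

    # Stand-in for the example data: only the family column matters here
    df <- data.frame(
      bird_family = c("Acanthisittidae", "Acanthisittidae", "Cotingidae",
                      "Cotingidae", "Cotingidae", "Pipridae", "Pipridae",
                      "Cotingidae", "Cotingidae", "Tyrannidae", "Tyrannidae")
    )

    interrupted <- df %>%
      mutate(id = consecutive_id(bird_family)) %>%  # one id per consecutive run
      distinct(bird_family, id) %>%                 # one row per (family, run) pair
      count(bird_family, name = "n_runs") %>%       # number of runs per family
      filter(n_runs > 1) %>%                        # keep families that appear in 2+ runs
      pull(bird_family)

    interrupted
    # [1] "Cotingidae"
    ```

    If a list of chunks is more convenient than a loop, the same id column can also be passed to base R's split(), e.g. split(df, df$id), which returns one data frame per run.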