Search code examples
rdataframedplyrdata-wranglingacoustics

Data wrangling problem with labelled sound files


Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start" 
names(sound_df)[3] <- "sound_end"
print(sound_df)

    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

Solution

  • Sounds like a case for a data.table non-equi join.

    Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

    library(data.table)
    setDT(sound_df)
    sound_df[, `:=`(
        sound_start_for_join = as.numeric(sound_start),
        sound_end_for_join = as.numeric(sound_end)
    )]
    

    Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

    CHUNK_SECONDS <- 3
    FILE_SECONDS <- 15
    time_windows <- CJ(
        soundfile = sound_df$soundfile,
        start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
        unique = TRUE
    )[, end := start + CHUNK_SECONDS][
        ,
        `:=`(
            start_for_join = start,
            end_for_join = end
        )
    ]
    
    time_windows
    
    #       soundfile start   end start_for_join end_for_join
    #          <char> <num> <num>          <num>        <num>
    #  1: soundfile_1     0     3              0            3
    #  2: soundfile_1     3     6              3            6
    #  3: soundfile_1     6     9              6            9
    #  4: soundfile_1     9    12              9           12
    #  5: soundfile_1    12    15             12           15
    # <etc>
    

    Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

    sound_out <- sound_df[
        time_windows,
        on = .(
            soundfile,
            sound_start_for_join < end_for_join,
            sound_end_for_join > start_for_join
        )
    ][, .(
        soundfile, start, end,
        present = !is.na(sound_start)
    )]
    

    Which produces the desired output:

          soundfile start   end present
             <char> <num> <num>  <lgcl>
     1: soundfile_1     0     3    TRUE
     2: soundfile_1     3     6    TRUE
     3: soundfile_1     6     9    TRUE
     4: soundfile_1     9    12    TRUE
     5: soundfile_1    12    15    TRUE
     6: soundfile_2     0     3   FALSE
     7: soundfile_2     3     6    TRUE
     8: soundfile_2     6     9   FALSE
     9: soundfile_2     9    12    TRUE
    10: soundfile_2    12    15    TRUE
    11: soundfile_3     0     3   FALSE
    12: soundfile_3     3     6    TRUE
    13: soundfile_3     6     9   FALSE
    14: soundfile_3     9    12   FALSE
    15: soundfile_3    12    15    TRUE
    

    Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no" you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").