Search code examples
rif-statementdata-manipulationdata-cleaning

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order


This is my first StackOverflow post, so I hope that it isn't too difficult to understand.

I have a large dataset (~14,000) rows of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3 minutes. Within each point-count a new bird observation becomes a new row, so that there are many repeated dates, times, sites, and point (specific location within a site). Again, each point count is 3 minutes long. So if you see a yellow warbler (coded as YEWA) during minute 1, then it will be associated with MINUTE=1 for that specific point count (date, site, point, and time). ID=observer intials and Number=number of birds spotted (not necessarily important here).

However, if NO BIRDS are seen, then a "NOBI" goes into the dataset for that specific minute. Thus, if there are NOBI for an entire 3 minute point count, their will be three rows with the same date, site, point, and time, and "NOBI" in the "BIRD" column for each of the three rows.

So I have TWO main problems. The first is that some observers entered "NOBI" once for all three minutes, instead of three times (once per minute). Anywhere where "MINUTE" has been left blank (becoming NA), AND "BIRD"="NOBI", I need to add three rows of data, all with the same values for all columns except "MINUTE", which should be 1, 2, and 3 for the respective rows.

If it looks like this:

   ID     DATE SITE POINT TIME MINUTE BIRD NUMBER
1  BS 5/9/2018  CW2  U125 7:51     NA NOBI     NA
2  BS 5/9/2018  CW1  D250 8:12      1 YEWA     2
3  BS 5/9/2018  CW1  D250 8:12      2 NOBI     NA
4  BS 5/9/2018  CW1  D250 8:12      3 LABU     1

It should look like this instead:

   ID     DATE SITE POINT TIME MINUTE BIRD NUMBER
1  BS 5/9/2018  CW2  U125 7:51      1 NOBI     NA
2  BS 5/9/2018  CW2  U125 7:51      2 NOBI     NA
3  BS 5/9/2018  CW2  U125 7:51      3 NOBI     NA
4  BS 5/9/2018  CW1  D250 8:12      1 YEWA     2
5  BS 5/9/2018  CW1  D250 8:12      2 NOBI     NA
6  BS 5/9/2018  CW1  D250 8:12      3 LABU     1

note: If you are wanting to enter some of this data into your R console, I included some at the end of this post using dput, which should be easier to enter than copy-and-pasting the above

I have made failed attempts at reproducing if statements with multiple conditions (based on: R multiple conditions in if statement & Ifelse in R with multiple categorical conditions) I tried writing this many ways, including with piping from dplyr, but see below for one example of some code, notes, and error messages.

>if(PC$BIRD == "NOBI" & PC$MINUTE==NA){PC$Fix<-TRUE}
 Error in if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { : 
   missing value where TRUE/FALSE needed
 In addition: Warning message:
 In if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
   the condition has length > 1 and only the first element will be used

## Then I need to do something like this:
>if(PC$Fix<-TRUE){duplicate(row where Fix==TRUE, times=2)} #I know this isn't 
    ### even close, but I want the row to be replicated two more times so 
    ### that there are 3 total rows witht he same values
    ### Fix indicates that a fix is needed in this example
# Then somehow I need to assign a 1 to PC$MINUTE for the first row (original row), 
# a 2 to the next row (with other values from other columns being the same), and a 3 
# to the last duplicated row (still other values from other columns being the same)

The second problem, which seems more difficult to me is to search the dataset in order or perhaps by DATE,SITE,POINT, and TIME in some way. The minute values should always go from 1... to 2... to 3, and then back to 1 for the next set of date, time, site, and point. That is, each point count should have all values 1:3. However, one count may have multiple sightings in MINUTE=1 so that there are 5 or 6 (or 20) MINUTE=1 before MINUTE=2. BUT, again, some observers in this dataset simply left a row out when there was NO BIRDS (NOBI), instead of writing a row with BIRD="NOBI" for each MINUTE. That is if the dataset goes:

   ID     DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4  BS 5/9/2018  CW2  U125 7:54      1 AMRO      1
5  BS 5/9/2018  CW2  U125 7:54      1 SPTO      1
6  BS 5/9/2018  CW2  U125 7:57      1 AMRO      1
7  BS 5/9/2018  CW2  U125 7:57      1 SPTO      1
8  BS 5/9/2018  CW2  U125 7:57      1 AMCR      3
9  BS 5/9/2018  CW2  U125 7:57      2 SPTO      1
10 BS 5/9/2018  CW2  U125 7:57      2 HOWR      1
11 BS 5/9/2018  CW2  U125 7:57      3 UNBI      1

We can see that the 7:57 point count time is complete (there are MINUTE values of 1:3). However, the 7:54 point count time stops at MINUTE=1. Meaning, I need to enter two more rows underneath that have all of the same DATE,SITE,POINT,TIME information, but with MINUTE=2 and BIRD="NOBI" for the first added row and MINUTE=3 and BIRD="NOBI" for the second added row. So it should look like this:

   ID     DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4  BS 5/9/2018  CW2  U125 7:54      1 AMRO      1
5  BS 5/9/2018  CW2  U125 7:54      1 SPTO      1
6  BS 5/9/2018  CW2  U125 7:54      2 NOBI      NA
7  BS 5/9/2018  CW2  U125 7:54      3 NOBI      NA
8  BS 5/9/2018  CW2  U125 7:57      1 AMRO      1
9  BS 5/9/2018  CW2  U125 7:57      1 SPTO      1
10 BS 5/9/2018  CW2  U125 7:57      1 AMCR      3
11 BS 5/9/2018  CW2  U125 7:57      2 SPTO      1
12 BS 5/9/2018  CW2  U125 7:57      2 HOWR      1
13 BS 5/9/2018  CW2  U125 7:57      3 UNBI      1

Lastly, I understand that this is a long and complicated question, and I may not have articulated it very well. Please let me know if there is any clarification needed, and I would be happy to hear any advice, even if it doesn't fully answer my problems. Thank you in advance!


Everything below this line is only useful for you if you want to enter a sample of my data into R


To enter my data into R console, copy and paste everything from "structure" function to end of code to enter it as dataframe in R console with code: dataframe<-structure(list... See Example of using dput() for help.

PC<-read.csv("PC.csv") ### ORIGINAL FILE
dput(PC)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BS", class = "factor"), 
DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"), 
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CW2", class = "factor"), 
POINT = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", "U125"), class = "factor"), 
TIME = structure(c(8L, 8L, 8L, 9L, 9L, 10L, 10L, 10L, 10L, 
10L, 10L, 11L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 
4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 7L), .Label = c("6:48", "6:51", 
"6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54", "7:57", 
"8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, NA, NA), BIRD = structure(c(6L, 
6L, 6L, 2L, 7L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L, 6L, 6L, 
6L, 6L, 7L, 7L, 7L, 7L, 6L, 8L, 3L, 7L, 9L, 5L, 4L, 2L, 6L, 
6L), .Label = c("AMCR", "AMRO", "BRSP", "DUFL", "HOWR", "NOBI", 
"SPTO", "UNBI", "VESP"), class = "factor"), NUMBER = c(NA, 
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, 
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 
NA)), class = "data.frame", row.names = c(NA, -32L))


PCc<-read.csv("PC_Corrected.csv")  #### WHAT I NEED MY DATABASE TO LOOK LIKE
dput(PCc)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "BS", class = "factor"), DATE = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"), 
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "CW2", class = "factor"), POINT = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", 
"U125"), class = "factor"), TIME = structure(c(8L, 8L, 8L, 
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("6:48", 
"6:51", "6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54", 
"7:57", "8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L, 
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 
2L, 3L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), BIRD = structure(c(6L, 
6L, 6L, 2L, 7L, 6L, 6L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 6L, 6L, 7L, 7L, 6L, 8L, 3L, 
7L, 9L, 5L, 4L, 2L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("AMCR", 
"AMRO", "BRSP", "DUFL", "HOWR", "NOBI", "SPTO", "UNBI", "VESP"
), class = "factor"), NUMBER = c(NA, NA, NA, 1L, 1L, NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, 1L, 1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-42L))

Solution

  • Here's a way to do it using dplyr and tidyr from the tidyverse meta-package.

    # Step one - identify missing rows.
    #    For each DATE, SITE, POINT, TIME, count how many of each minute 
    library(tidyverse)
    
    # Convert factors to character to make later joining simpler, 
    #   and fix missing ID's by assuming prior line should be used,
    #   and make NOBI rows have a count of NA
    PC_2_clean <- PC %>%
      mutate_if(is.factor, as.character) %>%
      fill(ID, .direction = "up") %>%
      mutate(NUMBER = if_else(BIRD == "NOBI", NA_integer_, NUMBER))
    
    
    # Create a wide table with spots for each minute. Missing will
    #   show up as NA's
    # All the NA's here in the 1, 2, and 3 columns represent 
    #   missing minutes that we should add.
    PC_3_NA_find <- PC_2_clean %>%
      count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
      spread(MINUTE, n)
    
    PC_3_NA_find
    # A tibble: 11 x 9
    # ID    DATE     SITE  POINT TIME    `1`   `2`   `3` `<NA>`
    # <chr> <chr>    <chr> <chr> <chr> <int> <int> <int>  <int>
    #   1 BS    5/9/2018 CW2   M     7:12      3     1     2     NA
    # 2 BS    5/9/2018 CW2   M     7:15     NA    NA    NA      1
    # 3 BS    5/9/2018 CW2   M     7:18     NA    NA    NA      1
    # 4 BS    5/9/2018 CW2   U125  6:48      1     1     1     NA
    # 5 BS    5/9/2018 CW2   U125  6:51      1     1     1     NA
    # 6 BS    5/9/2018 CW2   U125  6:54      2    NA    NA     NA
    # 7 BS    5/9/2018 CW2   U125  6:57      2     1     1     NA
    # 8 BS    5/9/2018 CW2   U125  7:51      1     1     1     NA
    # 9 BS    5/9/2018 CW2   U125  7:54      2    NA    NA     NA
    # 10 BS    5/9/2018 CW2   U125  7:57      3     2     1     NA
    # 11 BS    5/9/2018 CW2   U125  8:00      1    NA    NA     NA
    
    
    # Take the NA minute entries and make the desired line for each
    PC_4_rows_to_add <- PC_3_NA_find %>%
      gather(MINUTE, count, `1`:`3`) %>%
      filter(is.na(count)) %>%
      select(-count, -`<NA>`) %>%
    
      mutate(MINUTE = as.integer(MINUTE),
             BIRD = "NOBI",
             NUMBER = NA_integer_)
    
    
    # Add these lines to the original,  remove the NA minute rows 
    #   (these have been replaced with minute rows), and sort
    PC_5_with_NOBIs <- PC_2_clean %>%
      bind_rows(PC_4_rows_to_add) %>%
      filter(MINUTE != "NA") %>%
      arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
    
    
    # Check result
    PC_5_with_NOBIs  %>%
      count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
      spread(MINUTE, n)
    
    PC_5_with_NOBIs
    
    
    
    # Now to confirm it matches your desired output.
    #   Note, I convert to character to avoid mismatches between factors
    PCc_char <- PCc %>%
      mutate_if(is.factor, as.character) %>%
      arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
    
    identical(PC_5_with_NOBIs, PCc_char)
    # [1] TRUE