Search code examples
rdplyrsumsummarize

R Dplyr Summarize w/ Sum is not counting some rows. Why?


I am new to stackoverflow, but not new to googling the heck out of my R script issues. This one has me stumped.

I am trying to find out why my current script doesn't work:

SCRIPT: My example is a tibble (called "Squishframe") of 13 columns (11 of them to be treated as characters and unique variables; 2 of them to be numeric). I group by the character columns and sum all the other numeric columns using the summarize (and Sum) functions. This new tibble is called "Groupframe".

The main issue appears to be that:

  • when there is a row with unique variables AND numeric data,
  • shared with another row with the same unique variables AND BLANK numeric data,
  • the row with numeric data is ignored!

    #Create pipeline that first turns the Squishframe number columns into numeric type, and then groups the data by the Text-based columns
    Groupedframe <- Squishframe %>% 
      mutate_at(vars(AF, Billable), funs(gsub("\\$", "", .))) %>%
      mutate_at(vars(AF, Billable), funs(as.numeric)) %>%
      group_by_at(vars(`File Name`, Source, Date, Colour, Size, MediumSource, Campaign, Version, Signed, CountryCode, State)) %>%
    
    #Summarize all non-text-based columns by groups (sum function used)
     summarize_all(funs(sum))
    

If you would like a zip of my live files as an example, I have them, but I'm unsure how to provide them right now.

It is a little different from my above, as it imports from an Raw Data folder, and deposits the final dataset into the output folder.


Input: A tibble called "Squishframe" in the script above. I am also aware some of these columns are empty. This is intentional:

structure(list(`File Name` = c("Bricks.csv", "Bricks.csv", "Bricks.csv", 
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", 
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv"
), Source = c("Installations", "Installations", "Installations", 
"Installations", "Installations", "Installations", "Installations", 
"Installations", "Installations", "Installations", "Installations", 
"Installations", "Installations"), Date = structure(c(17934, 
17934, 17934, 17934, 17945, 17953, 17956, 17957, 17945, 17945, 
17945, 17945, 17945), class = "Date"), Colour = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Size = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MediumSource = c("Baliest", 
"Baliest", "Baliest", "Baliest", "Baliest", "Baliest", "Baliest", 
"Baliest", "Klimt", "Klimt", "Klimt", "Klimt", "Klimt"), Campaign = c("Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Klimt", "Big Campaign Klimt", 
"Big Campaign Klimt", "Big Campaign Klimt", "Big Campaign Klimt"
), Version = c("Creative_V2", "Creative_V1", "Creative_V3", "Creative_V2", 
"Creative_V1", "Creative_V3", "Creative_V2", "Creative_V1", "Creative_V1", 
"Creative_V1", "Creative_V1", "Creative_V1", "Creative_V1"), 
    Signed = c("Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS", 
    "Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS", "200289", 
    "200289", "200289", "200289", "200289"), CountryCode = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), State = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), AF = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Billable = c(NA, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA)), row.names = c(NA, 
-13L), class = c("tbl_df", "tbl", "data.frame"))

EXPECTED RESULT:

My script tries to compress data. It should consolidate all numeric data into rows defined by ONLY unique variables. In other words, if a row has unique variables, that row should store all the numeric data, instead of its original state where there are multiple rows with the same variables. I should get the same numeric column totals, just with less rows.

I should get a resulting dataframe with a sum of 8 in the "Billable" column. And no duplicates.

File Name   Source  Date    Colour  Size    MediumSource    Campaign    Version Signed  CountryCode State   AF  Billable
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Klimt   Big Campaign Klimt  Creative_V1 200289              1
Bricks.csv  Installations   2/26/2019           Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   3/1/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   3/2/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1

CURRENT RESULT: For the most part, it is successful. But I have a case where the script that happily sums all other rows, ignores some. This means that some numeric data is overlooked completely, despite sharing variables with another row. This results in data loss. I get less rows, but the column totals are less. Blanks appear in the "Billable" column where there should be data for those unique combination of variables. I am not trying to remove these blanks, these blanks should be 1s instead:

File Name   Source  Date    Colour  Size    MediumSource    Campaign    Version Signed  CountryCode State   AF  Billable
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Klimt   Big Campaign Klimt  Creative_V1 200289              
Bricks.csv  Installations   2/26/2019           Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   3/1/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   3/2/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1

To me, it seems that summing blank+1 in these cases = blank. It should be 1. The current billable column sums to 6 instead of 8 as it should be as afmorementioned.


Thank you all for your time! I am grateful on any feedback on my script and my methods of writing this post. If it can be bettered in any way, I am open to it!


Solution

  • The sum of Billable is 8 with either of these implementations. Since you have edited the question several times, you may want to consider deleting and asking a new one.

    library(tidyverse)
    
    res1 <- df %>% 
      mutate_at(vars(AF, Billable), ~ gsub("\\$", "", .)) %>% 
      mutate_at(vars(AF, Billable), ~ as.numeric(.)) %>%
      group_by_at(vars(`File Name`:State)) %>%
      summarize_all(~ sum(., na.rm = TRUE))
    res1
    #> # A tibble: 8 x 13
    #> # Groups:   File Name, Source, Date, Colour, Size, MediumSource, Campaign,
    #> #   Version, Signed, CountryCode [8]
    #>   `File Name` Source Date       Colour Size  MediumSource Campaign Version
    #>   <chr>       <chr>  <date>     <lgl>  <lgl> <chr>        <chr>    <chr>  
    #> 1 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 2 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 3 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 4 Bricks.csv  Insta~ 2019-02-18 NA     NA    Baliest      Big Cam~ Creati~
    #> 5 Bricks.csv  Insta~ 2019-02-18 NA     NA    Klimt        Big Cam~ Creati~
    #> 6 Bricks.csv  Insta~ 2019-02-26 NA     NA    Baliest      Big Cam~ Creati~
    #> 7 Bricks.csv  Insta~ 2019-03-01 NA     NA    Baliest      Big Cam~ Creati~
    #> 8 Bricks.csv  Insta~ 2019-03-02 NA     NA    Baliest      Big Cam~ Creati~
    #> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
    #> #   AF <dbl>, Billable <dbl>
    sum(res1$Billable) == 8
    #> [1] TRUE
    
    res2 <- df %>%
      distinct() %>%
      filter(! is.na(Billable))
    res2
    #> # A tibble: 8 x 13
    #>   `File Name` Source Date       Colour Size  MediumSource Campaign Version
    #>   <chr>       <chr>  <date>     <lgl>  <lgl> <chr>        <chr>    <chr>  
    #> 1 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 2 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 3 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
    #> 4 Bricks.csv  Insta~ 2019-02-18 NA     NA    Baliest      Big Cam~ Creati~
    #> 5 Bricks.csv  Insta~ 2019-02-26 NA     NA    Baliest      Big Cam~ Creati~
    #> 6 Bricks.csv  Insta~ 2019-03-01 NA     NA    Baliest      Big Cam~ Creati~
    #> 7 Bricks.csv  Insta~ 2019-03-02 NA     NA    Baliest      Big Cam~ Creati~
    #> 8 Bricks.csv  Insta~ 2019-02-18 NA     NA    Klimt        Big Cam~ Creati~
    #> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
    #> #   AF <lgl>, Billable <int>
    sum(res2$Billable) == 8
    #> [1] TRUE