I am new to stackoverflow, but not new to googling the heck out of my R script issues. This one has me stumped.
I am trying to find out why my current script doesn't work:
SCRIPT: My example is a tibble (called "Squishframe") of 13 columns (11 of them to be treated as characters and unique variables; 2 of them to be numeric). I group by the character columns and sum all the other numeric columns using the summarize (and Sum) functions. This new tibble is called "Groupframe".
The main issue appears to be that:
the row with numeric data is ignored!
#Create pipeline that first turns the Squishframe number columns into numeric type, and then groups the data by the Text-based columns
Groupedframe <- Squishframe %>%
mutate_at(vars(AF, Billable), funs(gsub("\\$", "", .))) %>%
mutate_at(vars(AF, Billable), funs(as.numeric)) %>%
group_by_at(vars(`File Name`, Source, Date, Colour, Size, MediumSource, Campaign, Version, Signed, CountryCode, State)) %>%
#Summarize all non-text-based columns by groups (sum function used)
summarize_all(funs(sum))
If you would like a zip of my live files as an example, I have them, but I'm unsure how to provide them right now.
It is a little different from my above, as it imports from an Raw Data folder, and deposits the final dataset into the output folder.
Input: A tibble called "Squishframe" in the script above. I am also aware some of these columns are empty. This is intentional:
structure(list(`File Name` = c("Bricks.csv", "Bricks.csv", "Bricks.csv",
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv",
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv"
), Source = c("Installations", "Installations", "Installations",
"Installations", "Installations", "Installations", "Installations",
"Installations", "Installations", "Installations", "Installations",
"Installations", "Installations"), Date = structure(c(17934,
17934, 17934, 17934, 17945, 17953, 17956, 17957, 17945, 17945,
17945, 17945, 17945), class = "Date"), Colour = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Size = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MediumSource = c("Baliest",
"Baliest", "Baliest", "Baliest", "Baliest", "Baliest", "Baliest",
"Baliest", "Klimt", "Klimt", "Klimt", "Klimt", "Klimt"), Campaign = c("Big Campaign Baliest",
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest",
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest",
"Big Campaign Baliest", "Big Campaign Klimt", "Big Campaign Klimt",
"Big Campaign Klimt", "Big Campaign Klimt", "Big Campaign Klimt"
), Version = c("Creative_V2", "Creative_V1", "Creative_V3", "Creative_V2",
"Creative_V1", "Creative_V3", "Creative_V2", "Creative_V1", "Creative_V1",
"Creative_V1", "Creative_V1", "Creative_V1", "Creative_V1"),
Signed = c("Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS",
"Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS", "200289",
"200289", "200289", "200289", "200289"), CountryCode = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), State = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), AF = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Billable = c(NA,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA)), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
EXPECTED RESULT:
My script tries to compress data. It should consolidate all numeric data into rows defined by ONLY unique variables. In other words, if a row has unique variables, that row should store all the numeric data, instead of its original state where there are multiple rows with the same variables. I should get the same numeric column totals, just with less rows.
I should get a resulting dataframe with a sum of 8 in the "Billable" column. And no duplicates.
File Name Source Date Colour Size MediumSource Campaign Version Signed CountryCode State AF Billable
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V2 Tigle iOS 1
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V3 Tigle iOS 1
Bricks.csv Installations 2/18/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
Bricks.csv Installations 2/18/2019 Klimt Big Campaign Klimt Creative_V1 200289 1
Bricks.csv Installations 2/26/2019 Baliest Big Campaign Baliest Creative_V3 Tigle iOS 1
Bricks.csv Installations 3/1/2019 Baliest Big Campaign Baliest Creative_V2 Tigle iOS 1
Bricks.csv Installations 3/2/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
CURRENT RESULT: For the most part, it is successful. But I have a case where the script that happily sums all other rows, ignores some. This means that some numeric data is overlooked completely, despite sharing variables with another row. This results in data loss. I get less rows, but the column totals are less. Blanks appear in the "Billable" column where there should be data for those unique combination of variables. I am not trying to remove these blanks, these blanks should be 1s instead:
File Name Source Date Colour Size MediumSource Campaign Version Signed CountryCode State AF Billable
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V2 Tigle iOS
Bricks.csv Installations 2/7/2019 Baliest Big Campaign Baliest Creative_V3 Tigle iOS 1
Bricks.csv Installations 2/18/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
Bricks.csv Installations 2/18/2019 Klimt Big Campaign Klimt Creative_V1 200289
Bricks.csv Installations 2/26/2019 Baliest Big Campaign Baliest Creative_V3 Tigle iOS 1
Bricks.csv Installations 3/1/2019 Baliest Big Campaign Baliest Creative_V2 Tigle iOS 1
Bricks.csv Installations 3/2/2019 Baliest Big Campaign Baliest Creative_V1 Tigle iOS 1
To me, it seems that summing blank+1 in these cases = blank. It should be 1. The current billable column sums to 6 instead of 8 as it should be as afmorementioned.
Thank you all for your time! I am grateful on any feedback on my script and my methods of writing this post. If it can be bettered in any way, I am open to it!
The sum of Billable
is 8 with either of these implementations. Since you have edited the question several times, you may want to consider deleting and asking a new one.
library(tidyverse)
res1 <- df %>%
mutate_at(vars(AF, Billable), ~ gsub("\\$", "", .)) %>%
mutate_at(vars(AF, Billable), ~ as.numeric(.)) %>%
group_by_at(vars(`File Name`:State)) %>%
summarize_all(~ sum(., na.rm = TRUE))
res1
#> # A tibble: 8 x 13
#> # Groups: File Name, Source, Date, Colour, Size, MediumSource, Campaign,
#> # Version, Signed, CountryCode [8]
#> `File Name` Source Date Colour Size MediumSource Campaign Version
#> <chr> <chr> <date> <lgl> <lgl> <chr> <chr> <chr>
#> 1 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 2 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 3 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 4 Bricks.csv Insta~ 2019-02-18 NA NA Baliest Big Cam~ Creati~
#> 5 Bricks.csv Insta~ 2019-02-18 NA NA Klimt Big Cam~ Creati~
#> 6 Bricks.csv Insta~ 2019-02-26 NA NA Baliest Big Cam~ Creati~
#> 7 Bricks.csv Insta~ 2019-03-01 NA NA Baliest Big Cam~ Creati~
#> 8 Bricks.csv Insta~ 2019-03-02 NA NA Baliest Big Cam~ Creati~
#> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
#> # AF <dbl>, Billable <dbl>
sum(res1$Billable) == 8
#> [1] TRUE
res2 <- df %>%
distinct() %>%
filter(! is.na(Billable))
res2
#> # A tibble: 8 x 13
#> `File Name` Source Date Colour Size MediumSource Campaign Version
#> <chr> <chr> <date> <lgl> <lgl> <chr> <chr> <chr>
#> 1 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 2 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 3 Bricks.csv Insta~ 2019-02-07 NA NA Baliest Big Cam~ Creati~
#> 4 Bricks.csv Insta~ 2019-02-18 NA NA Baliest Big Cam~ Creati~
#> 5 Bricks.csv Insta~ 2019-02-26 NA NA Baliest Big Cam~ Creati~
#> 6 Bricks.csv Insta~ 2019-03-01 NA NA Baliest Big Cam~ Creati~
#> 7 Bricks.csv Insta~ 2019-03-02 NA NA Baliest Big Cam~ Creati~
#> 8 Bricks.csv Insta~ 2019-02-18 NA NA Klimt Big Cam~ Creati~
#> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
#> # AF <lgl>, Billable <int>
sum(res2$Billable) == 8
#> [1] TRUE