I'm having trouble finding what percent of Canada geese get killed during migration season using the Airplane Strikes data set.
#airline stats table
airlines <- sd4 %>%
group_by(STATE) %>%
filter(SPECIES == "Canada goose" & total_kills > 1) %>%
mutate(fall_mig_kills = ifelse(SPECIES=="Canada goose" & INCIDENT_MONTH %in% c(9,10,11),total_kills,0)) %>%
summarise(
pct_mig_kills = fall_mig_kills/total_kills
) %>%
select(STATE,SPECIES,INCIDENT_MONTH,total_kills,fall_mig_kills,pct_mig_kills)`
here is where i'm getting the error:
summarise(
pct_mig_kills = fall_mig_kills/total_kills
)
and the error is:
Error in summarise_impl(.data, dots) :
Column `pct_mig_kills` must be length 1 (a summary value), not 10
not sure how i'm getting a value longer than length 1 when dividing two integer columns.
any help would be appreciated!
Benjamin
Let's read the data, document everything, and see where your error arises.
In general, you should have a link to your original dataset or provide a shortened version to follow the reproducibility principle. I found an aircraft wildlife strikes, 1990-2015 dataset on Kaggle, which I will be using here. Note: You will need to have a Kaggle account to download the data. It may also be available at data.gov.
library(dplyr)
df <- read.csv("~/../Downloads/database.csv", stringsAsFactors = F)
> df$Species.Name[grepl("Canada goose", df$Species.Name, ignore.case = T)][1]
[1] "CANADA GOOSE"
> names(df)
[1] "Record.ID" "Incident.Year" "Incident.Month"
[4] "Incident.Day" "Operator.ID" "Operator"
[7] "Aircraft" "Aircraft.Type" "Aircraft.Make"
[10] "Aircraft.Model" "Aircraft.Mass" "Engine.Make"
[13] "Engine.Model" "Engines" "Engine.Type"
[16] "Engine1.Position" "Engine2.Position" "Engine3.Position"
[19] "Engine4.Position" "Airport.ID" "Airport"
[22] "State" "FAA.Region" "Warning.Issued"
[25] "Flight.Phase" "Visibility" "Precipitation"
[28] "Height" "Speed" "Distance"
[31] "Species.ID" "Species.Name" "Species.Quantity"
[34] "Flight.Impact" "Fatalities" "Injuries"
[37] "Aircraft.Damage" "Radome.Strike" "Radome.Damage"
[40] "Windshield.Strike" "Windshield.Damage" "Nose.Strike"
[43] "Nose.Damage" "Engine1.Strike" "Engine1.Damage"
[46] "Engine2.Strike" "Engine2.Damage" "Engine3.Strike"
[49] "Engine3.Damage" "Engine4.Strike" "Engine4.Damage"
[52] "Engine.Ingested" "Propeller.Strike" "Propeller.Damage"
[55] "Wing.or.Rotor.Strike" "Wing.or.Rotor.Damage" "Fuselage.Strike"
[58] "Fuselage.Damage" "Landing.Gear.Strike" "Landing.Gear.Damage"
[61] "Tail.Strike" "Tail.Damage" "Lights.Strike"
[64] "Lights.Damage" "Other.Strike" "Other.Damage"
[67] "totalKills"
Notice that the species name is in ALL CAPITAL LETTERS. Use grepl
instead of ==
unless you are certain you know the name verbatim.
There is no total_kills
variable, and the Fatalities
variable represents human fatalities, so I'm going to ignore that filter variable. What I did find was Species.Quantity
, which is probably what you are looking for, the total number of species killed during an incident.
> unique(df$Species.Quantity)
[1] "1" "2-10" "" "11-100" "Over 100"
We can convert these values to numerics for this example.
> dictNames <- unique(df$Species.Quantity)
> dict <- c(1, 2, 0, 11, 100)
> names(dict) <- dictNames
> dict['1']
1
1
> dict['2-10']
2-10
2
> df <- df %>% mutate(totalKills = dict[Species.Quantity])
> table(df$totalKills, useNA = "always")
1 2 11 100 <NA>
146563 21852 1166 46 4477
Great, now let's look at your code.
> df %>%
+ group_by(State) %>%
+ filter(Species.Name == "CANADA GOOSE" & totalKills > 1) %>%
+ mutate(fall_mig_kills = ifelse(Species.Name == "CANADA GOOSE" &
+ Incident.Month %in% c(9,10,11),
+ totalKills,
+ 0)
+ ) %>%
+ summarise(
+ pct_mig_kills = fall_mig_kills/totalKills
+ )
Error in summarise_impl(.data, dots) :
Column `pct_mig_kills` must be length 1 (a summary value), not 19
Hmm, let's see why that is. Reading the help menu by typing in ?summarise
in the console says:
summarise {dplyr} R Documentation Reduces multiple values down to a single value
Description
summarise() is typically used on grouped data created by group_by(). The output will have one row for each group.
Okay, so the output will have one row for each group. Since you have grouped a variable, we need to sum the total kills. Furthermore, you may want to create a new variable "inSeason" which will allow you to summarise your data appropriately.
So, to fix your issue, you simply add in sum
:
+ summarise(
+ pct_mig_kills = sum(fall_mig_kills)/sum(totalKills)
+ )
# A tibble: 49 x 2
State pct_mig_kills
<chr> <dbl>
1 0.70212766
2 AK 0.50000000
3 AL 0.00000000
4 AR 1.00000000
5 CA 0.06185567
Now let's change it to read slightly easier. And you care about season, not state.
> df %>%
+ # inSeason = seasons we care about monitoring
+ # totalKills has NA values, we choose to put deaths at 0
+ mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+ totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+ # canadian geese only
+ filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+ # collect data by inSeason
+ group_by(inSeason) %>%
+ # sum them up
+ summarise(totalDead = sum(totalKills)) %>%
+ # add a ratio value
+ mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 2 x 3
inSeason totalDead percentDead
<chr> <dbl> <dbl>
1 in 838 34
2 out 1620 66
Now you have in season versus out of season, total dead, and a percentage. If you want to add in state, add that variable into your groupings.
One other note, group_by
with a summarise
automatically removes the other columns, so you do not need to use select
at the end.
> df %>%
+ mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
+ totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
+ filter(grepl("canada goose", Species.Name, ignore.case = T)) %>%
+ group_by(State, inSeason) %>%
+ summarise(totalDead = sum(totalKills)) %>%
+ mutate(percentDead = round(100*totalDead/sum(totalDead),0))
# A tibble: 98 x 4
# Groups: State [51]
State inSeason totalDead percentDead
<chr> <chr> <dbl> <dbl>
1 in 52 52
2 out 48 48
3 AB in 1 50
4 AB out 1 50
5 AK in 13 33
6 AK out 26 67
7 AL in 2 40
8 AL out 3 60
9 AR in 6 100
10 CA in 13 8