I have a dataset derived from Pokemon statistics containing a lot of numerical and categorical data. My end goal is to create a model or recommendation system where a user can input a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:
ID Name Type1 Type2 HP
001 Bulba.. Grass Poison 45
etc...
I understand the Type1/Type2 representation might be problematic. Is there a function that would let me create or modify columns so that, if a Pokemon has a particular type, the new column for that type holds a logical value (0 for false, 1 for true)?
I apologize for the lackluster explanation, but what I want is for my dataset to look like this:
ID Name Grass Poison Water HP
001 Bulba.. 1 1 0 45
etc...
tidyr is a package for data reshaping. Here, we'll use pivot_longer() to put the data into a long format, where the original column names (Type1, Type2) will reside in the column "name", while the values (Grass, Poison, etc.) will reside in the column "value". We filter out rows where is.na(value) is true, because that means the Pokemon did not have a second type. We then create an indicator variable that is always 1, so each Pokemon has indicator == 1 for the types it has. We drop the now extraneous "name" column and use pivot_wider() to turn each unique value in value into its own column, which receives indicator's value as the cell value for each row. Types a Pokemon does not have come out as NA, so finally we mutate all numeric columns to replace missing values with 0, since we know those Pokemon aren't those types.
A better solution than mutate_if(is.numeric, ...) would be to compute the unique values of the types and use mutate_at(vars(pokemon_types), ...). This would avoid unintentionally affecting other numeric columns such as HP, where a genuine NA would otherwise be silently replaced with 0.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Small example dataset; Squirtle has no second type
pokemon <- tibble(
  ID = c(1, 2),
  Name = c("Bulbasaur", "Squirtle"),
  Type1 = c("Grass", "Water"),
  Type2 = c("Poison", NA),
  HP = c(40, 50)
)
pokemon %>%
  pivot_longer(starts_with("Type")) %>%   # Type1/Type2 go to "name", types to "value"
  filter(!is.na(value)) %>%               # drop missing second types
  mutate(indicator = 1) %>%               # 1 marks "has this type"
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%  # one column per type
  mutate_if(is.numeric, .funs = function(x) if_else(is.na(x), 0, x))  # NA becomes 0
#> # A tibble: 2 x 6
#> ID Name HP Grass Poison Water
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Bulbasaur 40 1 1 0
#> 2 2 Squirtle 50 0 0 1
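The alternative mentioned above, restricting the fill to the type columns only, could look roughly like the sketch below. It is not the code from the answer itself: it swaps mutate_at() for across() with all_of(), which assumes dplyr 1.0.0 or later, and it uses tidyr's replace_na() for the fill. The names pokemon_types and result are introduced here for illustration.

```r
library(tidyr)
library(dplyr)

# Same toy data as in the example above
pokemon <- tibble(
  ID = c(1, 2),
  Name = c("Bulbasaur", "Squirtle"),
  Type1 = c("Grass", "Water"),
  Type2 = c("Poison", NA),
  HP = c(40, 50)
)

# Unique, non-missing type names across both type columns
# (pokemon_types is a name chosen for this sketch)
all_types <- c(pokemon$Type1, pokemon$Type2)
pokemon_types <- unique(all_types[!is.na(all_types)])

result <- pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  # Fill NA with 0 only in the type columns; HP and ID are left untouched
  mutate(across(all_of(pokemon_types), ~ replace_na(.x, 0)))

result
```

Because only the columns named in pokemon_types are touched, a missing HP value would stay NA instead of being turned into 0. In recent tidyr versions, passing values_fill = 0 to pivot_wider() should achieve the same result without the final mutate() step, since it only fills cells created by the reshape.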