
Is there an R function that can convert an existing metric into a new logical metric?


I have a dataset derived from Pokemon statistics containing a lot of numerical and categorical data. My end goal is to create a model or recommendation system where a user can input a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:

ID   Name    Type1    Type2   HP 
001  Bulba.. Grass    Poison  45
etc...

I understand the Type1/Type2 columns might be problematic. Is there a function that would let me create new columns so that, if a Pokemon has a particular type, that type's column holds a logical value (0 for false, 1 for true)?

I apologize for the lackluster explanation, but what I want is for my dataset to look like this:

ID   Name    Grass  Poison Water  HP 
001  Bulba..    1      1     0    45
etc...

Solution

  • tidyr is a package for reshaping data. Here, we use pivot_longer() to put the data into long format, where the original column names (Type1, Type2) go into a column called "name" and the values (Grass, Poison, etc.) go into a column called "value". We filter out rows where is.na(value), because a missing value means the Pokemon has no second type. We then create an indicator variable that is always 1, so each Pokemon has indicator == 1 for every type it has. Next we drop the now-extraneous "name" column and use pivot_wider() to turn each unique entry in value into its own column, filled with indicator's value for each row. Pokemon that lack a given type end up with NA in that column, so finally we replace NA with 0 in all numeric columns. A safer approach than mutate_if(is.numeric, ...) would be to compute the unique type values and use mutate_at(vars(pokemon_types), ...), which would not unintentionally affect other numeric columns such as HP.

    library(tidyr)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    pokemon <- tibble(ID = c(1,2), Name = c("Bulbasaur", "Squirtle"),
                      Type1 = c("Grass", "Water"), 
                      Type2 = c("Poison", NA),
                      HP = c(40, 50))
    
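    pokemon %>% 
      # Stack Type1/Type2 into "name" (old column name) and "value" (the type)
      pivot_longer(starts_with("Type")) %>% 
      # Drop rows for a missing second type
      filter(!is.na(value)) %>% 
      mutate(indicator = 1) %>% 
      select(-name) %>% 
      # One column per type; types a Pokemon lacks come out as NA
      pivot_wider(names_from = value, values_from = indicator) %>% 
      # Replace NA with 0 in every numeric column
      mutate_if(is.numeric, .funs = function(x) if_else(is.na(x), 0, x))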
    #> # A tibble: 2 x 6
    #>      ID Name         HP Grass Poison Water
    #>   <dbl> <chr>     <dbl> <dbl>  <dbl> <dbl>
    #> 1     1 Bulbasaur    40     1      1     0
    #> 2     2 Squirtle     50     0      0     1
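
The narrower replacement mentioned above, which touches only the type columns, could be sketched roughly as follows. This is a minimal sketch and not part of the original answer: it assumes a reasonably recent dplyr (1.0 or later, for across()) and tidyr (1.1 or later, for a scalar values_fill), and the names pokemon_types, wide and wide_fill are chosen here purely for illustration.

```r
library(tidyr)
library(dplyr)

pokemon <- tibble(ID = c(1, 2), Name = c("Bulbasaur", "Squirtle"),
                  Type1 = c("Grass", "Water"),
                  Type2 = c("Poison", NA),
                  HP = c(40, 50))

# Collect the distinct, non-missing type names up front
# (pokemon_types is an illustrative name, not from any package).
pokemon_types <- unique(unlist(select(pokemon, starts_with("Type")),
                               use.names = FALSE))
pokemon_types <- pokemon_types[!is.na(pokemon_types)]

# Variant 1: same pipeline, but replace NA with 0 only in the type columns.
wide <- pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  mutate(across(all_of(pokemon_types), ~ coalesce(.x, 0)))

# Variant 2: let pivot_wider() fill the gaps itself, so no later
# replacement step is needed at all.
wide_fill <- pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator,
              values_fill = 0)
```

Both variants leave HP and any other numeric column alone, even if such a column contained genuine missing values. Variant 2 is shorter, while variant 1 stays closer to the steps described in the answer.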