Search code examples
rregexpattern-matchingdummy-variable

Making a variable from detection of any multiple string pattern in one string


end_result_tbl

This end_result_tbl is an example from a different voter file in ideal format.

ID     GEN_16  GEN_14  GEN_08  PP_16  PR_16  PR_15  PR_14
0001     1       1       1       1      0      0       0
0002     0       0       0       0      1      0       1
0003     1       1       1       0      0      0       0
0004     1       0       1       0      0      0       1
0005     1       0       1       1      1      0       1

raw_data_tbl

ID     Voter_History  
0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
0003   20121106 GE;20081104 GE;General Election 2006
0004   GE 20150910
0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION

Looking to make variables for each election out of conditional string matches for each string of text.

Each election has about 9 iterations. if one iteration is matched for an election, then a "1" is placed to show a VOTE in that election, if none are matched, then a "0" for a NO VOTE.

Below are the iterations for the 2016 November General Election

GEN_16<-c("20161108 GE",
          "16 GENERAL ELECTION",
          "GENERAL 2016",
          "GENERAL ELECTION 2016", 
          "2016 GENERAL ELECTION", 
          "GENERAL ELECTION, 2016",
          "16 GENERAL ELECTION",
          "GE 20161108")

Here is what I have tried (attempting only 2016 General Election):

 raw_data_tbl$GEN_16<- 
 as.integer(stri_detect(raw_data_tbl$Voter_History,GEN_16))

 which(GEN_16%in%raw_data_tbl$Voter_History

require(dplyr)
Sequences <- GEN_16
Database <- raw_data_tabl$Voter_History

df=as.data.frame(sapply(Sequences, function(x) grep(x,Database)))
stats=df %>% summarise_all(funs(sum))
cbind(Sequences,as.numeric(stats))

this is actually a quite simple albeit super long code in sql but find it's equivalent in R hard to find.

raw_data_tabl has about 17 million voters in it.

any direction is super appreciated, thanks in advance.


Solution

  • You can try this -

    library(stringr)
    library(tidyverse)
    
    #read input file
    txt <- readLines("test.txt")
    
    #put delimiter between columns and transform it into a dataframe
    txt <- gsub("\\s+(.*)", ",\\1", txt)
    df <- read.table(textConnection(txt), 
                    header = T, stringsAsFactors = F, sep = ",", colClasses = c("ID" = "character"))
    

    Initial dataframe looks like

    > df
    #    ID                                                         Voter_History
    #1 0001 GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
    #2 0002                           2016 GENERAL ELECTION;2014 GENERAL ELECTION
    #3 0003                         20121106 GE;20081104 GE;General Election 2006
    #4 0004                                                           GE 20150910
    #5 0005                              16 GENERAL ELECTION; 14 PRIMARY ELECTION
    

    Clean Voter_History column's data to extract useful information

    election_func <- function(x){
      #extract year
      yr <- gsub("20", "", substr(str_extract_all(strsplit(x, split=";")[[1]], "[0-9]+"), 1, 4))
      #extract election type
      elec_type <- toupper(substr(str_extract(strsplit(x, split=";")[[1]], '[A-Za-z]+'), 1, 2))
    
      return(paste(sort(paste(elec_type, yr, sep="_")), collapse = ";"))
      }
    
    df$Voter_History <- do.call(rbind, lapply(df$Voter_History, function(x) election_func(x)))
    

    Cleaned data is

    > df
    #    ID                 Voter_History
    #1 0001 GE_04;GE_08;GE_12;GE_16;PP_08
    #2 0002                   GE_14;GE_16
    #3 0003             GE_06;GE_08;GE_12
    #4 0004                         GE_15
    #5 0005                   GE_16;PR_14
    

    Finally transform this data in the desired format

    df1 <- df %>%
      separate_rows("Voter_History", sep= ";") %>%
      distinct(ID, Voter_History) %>%
      mutate(value = 1) %>%
      spread(Voter_History, value, fill = 0)
    
    df1
    #    ID GE_04 GE_06 GE_08 GE_12 GE_14 GE_15 GE_16 PP_08 PR_14
    #1 0001     1     0     1     1     0     0     1     1     0
    #2 0002     0     0     0     0     1     0     1     0     0
    #3 0003     0     1     1     1     0     0     0     0     0
    #4 0004     0     0     0     0     0     1     0     0     0
    #5 0005     0     0     0     0     0     0     1     0     1
    


    Sample data: test.txt contains

    ID     Voter_History  
    0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
    0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
    0003   20121106 GE;20081104 GE;General Election 2006
    0004   GE 20150910
    0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION
    


    (Update - added a logic to resolve Error: Duplicate identifiers for rows.... This was happening because of duplicate ID & Voter_History combination in spread call)