Search code examples
rdplyrtidyverservestdata-cleaning

How to put values inside a column based on other column values in R


I am working with R to scrape and clean a data for my work as journalist. I could get the table of the HTML, then read it as dataframe and rename the columns' name. Now I am trying to create a new column which gets a value considering other column' values.

This new column should get the values of "Avante", "DEM", "MDB", "Patriota", "PCdoB" and so on. It´s the party of each deputy. Avante, for example, has three deputies, who are "Adalberto Cavalcanti", "Cabo Sabino" and "Silvio Costa". The name of the deputies always come below the whole row with the party´s name.

url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"

library(xml2)
library(rvest)
file <- read_html(url)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[3], fill = TRUE, header = T)

head(table1)

table1_df <- as.data.frame(table1)

colnames(table1_df) <- c("deputado", "uf", "voto")

This is what I have right now: enter image description here

This is what I want: enter image description here


Solution

  • Here's a solution that uses only base R:

    url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"
    
    library(xml2)
    library(rvest)
    file <- read_html(url)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[3], fill = TRUE, header = T)
    
    head(table1)
    
    table1_df <- as.data.frame(table1)
    
    colnames(table1_df) <- c("deputado", "uf", "voto")
    
    # create the new column for later
    table1_df$new_column <- NA
    
    # identify rows with the Total PARTY: NUM rows
    idx <- grep("Total.*: \\d+", table1_df$deputado)
    
    # Loop over these and assign the values
    for (i in seq_along(idx)){
      # Extract the number of deputados
      n <- as.numeric(sub("^.*: ", "", table1_df$deputado[idx[i]]))
      # Extract the party
      partido <- sub("Total ", "", table1_df$deputado[idx[i]])
      partido <- sub(": .*", "", partido)
      # Assign the values
      table1_df$new_column[(idx[i] - n):(idx[i] - 1)] <- partido
    }
    
    # Remove the unnecessary lines
    table1_df <- table1_df[-grep("Total .*:.*", table1_df$deputado), ]
    table1_df <- table1_df[-which(table1_df$deputado == table1_df$uf), ]