I am working with R to scrape and clean a data for my work as journalist. I could get the table of the HTML, then read it as dataframe and rename the columns' name. Now I am trying to create a new column which gets a value considering other column' values.
This new column should get the values of "Avante", "DEM", "MDB", "Patriota", "PCdoB" and so on. It´s the party of each deputy. Avante, for example, has three deputies, who are "Adalberto Cavalcanti", "Cabo Sabino" and "Silvio Costa". The name of the deputies always come below the whole row with the party´s name.
url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"
library(xml2)
library(rvest)
file <- read_html(url)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[3], fill = TRUE, header = T)
head(table1)
table1_df <- as.data.frame(table1)
colnames(table1_df) <- c("deputado", "uf", "voto")
This is what I have right now: enter image description here
This is what I want: enter image description here
Here's a solution that uses only base R:
url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"
library(xml2)
library(rvest)
file <- read_html(url)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[3], fill = TRUE, header = T)
head(table1)
table1_df <- as.data.frame(table1)
colnames(table1_df) <- c("deputado", "uf", "voto")
# create the new column for later
table1_df$new_column <- NA
# identify rows with the Total PARTY: NUM rows
idx <- grep("Total.*: \\d+", table1_df$deputado)
# Loop over these and assign the values
for (i in seq_along(idx)){
# Extract the number of deputados
n <- as.numeric(sub("^.*: ", "", table1_df$deputado[idx[i]]))
# Extract the party
partido <- sub("Total ", "", table1_df$deputado[idx[i]])
partido <- sub(": .*", "", partido)
# Assign the values
table1_df$new_column[(idx[i] - n):(idx[i] - 1)] <- partido
}
# Remove the unnecessary lines
table1_df <- table1_df[-grep("Total .*:.*", table1_df$deputado), ]
table1_df <- table1_df[-which(table1_df$deputado == table1_df$uf), ]