Search code examples
htmlrtidyrdata-cleaningreshape2

Clean HTML table with Reshape2


New user of R. Can't think how to even ask the question. I scraped a webpage for HTML tables. Generally, everything went well, except for one table. Instead of there being 7 separate tables, everything got collapsed into 1 table, with the column name and the value for the first table being two separate columns, and all the other tables being rows. The results is a table with something like this:

df <- data.frame(is_employed = c("Hobbies", "Has Previous Experience"), false = c("squash", "false"))

Obviously, I need to have the rows (and the column name) in the first column as their own columns, with the item in the second column as their values, preferably with underscores in the columns names. I tried:

df <- dcast(df, ~is_employed, value.var = "false")

But got an error message. Then I thought to add another column, as such:

df2 <- data.frame(number = c(1, 2), is_employed = c("Hobbies", "Has Previous Experience"), false = c("squash", "false"))

then I tried

df3 <- dcast(df2, number ~is_employed, value.var="false")

That placed the values in the first columns as their own columns, but produced two rows (instead of 1), with NAs. I'm sure this is really basic, but I can't figure it out.

On edit: I think this gives me what I want, but I'm away from my computer so I can't confirm:

library("dplyr")
library("tidyr")

mat <- as.matrix(df)
mat <- rbind(colnames(mat), mat)
colnames(mat) <- c("variable", "value")


df2 <- as.data.frame(mat)

df3 <- df2 %>%
   mutate(n = 1) %>%
   spread(variable, value) %>%
   select(-n)

I need to add n or I get NAs, but I don't like it.


Solution

  • Is this what you're after?

    mat <- as.matrix(df)
    mat <- rbind(colnames(mat), mat)
    colnames(mat) <- c("variable", "value")
    mat
    #      variable                  value   
    # [1,] "is_employed"             "false" 
    # [2,] "Hobbies"                 "squash"
    # [3,] "Has Previous Experience" "false" 
    
    as.data.frame(mat)
    #                  variable  value
    # 1             is_employed  false
    # 2                 Hobbies squash
    # 3 Has Previous Experience  false