
Reading in data frames with stringsAsFactors = FALSE in R using the chain operator


I have a table source that I read into a data frame. I know that by default, strings from external sources are read into data frames as factors. I'd like to apply stringsAsFactors=FALSE in the data.frame() call below, but it throws an error when I do. Can I still use chaining and set stringsAsFactors=FALSE?

library(rvest)
pvbData <- read_html(pvbURL)
pvbDF <- pvbData %>%
  html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
  html_table() %>%
  data.frame()

data.frame(, stringsAsFactors = FALSE)  # <- throws an error

I know this is probably something very simple, but I'm having trouble finding a way to make this work. Thank you for your help.


Solution

  • With chaining, the call should logically be data.frame(stringsAsFactors=FALSE), since the pipe supplies the first argument, but even that doesn't produce the required output.

    The reason is a misunderstanding of what the stringsAsFactors option does. It only takes effect when you build the data.frame column by column from character vectors. Example:

    a <- data.frame(x = c('a', 'b'), y = c(1, 2), stringsAsFactors = TRUE)
    str(a)
    
    'data.frame':   2 obs. of  2 variables:
     $ x: Factor w/ 2 levels "a","b": 1 2
     $ y: num  1 2
    
    a <- data.frame(x = c('a', 'b'), y = c(1, 2), stringsAsFactors = FALSE)
    str(a)
    
    'data.frame':   2 obs. of  2 variables:
     $ x: chr  "a" "b"
     $ y: num  1 2
    

    If you pass an existing data.frame as input, the stringsAsFactors option has no effect on its columns.
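
A minimal sketch of this behavior, using a small made-up data frame (the names df and df2 are purely illustrative): a column that is already a factor stays a factor when the data frame is passed back through data.frame() with stringsAsFactors = FALSE.

```r
# Hypothetical example data: x is created as a factor explicitly,
# so the result does not depend on the R version's default.
df <- data.frame(x = factor(c("a", "b")), y = c(1, 2))

# Passing an existing data frame back through data.frame():
# stringsAsFactors only affects character vectors, so x is left alone.
df2 <- data.frame(df, stringsAsFactors = FALSE)

class(df2$x)  # still "factor"
```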

    Solution:

    Store the result of the chain in a variable like this:

    library(rvest)
    pvbData <- read_html(pvbURL)
    pvbDF <- pvbData %>%
      html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
      html_table()
    

    And then apply this command:

    data.frame(as.list(pvbDF), stringsAsFactors = FALSE)
    

    Update:

    If a column is already a factor, this command will not convert it to a character vector. In that case, convert it with as.character() first and then retry.
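
One way to do that conversion for every factor column at once is sketched below in base R. The helper name factors_to_chars and the sample data are assumptions made for illustration; they are not part of the original code.

```r
# Hypothetical helper: convert every factor column to character,
# leaving all other columns untouched.
factors_to_chars <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

# Made-up sample data with one factor column.
df <- data.frame(x = factor(c("a", "b")), y = c(1, 2))
out <- factors_to_chars(df)
str(out)
```

Because the helper takes the data frame as its first argument, it could also serve as the last step of a %>% chain, for example pvbDF %>% factors_to_chars(), provided pvbDF is a single data frame rather than a list of them.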

    You may refer to "Change stringsAsFactors settings for data.frame" for more details.