
R - String column is considered as integer


I encountered a very weird problem when reading a .csv file into a data frame newdata using read.csv.

One of the columns is "Site", and it should be a string:

  • When I look at the data frame using View I see that it contains values such as "www.google.com","www.facebook.com" etc.
  • If I check the column's type with typeof(newdata$Site), I get "integer".
  • If I compute the frequency of each string with table(newdata$Site) and write that table to a .csv file, I get a proper frequency table plus an extra numeric column: one unnamed column of numbers, one column named Var1 with the site strings (e.g. "www.google.com"), and one column named Freq with the counts.

I tried to create a new column that combines multiple values into one (e.g. "www.google.com" and "www.google.co.uk" into "Google") using grepl, and then I realized that R does not treat the original column as a string...

When I tried to extract just this column with a = newdata[,"Site"], a turned out to be a factor, and writing it to a .csv file produced one long line containing all the values...

What am I doing wrong? I'm kind of new to this stuff and I really don't know what to do...

Thanks!!!


Solution

  • You have already done most of the digging: your column Site is a factor, and typeof() reports a factor as "integer".

    To avoid coding strings as factors when reading in data (which was the default behaviour before R 4.0.0), use:

    read.csv(..., stringsAsFactors = FALSE)
    

    Factors are stored as integer codes, where each code gives the position of the value in the factor's levels. Try:

    x <- gl(3, 2, labels = letters[1:3])   ## 3 levels, each repeated twice
    x
    #[1] a a b b c c
    #Levels: a b c
    
    typeof(x)
    #[1] "integer"
    
    levels(x)
    #[1] "a" "b" "c"
    
    levels(x)[x]   ## equivalent to "as.character(x)", but more efficient
    #[1] "a" "a" "b" "b" "c" "c"
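
To tie this back to the original task, here is a minimal sketch of reading the data with strings kept as characters and then collapsing the Google domains into one label. The sample data, the Visits column, and the new column name Group are made up for illustration; only Site and newdata come from the question.

```r
# Hypothetical sample data standing in for the asker's .csv file
csv_text <- "Site,Visits
www.google.com,10
www.facebook.com,7
www.google.co.uk,3"

# Keep strings as character vectors instead of factors
newdata <- read.csv(text = csv_text, stringsAsFactors = FALSE)
typeof(newdata$Site)   # "character"

# If a column has already been read as a factor, convert it explicitly
site_factor <- factor(newdata$Site)
site_chr <- as.character(site_factor)

# Collapse all Google domains into one label ("Group" is an assumed name)
is_google <- grepl("google", newdata$Site, fixed = TRUE)
newdata$Group <- ifelse(is_google, "Google", newdata$Site)

# Frequency table as a data frame with columns Var1 and Freq
freq <- as.data.frame(table(newdata$Group), stringsAsFactors = FALSE)

# row.names = FALSE drops the extra unnamed numeric column in the output file
out_file <- tempfile(fileext = ".csv")
write.csv(freq, out_file, row.names = FALSE)
```

Writing a data frame rather than a bare vector gives one row per value, and row.names = FALSE is what removes the unnamed column of numbers mentioned in the question.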