Search code examples
rcsvdata-cleaning

variable of mixed types in R


UPDATE

I imported a database from CSV file using the following command:

data.CGS <- read.csv("filepath", sep=";", na.strings=c(""," ","NA"), stringsAsFactors =F)

One column in the CSV file has different types of data, numerical, integers, percentages and characters strings.

Say, for simplicity, that this column has the following elements col=[1,2,1, c, 2%, 4%, 15.5, 16.5]

So, in R will read this column as if one created this variable

col<-c("1","2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")

My purpose is to do some tabulations and compute some statistics based on the "truly" numerical data, which in this example are all values except the letter "c" and the dates, 1980 and 1/12/1950.

What is the easiest way to do this in R ? Any help will be much appreciated.

Of course, there is the very simple thing to do, which is to coerce all elements to be numeric, but then in R this implies convert all characters into NA - which I do not like.


Solution

  • One way is to create a new vector that is separate from any text characters.

    ## Create new vector without any characters
    col2 <- col[-grep("[a-zA-Z]", col)]
    
    ## To strip percentages (%)
    strip_percents <- as.numeric(gsub("%", "", col2))
    
    ## All numbers except percentages
    no_percents <- as.numeric(col2[-grep("%", col2)])
    
    ## Save strings in new vector
    all_yo_strings <- col[grep("[a-zA-Z]", col)]
    
    ## Save percentages in a new vector
    all_yo_percents <- col[grep("%", col)]
    all_yo_percents <- as.numeric(gsub("%", "", all_yo_percents))/100
    

    Does that work for your purposes? It will preserve your text strings in the original col variable (which you can access by simply removing the - from col[-grep("[a-zA-Z]", col)]), while giving you a new, numeric vector.