UPDATE
I imported a database from CSV file using the following command:
data.CGS <- read.csv("filepath", sep=";", na.strings=c(""," ","NA"), stringsAsFactors =F)
One column in the CSV file has different types of data, numerical, integers, percentages and characters strings.
Say, for simplicity, that this column has the following elements col=[1,2,1, c, 2%, 4%, 15.5, 16.5]
So, in R will read this column as if one created this variable
col<-c("1","2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")
My purpose is to do some tabulations and compute some statistics based on the "truly" numerical data, which in this example are all values except the letter "c" and the dates, 1980 and 1/12/1950.
What is the easiest way to do this in R ? Any help will be much appreciated.
Of course, there is the very simple thing to do, which is to coerce all elements to be numeric, but then in R this implies convert all characters into NA - which I do not like.
One way is to create a new vector that is separate from any text characters.
## Create new vector without any characters
col2 <- col[-grep("[a-zA-Z]", col)]
## To strip percentages (%)
strip_percents <- as.numeric(gsub("%", "", col2))
## All numbers except percentages
no_percents <- as.numeric(col2[-grep("%", col2)])
## Save strings in new vector
all_yo_strings <- col[grep("[a-zA-Z]", col)]
## Save percentages in a new vector
all_yo_percents <- col[grep("%", col)]
all_yo_percents <- as.numeric(gsub("%", "", all_yo_percents))/100
Does that work for your purposes? It will preserve your text strings in the original col
variable (which you can access by simply removing the -
from col[-grep("[a-zA-Z]", col)]
), while giving you a new, numeric vector.