Importing CSV data containing commas, thousand separators and trailing minus sign


I'm using R 2.13.1 on Mac OS X. I'm trying to import a data file that uses a point as the thousands separator and a comma as the decimal mark, with a trailing minus sign for negative values.

Basically, I'm trying to convert from:

"A|324,80|1.324,80|35,80-"

to

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

Interactively, both of the following work:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

and also combining them:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

However, I'm not able to remove the thousands separator from within read.table:

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

Note that if I change from "\\." to "," in the function, things look a bit different:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

I think the problem is that read.table with dec="," converts the incoming "," to "." BEFORE calling as(from, "num.with.commas"), so that the input string can be e.g. "1.324.80".

I want as("1.123,80-","num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.

How can I make my num.with.commas replace all except the last decimal point in the input string?

Update: First, I added a negative lookahead and got as() working in the console:

setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8
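
The lookahead above only protects a point that is followed by exactly two digits at the end of the string. A variant that keeps whichever point comes last, regardless of how many digits follow it, might look like the sketch below. The function name keep_last_point is hypothetical, and the sketch assumes every input really has a decimal part: a value such as "1.324" with only a thousands separator would be read as 1.324.

```r
# Hypothetical helper: drop every "." except the last one, then coerce.
keep_last_point <- function(x) {
  x <- sub("^(.+)-$", "-\\1", x)                 # move a trailing minus to the front
  x <- gsub("\\.(?=.*\\.)", "", x, perl = TRUE)  # remove any "." that has another "." after it
  as.numeric(x)
}

vals <- keep_last_point(c("1.210.123.80-", "10.123.80", "324.80"))
print(vals, digits = 10)
```

The positive lookahead `(?=.*\\.)` needs perl=TRUE, just like the negative lookahead in the original attempt.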

However, read.table still had the same problem. Adding some print()s to my function showed that num.with.commas in fact got the comma and not the point.
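
For anyone who wants to repeat that check, here is a minimal sketch of the kind of diagnostic described. The class name debug.num and the variable seen are hypothetical and exist only for this test; the conversion deliberately returns NA and just records what read.table hands to it.

```r
library(methods)

# "debug.num" and `seen` are hypothetical names used only for this check.
setClass("debug.num")
seen <- character(0)

setAs("character", "debug.num", function(from) {
  seen <<- c(seen, from)        # record the raw strings exactly as received
  rep(NA_real_, length(from))   # placeholder result; only the input matters here
})

invisible(read.table(textConnection("A|324,80|1.324,80|35,80-"),
                     header = FALSE, sep = "|", dec = ",",
                     quote = "", comment.char = "",
                     colClasses = c("character", rep("debug.num", 3))))

print(seen)  # inspect whether the strings still contain "," despite dec=","
```

If the recorded strings still contain commas, then dec="," is not applied to columns with a custom class, and the conversion function has to handle both separators itself.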

So my current solution is to also replace "," with "." inside the num.with.commas conversion:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8

Solution

  • Remove all the periods first, then change the commas to decimal points, before coercing with as.numeric(). You can later control how decimal points are printed with options(OutDec=","). I do not think R uses commas as decimal separators internally, even in locales where they are conventional.

    > tst <- c("A","324,80","1.324,80","35,80-")
    > 
    > as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
    [1]     NA  324.8 1324.8  -35.8
    Warning message:
    NAs introduced by coercion
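
To connect this back to the question, the same three substitutions can be placed inside the setAs() conversion so that read.table applies them per column. This is only a sketch under the assumption that the custom-class columns reach the conversion as unmodified character strings; parse_euro_number is a hypothetical helper name, not part of R.

```r
library(methods)

# Hypothetical helper: "1.324,80-" style strings to numeric.
parse_euro_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma becomes decimal point
  x <- sub("^(.+)-$", "-\\1", x)       # move a trailing minus to the front
  as.numeric(x)
}

setClass("num.with.commas")
setAs("character", "num.with.commas", function(from) parse_euro_number(from))

mydata <- "A|324,80|1.324,80|35,80-"
mytable <- read.table(textConnection(mydata), header = FALSE, sep = "|",
                      quote = "", comment.char = "", strip.white = TRUE,
                      colClasses = c("character", rep("num.with.commas", 3)))
print(mytable)
```

Note that dec="," is left out of the read.table call here, since the conversion function does all of the separator handling itself.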