R's read.csv prepending 1st column name with junk text


I have exported data from a result grid in SQL Server Management Studio to a csv file. The csv file looks correct.

But when I read the data into an R dataframe using read.csv, the first column name is prepended with "ï..". How do I get rid of this junk text?

Example:

str(trainData)

'data.frame':   64169 obs. of  20 variables:    
 $ ï..Column1             : int  3232...   
 $ Column2                : int  4242...

The data looks something like this (nothing special):

Column1,Column2
100116577,100116577
100116698,100116702


Solution

  • You've got a Unicode UTF-8 BOM at the start of the file:

    http://en.wikipedia.org/wiki/Byte_order_mark

    A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

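    R is giving you the ï and then converting the other two (» and ¿) into dots: by default read.csv runs column names through make.names (check.names = TRUE), which replaces characters that are not valid in a syntactic name with dots.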

    Here:

    https://stat.ethz.ch/pipermail/r-help/2014-February/370760.html

    Duncan Murdoch suggests:

    "You can declare a file to be in encoding "UTF-8-BOM" if you want to ignore a BOM on input"

    So try your read.csv with fileEncoding="UTF-8-BOM" or persuade your SQL wotsit to not output a BOM.

    Otherwise you may as well test if the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that does start like that genuinely...)
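
As a rough illustration of the two suggestions above, here is a minimal sketch. It writes a small temporary file with a UTF-8 BOM to stand in for the SSMS export, reads it with fileEncoding = "UTF-8-BOM", and then shows a fallback that strips the prefix with substr. The names path, demo and strip_bom_prefix are made up for this example and are not from the original answer.

```r
# Build a small CSV that starts with a UTF-8 BOM (bytes EF BB BF).
# This stands in for the SSMS export; `path` is just a temp file for the demo.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n100116698,100116702\n"), con)
close(con)

# Option 1: tell R the file has a BOM so it is skipped while reading.
fixed <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(fixed)

# Option 2 (fallback): strip the junk prefix from the first column name.
# `strip_bom_prefix` is a hypothetical helper; "\u00ef.." is "ï.." written
# with an escape so the script does not depend on the source file's encoding.
strip_bom_prefix <- function(df, prefix = "\u00ef..") {
  first <- names(df)[1]
  if (startsWith(first, prefix)) {
    names(df)[1] <- substr(first, nchar(prefix) + 1, nchar(first))
  }
  df
}

# Simulate a data frame that was already read with the mangled name.
demo <- data.frame(a = 1:2, Column2 = 3:4)
names(demo)[1] <- "\u00ef..Column1"
cleaned <- strip_bom_prefix(demo)
names(cleaned)
```

Option 1 is usually the cleaner choice because it fixes the problem at read time; Option 2 only makes sense if the file has already been read and, as noted above, no real column name begins with that prefix.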