Search code examples
rcsvimportnul

'embedded nul in string' error trying to import multiple .csv files en masse from different subdirectories in R


I have a large number of csv data files that are located in many different subdirectories. The files all have the same name and are differentiated by the subdirectory name.

I'm trying to find a way to import them all into r in such a way that the subdirectory name for each file populates a column in the datafile.

I have generated a list of the files using list.files(), which I've called tto_refs.

head(tto_refs) 1 "210119/210115 2021-01-19 16-28-14/REF TTO-210119.D/REPORT01.CSV" "210122/210115 2021-01-22 14-49-41/REF TTO-210122.D/REPORT01.CSV"
[3] "210127/210127 2021-01-27 09-39-15/REF TTO-210127_1.D/REPORT01.CSV" "210127/210127 2021-01-27 09-39-15/REF TTO-210127_2.D/REPORT01.CSV"
[5] "210127A/210127 2021-01-28 15-57-40/REF TTO-210127A_1.D/REPORT01.CSV" "210127A/210127 2021-01-28 15-57-40/REF TTO-210127A_2.D/REPORT01.CSV"

I tried a few different methods to import the data into r, but they all had errors related to 'embedded nul(s)'.

For example, tbl <- tto_refs %>% map_df(~read.csv(.))

There were 50 or more warnings (use warnings() to see the first 50)

warnings() Warning messages: 1: In read.table(file = file, header = header, sep = sep, ... : line 1 appears to contain embedded nulls 2: In read.table(file = file, header = header, sep = sep, ... : line 2 appears to contain embedded nulls

etc.

How can I get this data into R?

Edit: the .csv files are generated from Agilent Chemstation analytical software.

The data looks like this: enter image description here


Solution

  • Your files are in the UTF-16 (or UCS-2) character encoding. This means that each character is represented by two bytes. Because the data only contain ASCII characters, the second byte of each character is 0.

    Because R is expecting a single-byte-per-character encoding, it thinks the second byte is meant to be a null character, which should not be present in a CSV file.

    In addition the files contain a byte-order-mark at the start of the first line, which is being converted to garbage. You need a UTF-16 to UTF-8 converter program. This should also remove the byte order mark (which is not required in UTF-8).

    Personally I would do this using the tool iconv. If I were using Windows I would use Cygwin to install it.

    for f in *.CSV
      do iconv -f UTF-16 -t UTF-8 <"$f" >"${f%.CSV}-utf8.csv"
    done
    

    If you don't like this approach there are several other tools listed as answers to this question.