r, unicode, ascii, iconv, fixed-width

Coerce single-byte ASCII from a text file


I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in with read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally contain a wonky multibyte character. For example, often enough to be annoying, the data file has whatever the system assigns to Unicode U+F8FF where a "U" should be. (In OS X that's an apple symbol, but I'm not sure whether that is a cross-platform standard.) When that happens, I get an error like this:

invalid multibyte string at 'NTY <20> MAINE
000008 [...]

That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)

I'd like to do all the coding in R, and I'm just not sure how to coerce the text to single-byte. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?

Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?

Any help much appreciated!


Solution

  • What does the output of the file command say about your data file?

    /tmp >file a.txt b.txt 
    a.txt: UTF-8 Unicode text, with LF, NEL line terminators
    b.txt: ASCII text, with LF, NEL line terminators
    

    You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows-1252 encoding:

    # \x{93} and \x{94} are Windows 1252 quotes
    /tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt 
    /tmp >file a.txt
    a.txt: Non-ISO extended-ASCII text
    /tmp >cat a.txt 
    He said, ?hello!?
    

    Now, with iconv you can try to convert it to ASCII:

    /tmp >iconv -f windows-1252 -t ascii a.txt 
    He said, 
    iconv: a.txt:1:9: cannot convert
    

    Since ASCII has no direct equivalent for those quote characters, the conversion fails. Instead, you can tell iconv to transliterate:

    /tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt  > converted.txt
    /tmp >file converted.txt
    converted.txt: ASCII text
    /tmp >cat converted.txt 
    He said, "hello!"
    

    There might be a way to do this using R's IO layer, but I don't know R.

    Hope that helps.
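
For the specific case in the question, where a stray U+F8FF stands in for a "U", transliteration cannot know which letter was intended, so a targeted byte-level substitution may serve better. Below is a minimal shell sketch, not a definitive fix. It counts the non-ASCII bytes, lists the affected line numbers, and swaps the UTF-8 encoding of U+F8FF (bytes EF A3 BF) for a single "U", so each row keeps the same number of characters. The function names and sample data are made up for illustration, and the sketch assumes the file is UTF-8 and that the stray character always replaces a "U"; neither is confirmed by the question.

```shell
#!/bin/sh
# Sketch only: helper names and sample data are hypothetical.
# Assumes the stray character is U+F8FF encoded as UTF-8 (EF A3 BF)
# and that it always stands in for an ASCII "U".

# Count bytes outside the 7-bit ASCII range (0x00-0x7F).
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print the numbers of lines that contain a byte outside printable ASCII.
# Note: this also flags tabs and other control characters.
lines_with_non_ascii() {
    LC_ALL=C grep -n '[^ -~]' "$1" | cut -d: -f1
}

# Write the file to stdout with each EF A3 BF sequence replaced by "U".
# One character in, one character out, so column positions are preserved.
fix_private_use_u() {
    bad=$(printf '\357\243\277')
    LC_ALL=C sed "s/$bad/U/g" "$1"
}

# Small demonstration on a throwaway file.
tmp=$(mktemp)
printf 'YORK CO\357\243\277NTY   MAINE\n000008\n' > "$tmp"
count_non_ascii "$tmp"
lines_with_non_ascii "$tmp"
fix_private_use_u "$tmp"
rm -f "$tmp"
```

If the work has to stay inside R, commands like these could presumably be invoked through system() or pipe() before the cleaned file is handed to read.fwf(). Checking the count first is a cheap way to see whether a file needs fixing at all.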