Search code examples
rdplyrreadr

Tidy dataframes with special characters


My dataset in csv format (available here) is as follows: enter image description here

I tried to import the data into R with the following document:

library(tidyverse)
data <- read_csv("<Path to file>\\Sample.csv")

When I tried to extract column with data$, I get the error:

Error in gsub(reStrip, "", completions, perl = TRUE) : 
  input string 4 is invalid UTF-8

Column D of my data contains both numbers and special characters * and the Pilcrow symbol. The number in the last last row for this column is 66.1 but is followed by #. Similarly, column I contains * and the Pilcrow symbol.

I want to clean the data so that for Column D and I, rows with * and the Pilcrow symbol are replaced with NA. In addition, "66.1 #" is converted to "66.1".

My desired table is as below: enter image description here

I would like to know how I could import the csv file into R and clean it to the desired format in R. Preferable with tidyverse.


Solution

  • Here's a simple solution.

    Read-in the file as a .txt file:

    Edit 1: As the data contains the hashtag # in some cells, you need to include the argument comment.char="" so that R will display the data following it (it will however still treat it as the onset of a comment):

    df <- read.table("[Your path]/Sample.txt", 
                       header = T, fill = T, quote="", sep="\t", comment.char="")
    

    As others, I find your column headers very unpractical and have taken the liberty to change them to the first eleven letters of the alphabet (merely for visual reasons; you will have reasons to retain your headers):

    colnames(df) <- LETTERS[1:11]
    df
                                        A     B    C       D     E     F           G       H     I      J    K
    1       Adams County, Nebraska(6,10)  31001 ***    62.1  51.4  74.6          25  stable  -5.2  -32.3  32.8
    2   Adams County, North Dakota(6,10)  38001 ***       *      *    *  3 or fewer       *     *      *     *
    3 Aiken County, South Carolina(6,10)  45003 ***    55.9  51.5  60.6         124  stable  -2.3   -8.8   4.6
    4        Aitkin County, Minnesota(6)  27001  ***      ¶     ¶     ¶           ¶       ¶     ¶      ¶     ¶
    5   Albemarle County, Virginia(6,10)  51003 ***    49.4  44.1  55.3          64  stable  -2.9  -18.1  15.2
    6      Alcona County, Michigan(6,10)  26001 ***  66.1 #  51.2  86.7          17  stable  -3.6  -26.5  26.3
    

    The desired transformations can be done in many ways; one simple way is this (transforms the two columns separately):

    df[,4] <- gsub("\\*|¶", NA, df[,4])
    df[,9] <- gsub("\\*|¶", NA, df[,9])
    

    A slightly more complex but more economical way is this (transforms the two columns in one go):

    df[,c(4,9)] <- lapply(df[,c(4,9)], function(x) gsub("\\*|¶", NA, x))
    

    Edit 2: If you want to get rid of the hashtag in the data, this will do it for column D:

    df[,4] <- gsub(" #", "", df[,4])
    

    If you want to get rid of # in a range of columns, subset dfon that range and use lapply:

    df[,c([range])] <- lapply(df[,c([range])], function(x) gsub(" #", "", x))
    

    The output is this:

    df
                                        A     B    C     D     E     F           G       H     I      J    K
    1       Adams County, Nebraska(6,10)  31001 ***  62.1  51.4  74.6          25  stable  -5.2  -32.3  32.8
    2   Adams County, North Dakota(6,10)  38001 ***   <NA>     *    *  3 or fewer       *   <NA>     *     *
    3 Aiken County, South Carolina(6,10)  45003 ***  55.9  51.5  60.6         124  stable  -2.3   -8.8   4.6
    4        Aitkin County, Minnesota(6)  27001  ***  <NA>    ¶     ¶           ¶       ¶   <NA>     ¶     ¶
    5   Albemarle County, Virginia(6,10)  51003 ***  49.4  44.1  55.3          64  stable  -2.9  -18.1  15.2
    6      Alcona County, Michigan(6,10)  26001 ***  66.1  51.2  86.7          17  stable  -3.6  -26.5  26.3