My dataset in csv format (available here) is as follows:
I tried to import the data into R with the following code:
library(tidyverse)
data <- read_csv("<Path to file>\\Sample.csv")
When I tried to extract a column with data$, I got the error:
Error in gsub(reStrip, "", completions, perl = TRUE) :
input string 4 is invalid UTF-8
Column D of my data contains both numbers and the special characters * and the Pilcrow symbol (¶). The number in the last row of this column is 66.1 but is followed by #. Similarly, column I contains * and the Pilcrow symbol.
I want to clean the data so that, in columns D and I, cells containing * or the Pilcrow symbol are replaced with NA. In addition, "66.1 #" should be converted to "66.1".
I would like to know how I could import the csv file into R and clean it to the desired format, preferably with the tidyverse.
Here's a simple solution.
Read in the file as a .txt file.
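A basic read.table() call would look something like this (the path is a placeholder, and I'm assuming the data has been saved as a tab-separated .txt file):
df <- read.table("[Your path]/Sample.txt",
                 header = TRUE, fill = TRUE, quote = "", sep = "\t")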
Edit 1:
As the data contains the # character in some cells, you need to include the argument comment.char = "" so that R does not treat # as the start of a comment and instead reads the data following it:
df <- read.table("[Your path]/Sample.txt",
                 header = TRUE, fill = TRUE, quote = "", sep = "\t", comment.char = "")
Like others, I find your column headers impractical, so I have taken the liberty of changing them to the first eleven letters of the alphabet (merely for visual reasons; you may well have reasons to retain your own headers):
colnames(df) <- LETTERS[1:11]
df
A B C D E F G H I J K
1 Adams County, Nebraska(6,10) 31001 *** 62.1 51.4 74.6 25 stable -5.2 -32.3 32.8
2 Adams County, North Dakota(6,10) 38001 *** * * * 3 or fewer * * * *
3 Aiken County, South Carolina(6,10) 45003 *** 55.9 51.5 60.6 124 stable -2.3 -8.8 4.6
4 Aitkin County, Minnesota(6) 27001 *** ¶ ¶ ¶ ¶ ¶ ¶ ¶ ¶
5 Albemarle County, Virginia(6,10) 51003 *** 49.4 44.1 55.3 64 stable -2.9 -18.1 15.2
6 Alcona County, Michigan(6,10) 26001 *** 66.1 # 51.2 86.7 17 stable -3.6 -26.5 26.3
The desired transformations can be done in many ways; one simple way is this (transforms the two columns separately):
df[,4] <- gsub("\\*|¶", NA, df[,4])
df[,9] <- gsub("\\*|¶", NA, df[,9])
A slightly more complex but more economical way is this (transforms the two columns in one go):
df[,c(4,9)] <- lapply(df[,c(4,9)], function(x) gsub("\\*|¶", NA, x))
Edit 2:
If you want to get rid of the # character in the data, this will do it for column D:
df[,4] <- gsub(" #", "", df[,4])
If you want to get rid of # in a range of columns, subset df on that range and use lapply:
df[,c([range])] <- lapply(df[,c([range])], function(x) gsub(" #", "", x))
The output is this:
df
A B C D E F G H I J K
1 Adams County, Nebraska(6,10) 31001 *** 62.1 51.4 74.6 25 stable -5.2 -32.3 32.8
2 Adams County, North Dakota(6,10) 38001 *** <NA> * * 3 or fewer * <NA> * *
3 Aiken County, South Carolina(6,10) 45003 *** 55.9 51.5 60.6 124 stable -2.3 -8.8 4.6
4 Aitkin County, Minnesota(6) 27001 *** <NA> ¶ ¶ ¶ ¶ <NA> ¶ ¶
5 Albemarle County, Virginia(6,10) 51003 *** 49.4 44.1 55.3 64 stable -2.9 -18.1 15.2
6 Alcona County, Michigan(6,10) 26001 *** 66.1 51.2 86.7 17 stable -3.6 -26.5 26.3
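If, after the cleaning above, you also want columns D and I as numbers rather than text (the NA values stay NA), one extra step is:
df[,c(4,9)] <- lapply(df[,c(4,9)], as.numeric)
And since you prefer the tidyverse, here is a rough sketch of the same idea with readr, dplyr and stringr (assuming the same tab-separated Sample.txt as above; "*" and "¶" are declared as NA strings on import, and the trailing " #" is stripped before converting columns 4 and 9, i.e. D and I, to numbers):
library(tidyverse)

# read the tab-separated file, treating "*" and the pilcrow as missing
df <- read_tsv("[Your path]/Sample.txt", na = c("", "*", "¶"))

# drop any trailing " #" and convert columns 4 and 9 to numeric
df <- df %>%
  mutate(across(c(4, 9), ~ as.numeric(str_remove(as.character(.x), " #"))))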