r subset data-cleaning levels

Identifying and correcting typos in a data set using subset


I have a data set (available at this link: https://drive.google.com/file/d/0B4Mldbnr1-avMDIxYmZLSnRfUDA/view?usp=sharing) and I want to correct typos in it using the subset and levels functions. Here is what I have been trying, but it does not seem to work:

# Setting working directory
setwd("F:/Intro Data Science/Assignment Part B/Assignment Part B-20170902")
plot.new()
options(digits=2)

# Loading packages
install.packages("lubridate")  # run once if the package is not installed yet
library(lubridate)

# Reading data set
power <- read.csv("data set 6.csv", na.strings="")

# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)

#Data Cleaning
levels(power$City)<- c(levels(power$City),"Auckland")
power$City[power$City == "Ackland"] <- "Auckland"

I would really appreciate your help. This was supposed to be easy, because I followed exactly what was given in the lecture, but nothing changes when I run the code. Thanks, Nelson

The output requested:

> dput(head(power, 30))
structure(list(Area = c(144.38, 176.83, 268.71, 208.67, 123.61, 
199.3, 109.46, 183.28, 110.61, 146.91, 77.451, 232.65, 270.94, 
49.191, 234.5, 280.93, 192.18, 95.918, 230.74, 72.698, 129.26, 
110.76, 199.44, 129.75, 146.8, 287.97, 162.1, 249.03, 159.3, 
272.51), City = c("Auckland ", "Auckland ", "Auckland ", "Auckland ", 
"Auckland ", "Auckland ", "Auckland ", "Auckland ", "Auckland ", 
"Auckland ", "Auckland ", "Auckland ", "Auckland ", "Auckland ", 
"Auckland ", "Auckland ", "Auckland ", "Ackland ", "Auckland ", 
"Auckland ", "Auckland ", "Auckland ", "Auckland ", "Auckland ", 
"Auckland ", "Auckland ", "Auckland ", "Auckland ", "Auckland ", 
"Auckland "), P.Winter = c(1684.9, 1926.7, 2026.9, 1938.1, 1579.9, 
1991.4, 1572.5, 1691.2, 1684.2, 1743.6, 1234.6, 2043, 1986.7, 
1259.7, 1870.4, 2115.6, 18000, 1452, 1936.2, 1430.2, 1587.3, 
1614.3, 1993.2, 1746.4, 1807.6, 2009.4, 1859.1, 1985.5, 1909.4, 
1892.7), P.Summer = c(1194.5, 1487.3, 1737.3, -158, 1148.1, 1445.8, 
885.77, 1393, 1191.5, 1149.9, 813.38, 1623.8, 1708, 874.48, 1635.7, 
1826.1, 1596.6, 793.71, 1668.8, 905.6, 1227.3, 938.38, 1523.1, 
1012.6, 1122.8, 1829.5, 1223.3, 1653.2, 1175.5, 1882)), .Names = c("Area", 
"City", "P.Winter", "P.Summer"), row.names = c(NA, 30L), class = "data.frame")

Solution

  • I believe that the function you want is droplevels.
    First, make up some data.

    set.seed(5295)    # make the results reproducible
    cities <- factor(sample(c("Ackland", "Auckland", "Wellington", "Sidney"), 100, TRUE))
    power <- data.frame(City = cities)
    

    Now the code, starting with yours.

    power$City[power$City == "Ackland"] <- "Auckland"
    power$City <- droplevels(power$City)
    
    levels(power$City)    # check if it worked
    #[1] "Auckland"   "Sidney"     "Wellington"
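
When the column really is a factor, an alternative to assigning and then calling droplevels is to rename the misspelled level itself. The following is only a minimal sketch on a made-up factor (the name city and its values are assumptions standing in for power$City); giving a level the name of an existing level merges the two.

```r
# Hypothetical toy factor standing in for power$City
city <- factor(c("Auckland", "Ackland", "Wellington", "Auckland"))

# Rename the misspelled level; because "Auckland" already exists,
# the two levels are merged into one
levels(city)[levels(city) == "Ackland"] <- "Auckland"

levels(city)    # the misspelled level is gone
table(city)     # counts per remaining level
```

This only applies to factors. On a character column there are no levels to rename, so plain assignment is the way to go.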
    

    EDIT.
    After seeing the output of dput(head(power, 30)), the solution became obvious. The column City is of class character, not factor, and it contains no values "Ackland" or "Auckland": every value has a trailing white space, which is why the comparison in the question never matches. So all we need to do is replace "Ackland " with "Auckland" and remove the trailing white spaces.

    str(power)
    #'data.frame':   30 obs. of  4 variables:
    # $ Area    : num  144 177 269 209 124 ...
    # $ City    : chr  "Auckland " "Auckland " "Auckland " "Auckland " ...
    # $ P.Winter: num  1685 1927 2027 1938 1580 ...
    # $ P.Summer: num  1194 1487 1737 -158 1148 ...
    
    which(power$City == "Ackland ")    # note the white space
    #[1] 18
    
    which(power$City == "Auckland ")    # note the white space
    # [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 19 20 21 22 23 24 25 26
    #[26] 27 28 29 30
    
    # replace the value "Ackland " (with trailing white space) by "Auckland"
    power$City[power$City == "Ackland "] <- "Auckland"
    power$City <- trimws(power$City)    # remove white spaces from all of them
    

    And no columns vanish, just run str(power) to see it.
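
The same idea can be sketched with the trimming done first, so that the comparison from the question works unchanged. This is an illustration on a made-up character vector (the name city and its four values are assumptions that mimic the dput output, not the real data).

```r
# Hypothetical toy column: character values with a trailing space,
# mimicking the City column shown in the dput() output
city <- c("Auckland ", "Auckland ", "Ackland ", "Auckland ")

# The comparison from the question matches nothing,
# because every value carries a trailing space
any(city == "Ackland")
# [1] FALSE

# Trim first, then tabulate the distinct values to spot typos
city <- trimws(city)
table(city)

# Now the plain comparison finds the misspelled entry
city[city == "Ackland"] <- "Auckland"
unique(city)
# [1] "Auckland"
```

Calling table() on the trimmed column is a quick way to identify typos, since a misspelling shows up as an extra value with a very small count.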