Search code examples
rreplacenanumericnaniar

Replace # with NA and convert values into numerics in R


I have a problem with my dataframe. The missing values are marked with # and I cant find a way to automatically replace them with NA.

Here is my dataframe: https://gofile.io/?c=BfpgbC

This is what I have tried:

library(naniar)
df_new= testframe %>% replace_with_na(replace = list(NO2_Königsplatz = "#"))

testframe[testframe== "#"] <- NA

Both does not work. When I manually replace each value, it works but that is not an option because it takes too long.

After replacing the missing values with NA I want to convert all columns (not the 1st column) into numerics to calculate the means.

Any ideas how to solve this?


Solution

  • EDIT WITH CORRECT DATA

    Here is a second approach:

    • read the data "as is"
    • convert the date-time column from character to datetime
    • assume that any non-numeric entry in the other columns = NA
    • convert all other columns from character to integer

    The last step will generate warnings about coerced NA values, which can be ignored. We can use the lubridate and dplyr packages:

    library(dplyr)
    library(lubridate)
    
    dat <- read.table("AUG-2017-Air.dat", 
                      stringsAsFactors=FALSE) %>% 
      mutate(Zeitpunkt = dmy_hm(Zeitpunkt)) %>% 
      mutate_if(is.character, as.integer)
    

    Note that the timezone is assumed UTC unless specified otherwise.

    Result:

    str(dat)
    
    'data.frame':   8760 obs. of  13 variables:
     $ Zeitpunkt        : POSIXct, format: "2017-01-01 01:00:00" "2017-01-01 02:00:00" "2017-01-01 03:00:00" "2017-01-01 04:00:00" ...
     $ NO2_Bourgesplatz : int  31 31 29 30 29 28 27 29 28 25 ...
     $ NO2_Karlstraße   : int  34 35 31 31 31 32 38 35 33 29 ...
     $ NO2_Königsplatz  : int  29 29 28 28 27 27 26 28 28 23 ...
     $ NO2_LfU          : int  31 31 29 28 27 26 25 23 22 24 ...
     $ O3_Bourgesplatz  : int  6 5 2 2 2 2 2 2 2 8 ...
     $ O3_LfU           : int  4 3 3 3 3 3 3 3 3 5 ...
     $ PM10_Bourgesplatz: int  455 417 106 90 87 93 85 79 91 77 ...
     $ PM10_Karlstraße  : int  203 75 58 53 55 60 51 45 48 48 ...
     $ PM10_Königsplatz : int  215 75 62 51 64 63 69 51 50 64 ...
     $ PM10_LfU         : int  376 321 62 101 61 112 123 112 118 147 ...
     $ PM25_Bourgesplatz: int  267 308 100 87 87 94 88 80 92 79 ...
     $ PM25_LfU         : int  160 112 48 62 50 65 62 66 65 76 ...
    

    EDIT WITH BASE R SOLUTION

    dat <- read.table("AUG-2017-Air.dat", 
                      stringsAsFactors=FALSE)
    
    dat[2:13] <- lapply(dat[2:13], as.numeric)
    

    OLD ANSWER WITH INCORRECT DATA

    You can specify which values represent missing values when you import the data to R. In general if unsure about data, it's best to read "as is", explore and then figure out the quirks of that particular dataset, then go back and fix it up.

    For the data linked in your question, this should work:

    testframe <- read.table("testframe.dat", 
                            sep = "", 
                            na.strings = c("    # ", 
                                           "   -"), 
                            stringsAsFactors=FALSE)
    

    Result:

    str(testframe)
    
    'data.frame':   1095 obs. of  13 variables:
     $ Zeitpunkt        : chr  "01.01.2017 07:00" "01.01.2017 14:00" "01.01.2017 21:00" "02.01.2017 07:00" ...
     $ NO2_Bourgesplatz : int  27 22 41 22 20 36 35 24 23 12 ...
     $ NO2_Karlstraße   : int  38 49 53 38 54 45 47 61 32 19 ...
     $ NO2_Königsplatz  : int  26 25 46 26 35 30 33 31 19 8 ...
     $ NO2_LfU          : int  25 16 36 19 13 26 24 17 18 9 ...
     $ O3_Bourgesplatz  : int  2 22 2 23 32 12 8 41 36 55 ...
     $ O3_LfU           : int  3 31 4 28 48 22 23 55 47 67 ...
     $ PM10_Bourgesplatz: int  85 54 74 32 18 11 18 11 13 6 ...
     $ PM10_Karlstraße  : int  51 37 69 34 21 15 23 17 17 11 ...
     $ PM10_Königsplatz : int  69 36 59 26 21 12 18 15 13 8 ...
     $ PM10_LfU         : int  123 35 68 30 12 6 16 11 14 6 ...
     $ PM25_Bourgesplatz: int  88 61 53 32 19 12 15 11 11 6 ...
     $ PM25_LfU         : int  62 42 61 24 15 9 15 11 12 5 ...
    

    You might also consider conversion of the Zeitpunkt column to a datetime class, depending on what you want to do next.