Search code examples
rdata-cleaning

How do I delete a string that's not a row or column


I'm working in R and I have a file where there is a string in the first row of the .csv, a blank row, and then my date.

I tried to remove this by doing:

df <- df[-c(1,2), ]

Instead of removing this string and the empty row, it removes the first two rows of the data. This indicates to me that R is disregarding this as part of my dataset.

I would like to remove it, but lack the vocabulary as to what this data is so that I can figure out how to remove it. I don't want to remove anything that might indicate important information like:

# A tibble: 0 x 5
# ... with 5 variables: row <int>, col <int>, expected <chr>,
#   actual <chr>, file <chr>

So my assumption is to not delete the header.

Here is the output when I type head(df):

# A tibble: 6 x 2
  `Category: All categories` ...2               
  <chr>                      <chr>              
1 NA                         NA                 
2 Month                      st: (United States)
3 2004-01                    25                 
4 2004-02                    21                 
5 2004-03                    20                 
6 2004-04                    24    

Here is the file, if that helps. It is output from Google Trends.

Thank you, I'm new at this, but really trying to learn it. :)


Solution

    1. Your use of -c(0,1) suggests that you believe R's indexing to be 0-based (similar to python and C); it is not, R's indexing is 1-based.

    2. Most (all?) CSV-reading functions have a skip= argument:

      head(read.csv("2004_present_st.csv", skip=2))
      #     Month st...United.States.
      # 1 2004-01                  25
      # 2 2004-02                  21
      # 3 2004-03                  20
      # 4 2004-04                  24
      # 5 2004-05                  24
      # 6 2004-06                  24
      
      ### all verified to work similarly
      readr::read_csv("2004_present_st.csv", skip=2)
      data.table::fread("2004_present_st.csv", skip=2)
      vroom::vroom("2004_present_st.csv",skip=2)
      
    3. In case you've already read in the data and need to repair it, though ...

      df <- read.csv("2004_present_st.csv")
      nms <- unlist(df[2,])
      df <- df[-(1:2),]
      names(df) <- nms
      df[] <- lapply(df, type.convert, as.is = TRUE)
      head(df)
      #     Month st: (United States)
      # 3 2004-01                  25
      # 4 2004-02                  21
      # 5 2004-03                  20
      # 6 2004-04                  24
      # 7 2004-05                  24
      # 8 2004-06                  24
      
      str(df)
      # 'data.frame': 217 obs. of  2 variables:
      #  $ Month              : chr  "2004-01" "2004-02" "2004-03" "2004-04" ...
      #  $ st: (United States): int  25 21 20 24 24 24 23 21 20 23 ...
      

      Notes on this cleanup:

      • I used -(1:2) instead of -c(1,2), mostly the same.
      • df[] <- lapply(df, ...) is a way to iterate a function over all columns and return them into place, preserving the class of "data.frame" for the df object. Had we done df <- lapply(df, ...) instead, df would now be a list; as.data.frame(lapply(df, ...)) would work, but I think df[] <- lapply(df, ...) looks/works better.
      • When you read in the data the first time, it's likely that due to the data in the top couple of rows, some or all of the columns might be misconstrued as strings. type.convert is used by read.csv (and other places) to try to convert strings into integers and numbers; as.is=TRUE prevents strings from being converted to factors, feel free to omit this if you prefer.