Search code examples
rcsvreadrdata-quality

How to process multiple csv files for identifying null values in R?


I have various .csv files. Each file has multiple columns. I am using the given code in R to pursue a quality check that for a particular column, how many rows have valid values and how many are null. The code works well for a single csv file. But I want to run the code for all the csv files and need output for each csv file. Additionally, I want a log file. Could anyone please help me by modifying the code how it can be used to process various csv files.

install.packages("readr") 
library(readr)

check_column <- function(df, column) {
  valid_values <- !is.na(df[[column]])
  num_valid <- sum(valid_values)
  num_null <- nrow(df) - num_valid
  return(c(num_valid, num_null))
}

#Read the CSV file
df <- read_csv("data.csv")

for (column in names(df)) {
  results <- check_column(df, column)
  print(paste(column, ": ", results[1], " valid, ", results[2], " null"))
}

Sample data: (Not all files have same number of columns)

Csv1.csv

D_T  Temp (°C)  Press (Pa)  ...
2021-03-01 00:00:00+00  28  1018  ...
2021-03-02 00:00:00+00  27  1017  ...
2021-03-03 00:00:00+00  28  1019  ...
..
.. 

Csv2.csv

D_T  Temp (°C)  Vel (m/s)  Press (Pa_...
2022-03-01 00:00:00+00  28  118  1018  ...
2022-03-02 00:00:00+00  27  117  1019  ...
2022-03-03 00:00:00+00  28  119  1018  ...
..
.. 

Solution

  • How about something like this? This will not store anything in a variable. Let me know if you need help with it.

    library(readr)    
        
    for(files in list.files(pattern=".*csv$")) {
        file <- read_csv(files)
        out <- file(paste0(files, ".log"), open = "w")
        sapply(colnames(file), function(x) {
                cat(
                        paste0(x, ":"),
                        sum(!is.na(file[, x])),
                        "valid,",
                        sum(is.na(file[, x])),
                        "null\n",
                        file = out
                )
        })
        close(out)
    }
    

    To write into one file only:

    library(readr)    
    
    out <- file("output.log", open = "w")
    for(files in list.files(pattern=".*csv$")) {
            file <- read_csv(files)
            cat(files, "\n", file = out)
            sapply(colnames(file), function(x) {
                    cat(
                            paste0(x, ":"),
                            sum(!is.na(file[, x])),
                            "valid,",
                            sum(is.na(file[, x])),
                            "null\n",
                            file = out
                    )
            })   
    }
    close(out)