Search code examples
sqlrvertica

How to select all non-null columns?


I am sending the query to VERTICA(SQL) using R ODBC package. I am looking for a fast solution which will select all non-null columns from Table A and put them to new table B.

I have created and function which is in my opinion highly inefficient, but I don't know how to make it faster. The function takes max for every single column in a table and selects it if the MAX is not null. Putting everything into one query wouldn't change much because Vertica is a column database.

  reduce_width <- function(in_name, out_name) {
  query_in <- paste('SELECT * FROM', in_name, 'LIMIT 1;', sep = ' ')

vertica_answer <- dbGetQuery(conn, query_in)

var_names <- colnames(vertica_answer)

lenght <- length(var_names)

var_names <- as.data.frame(var_names, stringsAsFactors = F)

var_names$is_null <- 0

for(i in 1:lenght){
    name <- as.character(var_names[i,1])
    query_max <- paste('SELECT MAX(',name,') from', in_name,';', sep = ' ')
    max_value <- dbGetQuery(conn, query_max)
    if(is.na(max_value[1,1])){ 
      var_names$is_null[i]<-0
    } else {
      var_names $is_null[i]<-1
    }
  }

not_null_collumns <-var_names %>% filter(is_null==1) %>% select(var_names)

not_null_collumns <- (as.character(t(not_null_collumns$var_names)))

query_front <- 'SELECT'
  for(i in 1:(length(not_null_collumns)-1))  {
    query_front <- paste(query_front, not_null_collumns[i],',', sep = ' ')
  }

query_front <- paste(query_front, 
  not_null_collumns[length(not_null_collumns)], sep = ' ')

query_back <- paste('into', out_name, 'from', in_name , sep = ' ')

query_create <- paste(query_front, query_back, sep = ' ')

dbSendQuery(conn, query_create)
}

How can I make my solution faster?


Solution

  • You should query all at once:

    query_all <- paste('SELECT * FROM', in_name, ';', sep = ' ')
    
    vertica_answer <- dbGetQuery(conn, query_all)
    
    Vertica_sub <- vertica_answer[ , colSums(is.na(vertica_answer)) == 0]
    
    #or just non null columns
    
    non_null_cols <- names(verica_answer)[colSums(is.na(vertica_answer)) == 0]
    

    You are already querying the data in your loop. This is more efficient because there is overhead for each query that is run.