Search code examples
bashshellhivesh

Shell Script to find empty Hive databases


I am working on an audit process to delete empty Hive databases. I have a large number of databases that I need to go through and would like to use a shell script (.sh) in Linux that can run hive -e queries to identify empty databases and list them in some output file or log (wondering if a .txt file is an option?). I would then send this list to our admins to "drop" those empty databases. All of our databases follow the exact same naming convention: "environment" and "area" are always used... only "state" varies. >>> environment_area_<state>

Right now I am using the following queries to get the job done but it's very manual and very slow... I would end up spending a lot of time on the Linux command line.

I connect to Hive first in PuTTY, once connected, I run:

show databases;
use environment_area_<state>;
show tables;

If no tables show in the database, I add it to my list of databases that need to be deleted. I run the "use" and "show tables" queries over and over for every database.

As you can tell, this is a very time consuming approach and creating a shell script would really help.

I have searched online and watched a number of YouTube tutorials, but have not come across a use case that would help me out. Hoping someone much more experienced with shell scripts could help me get beyond #!/bin/bash followed by the queries I listed above.


Solution

  • Keeping hive answering commands in the background may improve performance significantly:

    #!/usr/bin/env bash
    
    tempdir=$(mktemp -d)
    # Cleanup at end of execution
    trap 'rm -fr -- "$tempdir";exit' EXIT INT
    
    hivein="$tempdir/hivein"
    hiveout="$tempdir/hiveout"
    mkfifo "$hivein" "$hiveout"
    
    # Prepare file descriptors IO to talk to hive
    exec 3<>"$hivein"
    exec 4<>"$hiveout"
    
    # Launch hive in the background
    hive -S <&3 >&4 &
    
    # Initialise hive
    printf '%s\n' 'set hive.cli.print.header=false;' >&3
    
    # Wait for hive response and get databases list
    printf '%s\n' "SHOW DATABASES LIKE 'environment_area_*';" >&3
    mapfile -u 4 -t databases
    
    empty_databases=()
    
    for db in "${databases[@]}"; do
      printf 'USE %s; SHOW TABLES;\n' "$db" >&3
      mapfile -u 4 -t tables
      tbl_count="${#tables[@]}"
      printf 'Database %s contains %d tables.\n' "$db" "$tbl_count"
    
      if [ "$tbl_count" -eq 0 ]; then
        # record empty db
        empty_databases+=("$db")
      fi
    done
    
    # Close the hive-cli in case closing the file descriptors is not enough
    printf '%s\n' '!exit' >&3
    
    printf '%s\n' "${empty_databases[@]}" >empty_databases_list.txt