apache-spark, hive, apache-spark-sql, bigdata, bucket

Get all records from the nth bucket in Hive SQL


How to get all records from the nth bucket in Hive?

Select * from bucketTable from bucket 9;


Solution

  • You can achieve this in different ways:

    Approach-1: Get the table's storage location from desc formatted <db>.<tab_name>,

    then read the 9th bucket file directly from the HDFS filesystem.

    (or)

    Approach-2: Use input_file_name(),

    then filter only the 9th bucket's data by matching the file name.

    Example:

    Approach-1:

    Scala:

    import spark.implicits._

    val df = spark.sql("desc formatted <db>.<tab_name>")

    //get the table location in HDFS
    val loc_hdfs = df.filter('col_name === "Location").select("data_type").collect.map(x => x(0)).mkString

    //bucket files are zero-indexed, so the 9th bucket is 000008_0*
    //change the read format to match your table's storage format
    val ninth_buk = spark.read.orc(s"${loc_hdfs}/000008_0*")

    //display the data
    ninth_buk.show()
    

    PySpark:

    from pyspark.sql.functions import col

    df = spark.sql("desc formatted <db>.<tab_name>")

    #get the table location in HDFS
    loc_hdfs = df.filter(col("col_name") == "Location").select("data_type").collect()[0]["data_type"]

    #bucket files are zero-indexed, so the 9th bucket is 000008_0*
    ninth_buk = spark.read.orc(loc_hdfs + "/000008_0*")

    ninth_buk.show()
    

    Approach-2:

    Scala:

    import org.apache.spark.sql.functions.input_file_name
    import spark.implicits._

    val df = spark.read.table("<db>.<tab_name>")

    //add the source file name of each row
    val df1 = df.withColumn("filename", input_file_name())

    //filter only the 9th bucket's file (000008_0) and keep only the original columns
    val ninth_buk = df1.filter('filename.contains("000008_0")).select(df.columns.head, df.columns.tail:_*)

    ninth_buk.show()
    

    PySpark:

    from pyspark.sql.functions import col, input_file_name

    df = spark.read.table("<db>.<tab_name>")

    #add the source file name of each row
    df1 = df.withColumn("filename", input_file_name())

    #filter only the 9th bucket's file (000008_0) and keep only the original columns
    ninth_buk = df1.filter(col("filename").contains("000008_0")).select(*df.columns)

    ninth_buk.show()
    

    Approach-2 is not recommended if you have huge data, because the whole table has to be read just to filter on the file name.


    In Hive:

    -- allow regex column specification so we can select every column except fn
    set hive.support.quoted.identifiers=none;

    -- input__file__name is a Hive virtual column holding each row's source file
    select `(fn)?+.+` from (
        select *, input__file__name fn from table_name
    ) e
    where e.fn like '%000008_0%';
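
    Side note: Hive's TABLESAMPLE clause can also read a single bucket directly when the sampling column and bucket count match the table's CLUSTERED BY definition. The following is a minimal sketch only; it assumes the table was created with CLUSTERED BY (id) INTO 32 BUCKETS, so substitute your own bucketing column and bucket count (bucket numbers here are 1-based, so bucket 9 corresponds to file 000008_0).

    -- sketch only: bucketing column (id) and bucket count (32) are assumptions
    select * from bucketTable tablesample(bucket 9 out of 32 on id) t;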