
How can I calculate the size of a Hive table with a specific partition filter?


Background: I can calculate the size of my partitioned Hive table as below (i.e. find the size of every partition, then sum the sizes of the partitions I need):

hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1'
100   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-06
200   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-07
300   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-08
400   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-09
500   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-10
600   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-11

I want to find the size of the table for a specific range of partitions (i.e. dt >= 2021-10-09).

i.e. the size of test.tbl1 for the partitions dt=2021-10-09 through dt=2021-10-11 is 400+500+600 = 1500 bytes.

Is there any way to filter the result during the dfs -du step to avoid manual summation?


Solution

  • With a POSIX awk (e.g. GNU or BSD awk) you can pre-process the last field ($NF) of each line with the gsub function to keep only the date. You can then compute the sum of the first field ($1) if and only if the last field ($NF) is in a defined range. Example for dt >= 2021-10-09:

    $ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
      awk -v dtmin='2021-10-09' '
        {gsub(/.*=/,"",$NF)} $NF>=dtmin{sum+=$1} END{print sum}'
    1500
    

    For a closed range:

    $ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
      awk -v dtmin='2021-10-06' -v dtmax='2021-10-08' '
        {gsub(/.*=/,"",$NF)} $NF>=dtmin && $NF<=dtmax{sum+=$1} END{print sum}'
    600
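    The awk filter can be sanity-checked without a cluster by piping the sample listing from the question into the same program (the paths below just mirror the example output; any lines ending in dt=YYYY-MM-DD work):

    ```shell
    # Simulate the `hdfs dfs -du` listing and apply the open-range filter.
    # gsub strips everything up to '=' from the last field, leaving the date,
    # and ISO dates compare correctly as strings.
    printf '%s\n' \
      '400   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-09' \
      '500   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-10' \
      '600   hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-11' |
    awk -v dtmin='2021-10-09' '
      {gsub(/.*=/,"",$NF)} $NF>=dtmin{sum+=$1} END{print sum}'
    # prints 1500
    ```

    This only works because the partition values sort lexicographically in date order; for non-ISO date formats the comparison would need to normalize them first.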