
How to release heap memory on Apache Drill once the query is complete?


The problem is quite simple: every time I run a query on Drill, heap memory keeps accumulating. My heap is 7 GB, but it is never freed. Every 15 minutes or so I have to kill Drill and start it again to clear the heap.

Current Config:

-) I am running Apache Drill on a single node. Queries are executed on Drill using the R package 'sergeant', and the target files are usually parquet files. The current OS is Windows 7 Enterprise.
-) We first build the query using src_drill and then use drl_con to execute it. Building the query and executing it as separate steps is an architecture choice, as we want the application to be able to switch between different query engines such as SQL, Hive, Spark, etc.

library(sergeant)

# setting up drill query, I do not use collect() here
ds <- src_drill("localhost")
query <- tbl(ds, "cp.`employee.json`")
# render the lazy query to SQL (plain assignment, so magrittr's %<>% is not needed)
query <- dbplyr::sql_render(query)


# using drill con to execute the query
drl_con <- drill_connection("localhost") 
Mapping <- drill_query(drl_con, query, .progress = FALSE)

##  # A tibble: 100 x 16
##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>          <chr>    <chr>         <chr>      <chr>    
##   1 1           Sheri No… Sheri      Nowmer    1           President      0        1             1961-08-26 1994-12-…
##   2 2           Derrick … Derrick    Whelply   2           VP Country Ma… 0        1             1915-07-03 1994-12-…
##   3 4           Michael … Michael    Spence    2           VP Country Ma… 0        1             1969-06-20 1998-01-…
##   4 5           Maya Gut… Maya       Gutierrez 2           VP Country Ma… 0        1             1951-05-10 1998-01-…
##   5 6           Roberta … Roberta    Damstra   3           VP Informatio… 0        2             1942-10-08 1994-12-…
##   6 7           Rebecca … Rebecca    Kanagaki  4           VP Human Reso… 0        3             1949-03-27 1994-12-…
##   7 8           Kim Brun… Kim        Brunner   11          Store Manager  9        11            1922-08-10 1998-01-…
##   8 9           Brenda B… Brenda     Blumberg  11          Store Manager  21       11            1979-06-23 1998-01-…
##   9 10          Darren S… Darren     Stanz     5           VP Finance     0        5             1949-08-26 1994-12-…
##  10 11          Jonathan… Jonathan   Murraiin  11          Store Manager  1        11            1967-06-20 1998-01-…
##  # … with 90 more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
##  #   marital_status <chr>, gender <chr>, management_role <chr>


Ideally, I would expect Drill to garbage-collect heap memory on its own after every query, but that is not happening.


Solution

  • Apache Drill has its own memory manager. In Task Manager the heap memory never appears to be released, but in the background Drill starts to reuse the heap once it is full.

    If you are running into memory issues, chances are you are exceeding one of the other memory parameters, such as the total memory allotted to a single query.

    Recycling of heap memory is not something you should be worried about. For more details, refer to: https://books.google.com.au/books?id=-Tp7DwAAQBAJ&printsec=frontcover&dq=apache+drill+nook&hl=en&sa=X&ved=0ahUKEwil7LeJuPzkAhXKZSsKHUDoBw4Q6AEIKjAA#v=onepage&q&f=false
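To see what Drill itself reports, rather than what Task Manager shows, one option is to query Drill's sys.memory system table from R between queries. The snippet below is only a rough sketch: heap_usage and drill_heap_usage are hypothetical helper names, and it assumes a Drillbit on localhost, that heap_current and heap_max are reported in bytes, and that the values may arrive as character strings over the REST interface. The option name planner.memory.max_query_memory_per_node is the per-query limit as I understand it from the Drill docs, so check it against your version.

```r
# Hypothetical helper: summarise one or more rows of Drill's sys.memory table.
# Values are coerced with as.numeric() because they may come back as strings.
heap_usage <- function(mem) {
  current <- as.numeric(mem$heap_current)
  max_heap <- as.numeric(mem$heap_max)
  data.frame(
    heap_current_mb = current / 1024^2,
    heap_max_mb     = max_heap / 1024^2,
    heap_used_pct   = round(100 * current / max_heap, 1)
  )
}

# Hypothetical helper: fetch the figures from a running Drillbit.
# Needs a live connection created with sergeant::drill_connection().
drill_heap_usage <- function(con) {
  mem <- sergeant::drill_query(con, "SELECT * FROM sys.memory", .progress = FALSE)
  heap_usage(mem)
}

# Usage (requires Drill to be running, so left commented out):
# drl_con <- sergeant::drill_connection("localhost")
# drill_heap_usage(drl_con)
#
# Look up the per-query memory limit mentioned above:
# sergeant::drill_query(
#   drl_con,
#   "SELECT * FROM sys.options WHERE name = 'planner.memory.max_query_memory_per_node'"
# )
```

If heap_used_pct climbs and then drops back on its own across queries, the heap is being reused as described above, and any failures are more likely to come from one of the other limits.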