Search code examples
performancehadoopclouderaimpala

Slow performance on Impala query using Group By and Like


We are testing Apache Impala and have noticed that using GROUP BY and LIKE together works very slowly -- separate queries work much faster. Here are two examples:

# 1.37s 1.08s 1.35s

SELECT * FROM hive.default.pcopy1B where 
     (lower("by")  like '%part%' and lower("by")  like '%and%' and lower("by")  like '%the%') 
  or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') 
  or (lower(url)   like '%part%' and lower(url)   like '%and%' and lower(url)   like '%the%') 
  or (lower(text)  like '%part%' and lower(text)  like '%and%' and lower(text)  like '%the%') 
limit 100;

# 156.64s 155.63s

select "by", type, ranking, count(*) from pcopy where 
     (lower("by")  like '%part%' and lower("by")  like '%and%' and lower("by")  like '%the%') 
  or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') 
  or (lower(url)   like '%part%' and lower(url)   like '%and%' and lower(url)   like '%the%') 
  or (lower(text)  like '%part%' and lower(text)  like '%and%' and lower(text)  like '%the%') 
group by "by", type, ranking 
order by 4 desc limit 10;

Why does this issue occur, and are there any workarounds?


Solution

  • There is a basic difference between two queries.

    1st Query

    Main points:

    • Only 100 rows are selected.
    • As soon as process gets 100 rows which satisfies provided WHERE clause, it is marked as completed and 100 records will be returned.
    • There will be only 1 mapper step. Number of mapper will be dependent on your data size.

    2nd Query

    Main points:

    • Only 10 rows are selected.
    • Even though only 10 rows are selected, the process needs to scan full data in order to generate results based on GROUP BY clause.
    • There should be 3 mapper-reducer steps. Number of mapper-reducer on each step will dependent on data size.
      • 1st MP will read the data and apply WHERE clause
      • 2nd MR will be for GROUP BY clause.
      • 3rd MR will be for ORDER BY clause.

    So the queries provided by you might look similar but they are totally different and solves all together different purpose.