Search code examples
sqlcsvself-joinwindows-subsystem-for-linuxapache-drill

Self-join fails with where-clause


I am using apache drill to sum YTD (year to date) in a CSV-file using a self-referential join. The (shortened) query is

select
  ... fields from table a ...
  a.PeriodAmount,
  sum(cast(b.PeriodAmount as dec(18,3))) as YTDAmount
from dfs.`/home/foo/data/a.csv` a
  left join dfs.`/home/foo/data/a.csv` b
on
  ... join-conditions ...
*** where a.Year = '2018' ***
group by
  ... group-conditions ...
order by
  ... order-conditions ...
;

The query works without the where-clause. When the where-clause is included on the same dataset I get the following error:

Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join
[Error Id: b62e6b63-eda7-4a52-8f95-2499a1f5c278 on foo:31010] (state=,code=0)

I can circumvent the error by removing the where-clause and perform a subquery instead:

from (select * from dfs.`/home/foo/data/a.csv` where Year = '2017') a
  from (select * from dfs.`/home/foo/data/a.csv` where Year = '2017') b

But I am not sure that this is the proper approach. It makes the query more prone to errors since the same condition must be applied to more than one subquery rather than have it as a where-clause where it naturally belongs.

Can this self-join be rewritten so the where-clause is maintained?

This is on ubuntu 16.04 using WSL on win10 and apache drill is ver. 1.13.

Complete (working on drill) query:

select
  a.Dep_id,
  a.Dep,
  substr(a.Post_id, 1, 4) as Kap,
  a.Post_id,
  substr(a.Post_id, 5, 2) as Post,
  a.Art_id,
  a.Art,
  a.V_id,
  a.Reg,
  a.Dep_V_id,
  a.Dep_V,
  concat(substr(a.Periode, 1, 4), '-', substr(a.Periode, 5, 2), '-15') as PeriodDate,
  a.Period,
  a.Year,
  a.PeriodAmount,
  sum(cast(b.PeriodAmount as dec(18,3))) as YTDAmount
from dfs.`/home/foo/data/a.csv` a
  left join dfs.`/home/foo/data/a.csv` b
on
  a.Dep_id = b.Dep_id
  and a.Post_id = b.Post_id
  and a.Post_id is not null
  and a.Art_id = b.Art_id
  and a.V_id = b.V_id
  and a.Reg = b.Reg
  and a.Dep_V_id = b.Dep_V_id
  and a.Dep_id = b.Dep_id
  and b.Period <= a.Period
  and a.Year = b.Year
  and a.Post_id = b.Post_id
  and a.Art_id = b.Art_id
where a.Year in ('2018') and b.Year in (a.Year)
group by
  a.Dep_id,
  a.Dep,
  a.Post_id,
  a.Art_id,
  a.Art,
  a.V_id,
  a.Reg,
  a.Dep_V_id,
  a.Dep_V,
  a.Dep_id,
  a.Period,
  a.Year,
  a.PeriodAmount
order by
  a.Year,
  a.Dep_id,
  a.Post_id,
  a.Art_id,
  a.V_id,
  a.Reg,
  a.Dep_V_id,
  a.Dep_id,
  a.Period,
  a.PeriodAmount
;

Solution

  • I haven’t worked with querying csv files like this so this is more of a suggestion to try out.

    What about completing the where clause for both a and b to help the compiler like this

    WHERE a.Year = ‘2018’ AND b.Year = ‘2018’
    

    Or

    WHERE a.Year = ‘2018’ AND b.Year = a.Year