Search code examples
sqlamazon-web-servicesprestoamazon-athenatrino

Athena/Presto | Can't match ID row on self join


I'm trying to get the bi-grams on a string column.

I've followed the approach here but Athena/Presto is giving me errors at the final steps.

Source code so far

with word_list as (
    SELECT 
      transaction_id, 
      words, 
      n, 
      regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
      f70_remittance_info
    FROM exploration_transaction
    cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
    where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
    and f70_remittance_info is not null
    limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2 
on wl1.transaction_id = wl2.transaction_id

The specific issue I'm having is on the very last line, when I try to self join the transaction ids - it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1 (the position on the array) which is useless if I can't constrain it to a same id.

Athena doesn't support the ngrams function by presto, so I'm left with this approach.

Any clues why this isn't working? Thanks!


Solution

  • This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.

    Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.

    One solution would be to add order by transacdtion_id in the subquery.