I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line, when I try to self join the transaction ids - it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1
(the position on the array) which is useless if I can't constrain it to a same id.
Athena doesn't support the ngrams function by presto, so I'm left with this approach.
Any clues why this isn't working? Thanks!
This is speculation. But I note that your CTE is using limit
with no order by
. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transacdtion_id
in the subquery.