I was expecting both the records to return in case of following query. But I got back only second record.
WITH students_results(student_id, result) AS (VALUES
(1, '️सिनेमा' ),
(2, 'सिनेमा'))
SELECT
student_id,
result
FROM students_results
where result = 'सिनेमा'
The first record has a non printing unicode at the beginning \ufe0f
How do I find and remove such characters from data?
This is how you find such data.
WITH students_results(student_id, result) AS (VALUES
(1, '️सिनेमा'),
(2, 'सिनेमा')
)
SELECT student_id
FROM students_results
WHERE char2hexint(result) like '%FE0F%'
As for removing it, this is not quite what Athena is for. As you will probably be aware, it's an analytics, not an ETL service. I would advise to cleanse your data using AWS Glue, or even locally if it's applicable to your use case.