Search code examples
unicodeamazon-athenaprestotrino

Find and remove non-printing characters from data


I was expecting both the records to return in case of following query. But I got back only second record.

WITH students_results(student_id, result) AS (VALUES
(1,  '️सिनेमा'  ),
(2, 'सिनेमा'))
SELECT
    student_id,
    result
FROM students_results
where result = 'सिनेमा'

The first record has a non printing unicode at the beginning \ufe0f

How do I find and remove such characters from data?


Solution

  • This is how you find such data.

    WITH students_results(student_id, result) AS (VALUES
     (1, '️सिनेमा'),
     (2, 'सिनेमा')
    )
    SELECT student_id
    FROM students_results
    WHERE char2hexint(result) like '%FE0F%'
    

    As for removing it, this is not quite what Athena is for. As you will probably be aware, it's an analytics, not an ETL service. I would advise to cleanse your data using AWS Glue, or even locally if it's applicable to your use case.