Search code examples
sqlsnowflake-cloud-data-platformdata-cleaning

Cleaning Varchar data in SQL Snowflake


I'm trying to clean some varchar data on Snowflake and having some issues on this column - I'd like all data where information is missing to display as null, rather than eg. 'unknown'. The data looks like this:

+---------------------------+
| Entity                    |
+---------------------------+
| Walgreens                 |
+---------------------------+
| Apple                     |
+---------------------------+
| Microsoft                 |
+---------------------------+
| 2018 Quora hack           |
+---------------------------+
| Unknown government agency |
+---------------------------+
| Unknown                   |
+---------------------------+

And I'd like to standardised it, either by changing the original column or adding a revised ones, so it looks like this:

+-----------+
| Entity    |
+-----------+
| Walgreens |
+-----------+
| Apple     |
+-----------+
| Microsoft |
+-----------+
| Quora     |
+-----------+
| null      |
+-----------+
| null      |
+-----------+

Here's what I've tried so far. The plan was to find something that would work for the 'Unknown' bits of data and then apply it to more specific cases like simplifying the '2018 Quora hack.'

1

select *
from data_breaches
order by case when "Entity" like '%nknown%'
then NULL else "Entity" end

This returned the data, but put entities which said 'Unknown' in them at the end of the table and didn't change them to null

2

select "Sector", "Records Number", "Method"
if "Entity" IN('Unknown'), NULL, "Entity") as Enclean
from data_breaches

Returned this error: Syntax error: unexpected '"Entity"'. (line 2)

I think maybe Snowflake doesn't support this syntax?

3

select "Year", "Records", "Organization type","Method"
iff("Entity" like '%nknown', NULL,"Entity")
from data_breaches

Returned this error: Syntax error: unexpected '('. (line 2)


Solution

  • Using ILIKE and CASE expression to handle Unknown inside column:

    SELECT CASE WHEN NOT Entity ILIKE '%Unknown%' THEN Entity END AS Entity
    FROM data_breaches;