Search code examples
databricksdata-cleaningazure-databricksdelta-lakedatabricks-sql

How to Check Which Record is non-numeric in a String Column in Delta Table


I am working on Delta table using Databricks on Azure.

The Delta table contains about 100 million records with many columns. One column data type of which is String, but containing large numbers (BIGINT). There are just some records that have some random non-numeric values.

Question: Is there any way to Select these non-numeric records from this table via Databricks SQL notebook?

Data Example:

A B C D E ...
'838472910' ... ... ... ... ...
'392018' ... ... ... ... ...
'10293849021' ... ... ... ... ...
'NULL' ... ... ... ... ...
'9384038' ... ... ... ... ...
'21314' ... ... ... ... ...
'UNMAPPED' ... ... ... ... ...
'3840594739' ... ... ... ... ...
'UNIDENTIFIED' ... ... ... ... ...
'29380072' ... ... ... ... ...
'592812012' ... ... ... ... ...
'8432178930' ... ... ... ... ...
'EMPTYVALUE' ... ... ... ... ...

Solution

  • There are several possibilities for that:

    1. Try to cast value to bigint, and select rows where result of cast is nil:
    select * from test where cast(data as bigint) is null
    
    1. Use regexp operator (doc) to perform check:
    select * from test where data not regexp '^[0-9]+$'
    

    Both approaches are giving the same result, although I think that first one would be a bit faster (although it needs testing to confirm that).