
SQL/BigQuery text classification


I need to implement simple text classification using regex. I thought of using a simple CASE WHEN statement, but rather than stopping at the first condition that is met, I want to evaluate all of the conditions.

For example,

with `table` as(
SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
)
SELECT
  CASE
    WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
    WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
    WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
  END as topic,
  text
FROM `table`

With this query, the text is classified only as AI, because CASE returns the result of the first condition that is met. It should instead be classified as AI, Engineering and Deep Learning, either in an array or in three separate rows, because all three conditions are met.

How can I classify the text by applying all of the regex conditions?


Solution

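  • I feel the query below is the most generic and reusable solution (BigQuery Standard SQL). The terms and topics live in a lookup CTE, a single alternation pattern is built from the terms, and every match found in the text is joined back to its topic: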

    #standardSQL
    with `table` as(
    select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
    ), classification as (
      -- lookup table: lower-case search term -> topic label
      select 'ai' term, 'AI' topic union all
      select 'computational power', 'Engineering' union all
      select 'deep learning', 'Deep Learning'
    ), pattern as (
      -- one alternation pattern, e.g. (?i)ai|computational power|deep learning
      select r'(?i)' || string_agg(term, '|') as regexp_pattern
      from classification
    )
    select
      array_to_string(array(
        -- text is lower-cased so each match equals a term and can be joined to its topic
        select distinct topic
        from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
        join classification using(term)
      ), ', ') topics,
      text
    from `table`, pattern
    

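    With output: a single row whose topics column contains AI, Engineering and Deep Learning (in no guaranteed order, since the subquery has no ORDER BY), next to the original text.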
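
    The question also asks for one row per topic. A minimal sketch of that variant, assuming the same term/topic lookup, is to cross join the texts with the lookup and keep only the pairs where the term matches. The `docs` CTE and its sample sentence are hypothetical stand-ins for the question's `table`. The `\b` word boundaries are an addition that is not in the query above, and they assume the terms contain no regex metacharacters.

    ```sql
    -- Sketch only: one output row per matching topic.
    -- `docs` and its sample text are hypothetical stand-ins for `table`.
    WITH docs AS (
      SELECT 'AI jobs need computational power for Deep Learning.' AS text
    ), classification AS (
      SELECT 'ai' AS term, 'AI' AS topic UNION ALL
      SELECT 'computational power', 'Engineering' UNION ALL
      SELECT 'deep learning', 'Deep Learning'
    )
    SELECT
      c.topic,
      d.text
    FROM docs AS d
    CROSS JOIN classification AS c
    -- Assumption: \b keeps 'ai' from matching inside longer words such as 'training'.
    WHERE REGEXP_CONTAINS(d.text, CONCAT(r'(?i)\b', c.term, r'\b'))
    ```

    For the sample sentence this should yield one row each for AI, Engineering and Deep Learning. If an array per text is preferred over separate rows, the same join can be grouped by `text` with `ARRAY_AGG(c.topic)`.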