Search code examples
apache-flinkflink-sql

Flink-SQL: Extract values from nested objects


i'm using Flink SQL and the following scheme shows my source data (belonging to some Twitter data):

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` ROW(
            `text` STRING,
            `indices` INT ARRAY
        ) ARRAY,
        `urls` ROW(
            `indices` INT ARRAY,
            `url` STRING,
            `display_url` STRING,
            `expanded_url` STRING
        ) ARRAY,
        `user_mentions` ROW(
            `screen_name` STRING,
            `name` STRING,
            `id` BIGINT
        ) ARRAY
    )
)
WITH (...);

I want to get only the hashtags in a collection. Therefore i have to map the collection of constructed objects (ROW) to an array of STRING.

Like this scheme:

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` STRING ARRAY,
        `urls` STRING ARRAY,
        `user_mentions` STRING ARRAY
    )
)
WITH (...);

How can i achieve this with Flink-SQL? Maybe built-in functions (JSON-functions?) or own UDF or do i have to write a DataStream Job?

Thanks in advance.


Solution

  • The SQL command UNNEST helps in this case. It is like EXPLODE in Spark.

    You can solve it by creating a new row for each hashtag in the hashtags array:

    SELECT hashtag, index
    FROM twitter_raw 
    CROSS JOIN UNNEST(hashtags) AS t (hashtag, index)