Search code examples
snowflake-cloud-data-platformjaro-winklermatch-recognize

FIRST() and LAST() for MATCH_RECOGNIZE


We are analyzing the streaming twitter data to find users who are posting similar (almost same) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and the LAST() values correctly. Here is sample dataset:

enter image description here

I am using the following Query:

SELECT 
  USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
    
  ONE ROW PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    
);

This correct identify the users that are posting same tweets over an over: enter image description here

But I am not able to get the first tweet and the last tweet in the matching sequence.


Solution

  • There are multiple things at play.

    The first is you only "have one row trigging a match" so first and last are the second row of you data. This can be seen by changing to ALL ROWS PER MATCH

    with tweets(userid, tweetid, tweet) as (
        select * from values
        ('elena', 1, 'aaa'),
        ('elena', 2, 'aaaa')
    )
    SELECT 
    *
    FROM TWEETS
    MATCH_RECOGNIZE(
      PARTITION BY USERID
      ORDER BY TWEETID ASC
      MEASURES
        match_number() as match_number,
        FIRST(TWEET) AS FIRST_TWEET,
        LAST(TWEET) AS LAST_TWEET,
        FIRST(TWEETID) AS FIRST_TWEET_ID,
        LAST(TWEETID) AS LAST_TWEET_ID,
        COUNT(*) AS NUM_OF_TWEETS
      ALL ROWS PER MATCH
      PATTERN (SIMILAR+)
      DEFINE
        SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    );
    
    USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
    elena 2 aaaa 1 aaaa aaaa 2 2 1

    if you change to say a match that catches the first value and the lag values:

      ALL ROWS PER MATCH
      PATTERN (SIMILAR_before SIMILAR_after+)
      DEFINE
        SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
        SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    

    you now match both the first and latter rows..

    USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
    elena 1 aaa 1 aaa aaa 1 1 1
    elena 2 aaaa 1 aaa aaaa 1 2 2

    now if we expand our test a little bit more with four rows of data:

    with tweets(userid, tweetid, tweet) as (
        select * from values
        ('elena', 1, 'aaa'),
        ('elena', 2, 'aaaa'),
        ('elena', 3, 'aaa'),
        ('elena', 4, 'aaaa')
    )
    
    USERID TWEETID TWEET MATCH_NUMBER FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID NUM_OF_TWEETS
    elena 1 aaa 1 aaa aaa 1 1 1
    elena 2 aaaa 1 aaa aaaa 1 2 2
    elena 3 aaa 1 aaa aaa 1 3 3
    elena 4 aaaa 1 aaa aaaa 1 4 4

    we see those values are not double registering..

    BUT we also see the first ID is correct for all rows, but the last is within the scope of the current matched row, so not after all matches as you are hoping.

    If we flip back to one row per match we do how ever get the results we are expecting.

    with tweets(userid, tweetid, tweet) as (
        select * from values
        ('elena', 1, 'aaa'),
        ('elena', 2, 'aaaa'),
        ('scott', 3, 'aaaa'),
        ('eva', 4, 'bbbb'),
        ('eva', 5, 'bbbbb'),
        ('amy', 4, 'eeee'),
        ('amy', 5, 'zzzz')
    )
    SELECT 
     USERID
      , NUM_OF_TWEETS
      , FIRST_TWEET
      , LAST_TWEET
      , FIRST_TWEET_ID
      , LAST_TWEET_ID
    FROM TWEETS
    MATCH_RECOGNIZE(
      PARTITION BY USERID
      ORDER BY TWEETID ASC
      MEASURES
        match_number() as match_number,
        FIRST(TWEET) AS FIRST_TWEET,
        LAST(TWEET) AS LAST_TWEET,
        FIRST(TWEETID) AS FIRST_TWEET_ID,
        LAST(TWEETID) AS LAST_TWEET_ID,
        COUNT(*) AS NUM_OF_TWEETS
      ONE ROW PER MATCH
      PATTERN (SIMILAR_before SIMILAR_after+)
      DEFINE
        SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
        SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    );
    
    USERID NUM_OF_TWEETS FIRST_TWEET LAST_TWEET FIRST_TWEET_ID LAST_TWEET_ID
    elena 2 aaa aaaa 1 2
    eva 2 bbbb bbbbb 4 5