Search code examples
sqlapache-spark-sqlansi-sql

How to query text to find the longest prefix strings in SQL?


I am using sparq sql. Let's say this is a snapshot of my big table:

ups store
ups store austin
ups store chicago
ups store bern
walmart
target

How can I find the longest prefix for the above data in sql? That is:

 ups store
 walmart
 target

I already have a Java program to do this but I have a large file, now my question is if this could be reasonably done in SQL?

How about the following more complicated scnenario? (I can live without this but nice to have it if possible)

ups store austin
ups store chicago
ups store bern
walmart
target

and that would return [ups store, walmart, target].


Solution

  • Assuming you're free to create another table that simply has a list of ascending integers from zero up to the size of the longest possible string then the following should do the job using only ANSI SQL:

    SELECT
      id,
      SUBSTRING(name, 1, CASE WHEN number = 0 THEN LENGTH(name) ELSE number END) AS prefix
    FROM
     -- Join all places to all possible substring lengths.
     (SELECT *
      FROM places p
      CROSS JOIN lengths l) subq
    -- If number is zero then no prefix match was found elsewhere
    -- (from the question it looked like you wanted to include these)
    WHERE (subq.number = 0 OR
           -- Look for prefix match elsewhere
           EXISTS (SELECT * FROM places p
                   WHERE SUBSTRING(p.name FROM 1 FOR subq.number)
                         = SUBSTRING(subq.name FROM 1 FOR subq.number)
                     AND p.id <> subq.id))
      -- Include as a prefix match if the whole string is being used
      AND (subq.number = LENGTH(name)
           -- Don't include trailing spaces in a prefix
           OR (SUBSTRING(subq.name, subq.number, 1) <> ' '
               -- Only include the longest prefix match 
               AND NOT EXISTS (SELECT * FROM places p 
                               WHERE SUBSTRING(p.name FROM 1 FOR subq.number + 1)
                                     = SUBSTRING(subq.name FROM 1 FOR subq.number + 1)
                                 AND p.id <> subq.id)))
    ORDER BY id;
    

    Live demo: http://rextester.com/XPNRP24390

    The second aspect is that what if we have (ups store austin, ups store chicago). can we use SQL to extract the 'ups store' off of it.

    This should be simply a case of using SUBSTRING in a similar way to above, e.g:

    SELECT SUBSTRING(name,
                     LENGTH('ups store ') + 1,
                     LENGTH(name) - LENGTH('ups store '))
    FROM places
    WHERE SUBSTRING(name,
                    1,
                    LENGTH('ups store ')) = 'ups store ';