Search code examples
sqlpostgresqlset-returning-functionsdataexplorer

Get each <tag> in String - stackexchange database


Mockup code for my problem:

SELECT Id FROM Tags WHERE TagName IN '<osx><keyboard><security><screen-lock>'

The problem in detail

I am trying to get tags used in 2011 from apple.stackexchange data. (this query)

As you can see, tags in tag changes are stored as plain text in the Text field. example output with stackexchange tags

<tag1><tag2><tag3>
<osx><keyboard><security><screen-lock>

How can I create a unique list of the tags, to look them up in the Tags table, instead of this hardcoded version:

SELECT * FROM Tags
  WHERE TagName = 'osx' 
     OR TagName = 'keyboard' 
     OR TagName = 'security'

Here is a interactive example.

Stackexchange uses T-SQL, my local copy is running under postgresql using Postgres app version 9.4.5.0.


Solution

  • Assuming this table definition:

    CREATE TABLE posthistory(post_id int PRIMARY KEY, tags text);
    

    Depending on what you want exactly:

    To convert the string to an array, trim leading and trailing '<>', then treat '><' as separator:

    SELECT *, string_to_array(trim(tags, '><'), '><') AS tag_arr
    FROM   posthistory;
    

    To get list of unique tags for whole table (I guess you want this):

    SELECT DISTINCT tag
    FROM   posthistory, unnest(string_to_array(trim(tags, '><'), '><')) tag;
    

    The implicit LATERAL join requires Postgres 9.3 or later.

    This should be substantially faster than using regular expressions. If you want to try regexp, use regexp_split_to_table() instead of regexp_split_to_array() followed by unnest() like suggested in another answer:

    SELECT DISTINCT tag
    FROM   posthistory, regexp_split_to_table(trim(tags, '><'), '><') tag;
    

    Also with implicit LATERAL join. Related:

    To search for particular tags:

    SELECT *
    FROM   posthistory
    WHERE  tags LIKE '%<security>%'
    AND    tags LIKE '%<osx>%';
    

    SQL Fiddle.

    Applied to your search in T-SQL in our data explorer:

    SELECT TOP 100
           PostId, UserId, Text AS Tags FROM PostHistory
    WHERE  year(CreationDate) = 2011
    AND    PostHistoryTypeId IN (3  -- initial tags
                               , 6  -- edit tags
                               , 9) -- rollback tags
    AND    Text LIKE ('%<' + ##TagName:String?postgresql## + '>%');
    

    (T-SQL syntax uses the non-standard + instead of ||.)
    https://data.stackexchange.com/apple/query/edit/417055