Mockup code for my problem:
SELECT Id FROM Tags WHERE TagName IN '<osx><keyboard><security><screen-lock>'
I am trying to get tags used in 2011 from apple.stackexchange data. (this query)
As you can see, tags in tag changes are stored as plain text in the Text
field.
<tag1><tag2><tag3>
<osx><keyboard><security><screen-lock>
How can I create a unique list of the tags, to look them up in the Tags
table, instead of this hardcoded version:
SELECT * FROM Tags
WHERE TagName = 'osx'
OR TagName = 'keyboard'
OR TagName = 'security'
Here is a interactive example.
Stackexchange uses T-SQL, my local copy is running under postgresql using Postgres app version 9.4.5.0.
Assuming this table definition:
CREATE TABLE posthistory(post_id int PRIMARY KEY, tags text);
Depending on what you want exactly:
To convert the string to an array, trim leading and trailing '<>', then treat '><' as separator:
SELECT *, string_to_array(trim(tags, '><'), '><') AS tag_arr
FROM posthistory;
To get list of unique tags for whole table (I guess you want this):
SELECT DISTINCT tag
FROM posthistory, unnest(string_to_array(trim(tags, '><'), '><')) tag;
The implicit LATERAL
join requires Postgres 9.3 or later.
This should be substantially faster than using regular expressions. If you want to try regexp, use regexp_split_to_table()
instead of regexp_split_to_array()
followed by unnest()
like suggested in another answer:
SELECT DISTINCT tag
FROM posthistory, regexp_split_to_table(trim(tags, '><'), '><') tag;
Also with implicit LATERAL
join. Related:
To search for particular tags:
SELECT *
FROM posthistory
WHERE tags LIKE '%<security>%'
AND tags LIKE '%<osx>%';
Applied to your search in T-SQL in our data explorer:
SELECT TOP 100
PostId, UserId, Text AS Tags FROM PostHistory
WHERE year(CreationDate) = 2011
AND PostHistoryTypeId IN (3 -- initial tags
, 6 -- edit tags
, 9) -- rollback tags
AND Text LIKE ('%<' + ##TagName:String?postgresql## + '>%');
(T-SQL syntax uses the non-standard +
instead of ||
.)
https://data.stackexchange.com/apple/query/edit/417055