Search code examples
sqlsql-serversql-server-2000

How to extract file names from a field that contains html content in sql server?


We have a cms system that write html content blocks into sql server database. I know the table name and field name where these html content blocks reside. Some html contains links () to pdf files. Here is a fragment:

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract pdf file names from all such html content blocks. At the end I need to get a list:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

of all pdf file names from that field.

Any help is appreciated. Thanks.

UPDATE

I have received many replies, thank you so much, but I forgot to mention that we are still using SQL Server 2000 here. So, this had to be done using SQL 2000 SQL.


Solution

  • Well it's not pretty but this works using standard Transact-SQL:

    SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
                THEN SUBSTRING(
                         html,
                         CHARINDEX('.pdf', html) -
                         PATINDEX(
                             '%["/]%',
                             REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                         PATINDEX(
                             '%["/]%',
                             REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
                ELSE NULL
           END AS filename
    FROM mytable
    

    Could expand the list of delimiting characters before the filename from ["/] (which matches either a quotation mark or slash) if you like.

    See SQL Fiddle demo